DBMS2

Choices in data management and analysis

Kafka and more

Mon, 2016-01-25 05:28

In a companion introduction-to-Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:

  • The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
  • “Schema Management”
    • Is used to define fields and so on.
    • Is not equivalent to what is ordinarily meant by schema validation, in that …
    • … it allows schemas to change, but puts constraints on which changes are allowed.
    • Is done in plug-ins that live with the producer or consumer of data.
    • Is based on the Hadoop-oriented file format Avro. (A sketch of schema registration follows this list.)
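
To make the Avro point concrete, here is a minimal sketch of registering a schema and constraining how it may evolve, using the schema registry’s REST interface from Python. The subject name and server address are illustrative, and the exact REST details may vary by registry version.

    import json
    import requests  # third-party HTTP library

    REGISTRY = "http://localhost:8081"  # illustrative registry address

    # An Avro schema defining the fields of messages on a "pageviews" topic.
    schema = {
        "type": "record",
        "name": "PageView",
        "fields": [
            {"name": "user_id", "type": "long"},
            {"name": "url", "type": "string"},
        ],
    }

    # Register the schema under a subject; the registry assigns it a version.
    requests.post(
        REGISTRY + "/subjects/pageviews-value/versions",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        data=json.dumps({"schema": json.dumps(schema)}),
    )

    # Constrain evolution: only backward-compatible changes are accepted,
    # e.g. adding a field with a default is fine, removing a used one is not.
    requests.put(
        REGISTRY + "/config/pageviews-value",
        data=json.dumps({"compatibility": "BACKWARD"}),
    )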

Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products. 

  • Per Confluent/Kafka honcho Jay Kreps, the companion is generally Spark Streaming, Storm or Samza, in declining order of popularity, with Samza running a distant third.
  • Jay estimates that there’s such a companion product at around 50% of Kafka installations.
  • Conversely, Jay estimates that around 80% of Spark Streaming, Storm or Samza users also use Kafka. On the one hand, that sounds high to me; on the other, I can’t quickly name a counterexample, unless Storm originator Twitter is one such.
  • Jay’s views on the Storm/Spark comparison include:
    • Storm is more mature than Spark Streaming, which makes sense given their histories.
    • Storm’s distributed processing capabilities are more questionable than Spark Streaming’s.
    • Spark Streaming is generally used by folks in the heavily overlapping categories of:
      • Spark users.
      • Analytics types.
      • People who need to share stuff between the batch and stream processing worlds.
    • Storm is generally used by people coding up more operational apps.

If we recognize that Jay’s interests are obviously streaming-centric, this distinction maps pretty well to the three use cases Cloudera recently called out.

Complicating this discussion further is Confluent 2.1, which is expected late this quarter. Confluent 2.1 will include, among other things, a stream processing layer that works differently from any of the alternatives I cited, in that:

  • It’s a library running in client applications that can interrogate the core Kafka server, rather than …
  • … a separate thing running on a separate cluster.

The library will do joins, aggregations and so on, while relying on core Kafka for information about process health and the like. Jay sees this as more of a competitor to Storm in operational use cases than to Spark Streaming in analytic ones.
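
To illustrate the architectural difference (and only that; the actual API hadn’t shipped at this writing, so everything below is hypothetical), here is a sketch of what an embedded stream-processing library might look like. The processing logic lives inside your own application process, with core Kafka supplying the data, the partition assignments and the health information.

    # Hypothetical API, for illustration only; the real Confluent 2.1
    # library may look nothing like this.
    from hypothetical_streams import StreamBuilder  # invented module name

    def enrich(order):
        order["amount_usd"] = order["amount"] * order.get("fx_rate", 1.0)
        return order

    builder = StreamBuilder(bootstrap_servers="localhost:9092")

    # Joins/aggregations run inside this application process (and any
    # identically configured copies of it); there is no separate cluster.
    totals = (builder.stream("orders")
              .map(enrich)
              .group_by(lambda o: o["region"])
              .aggregate(lambda acc, o: acc + o["amount_usd"], initial=0.0))
    totals.to("order-totals-by-region")
    builder.start()  # runs in the client application, not on a separate cluster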

We didn’t discuss other Confluent 2.1 features much, and frankly they all sounded to me like items from the “You mean you didn’t have that already??” list any young product has.


Kafka and Confluent

Mon, 2016-01-25 05:27

For starters:

  • Kafka has gotten considerable attention and adoption in streaming.
  • Kafka is open source, out of LinkedIn.
  • Folks who built it there, led by Jay Kreps, now have a company called Confluent.
  • Confluent seems to be pursuing a fairly standard open source business model around Kafka.
  • Confluent seems to be in the low to mid teens in paying customers.
  • Confluent believes 1000s of Kafka clusters are in production.
  • Confluent reports 40 employees and $31 million raised.

At its core Kafka is very simple:

  • Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
  • Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model. (A minimal code sketch follows below.)
  • Kafka handles various issues of scaling, load balancing, fault tolerance and so on.

So it seems fair to say:

  • Kafka offers the benefits of hub vs. point-to-point connectivity.
  • Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)
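
Here is the minimal publish/subscribe sketch promised above, using the kafka-python client; the topic name and broker address are placeholders.

    from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

    # A producer publishes bytes to a named topic; Kafka doesn't care
    # about the format of what's inside.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("clickstream", key=b"user-42", value=b'{"url": "/home"}')
    producer.flush()

    # Any number of consumers can independently subscribe to the same topic.
    consumer = KafkaConsumer("clickstream", bootstrap_servers="localhost:9092")
    for message in consumer:
        print(message.key, message.value)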

Jay also views Kafka as something like a file system. Kafka doesn’t actually have a file-system-like interface for managing streams, but he acknowledges that as a need and presumably a roadmap item.

The most noteworthy technical point for me was that Kafka persists data, for reasons of buffering, fault-tolerance and the like. The duration of the persistence is configurable, and can be different for different feeds, with two main options (illustrated in the sketch after this list):

  • Guaranteed to have the last update of anything.
  • Complete for the past N days.
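
If I understand correctly, those two options correspond to per-topic settings: log compaction retains at least the last message for each key, while time-based retention keeps everything for a configured window. A sketch using kafka-python’s admin client (a newer addition to that library), with illustrative topic names:

    from kafka.admin import KafkaAdminClient, NewTopic  # kafka-python

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        # "Guaranteed to have the last update of anything": compaction
        # keeps at least the latest message per key, indefinitely.
        NewTopic("user-profiles", num_partitions=6, replication_factor=3,
                 topic_configs={"cleanup.policy": "compact"}),
        # "Complete for the past N days": time-based retention, here 7 days.
        NewTopic("clickstream", num_partitions=6, replication_factor=3,
                 topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)}),
    ])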

Jay thinks this is a major difference vs. messaging systems that have come before. As you might expect, given that data arrives in timestamp order and then hangs around for a while:

  • Kafka can offer strong guarantees of delivering data in the correct order.
  • Persisted data is automagically broken up into partitions.

Technical tidbits include:

  • Data is generally fresh to within 1.5 milliseconds.
  • 100s of MB/sec/server is claimed. I didn’t ask how big a server was.
  • LinkedIn runs >1 trillion messages/day through Kafka.
  • Others in that throughput range include but are not limited to Microsoft and Netflix.
  • A message is commonly 1 KB or less.
  • At a guesstimate, 50%ish of messages are in Avro. JSON is another frequent format.

Jay’s answer to any concern about performance overhead for current or future features is usually to point out that anything other than the most basic functionality:

  • Runs in different processes from core Kafka …
  • … if it doesn’t actually run on a different cluster.

For example, connectors have their own pools of processes.

I asked the natural open source question about who contributes what to the Apache Kafka project. Jay’s quick answers were:

  • Perhaps 80% of Kafka code comes from Confluent.
  • LinkedIn has contributed most of the rest.
  • However, as is typical in open source, the general community has contributed some connectors.
  • The general community also contributes “esoteric” bug fixes, which Jay regards as evidence that Kafka is in demanding production use.

Jay has a rather erudite and wry approach to naming and so on.

  • Kafka got its name because it was replacing something he regarded as Kafkaesque. OK.
  • Samza is an associated project that has something to do with transformations. Good name. (The central character of The Metamorphosis was Gregor Samsa, and the opening sentence of the story mentions a transformation.)
  • In his short book about logs, Jay has a picture caption “ETL in Ancient Greece. Not much has changed.” The picture appears to be of Sisyphus. I love it.
  • I still don’t know why he named a key-value store Voldemort. Perhaps it was something not to be spoken of.

What he and his team do not yet have is a clear name for their product category. Difficulties in naming include:

Confluent seems to be using “stream data platform” as a placeholder. As per the link above, I once suggested Data Stream Management System, or more concisely Datastream Manager. “Event”, “event stream” or “event series” could perhaps be mixed in as well. I don’t really have an opinion yet, and probably won’t until I’ve studied the space in a little more detail.

And on that note, I’ll end this post for reasons of length, and discuss Kafka-related technology separately.


Cloudera in the cloud(s)

Fri, 2016-01-22 01:46

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

  • Cloudera’s usual software, ported to run on the cloud platform(s).
  • Cloudera Director, which for example launches cloud instances.
  • Points of integration, e.g. taking information about security-oriented roles from the platform and feeding them to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

  • An API for job submission.
  • Support for spot and preemptible instances.
  • High availability.
  • Kerberos.
  • Some cluster repair.
  • Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting:

  • Shared-nothing analytic systems, RDBMS and Hadoop alike, run much better in the cloud than they used to.
  • Even so, it seems that the future of Hadoop in the cloud is to rely on object storage, such as Amazon S3.

That makes sense in part because:

  • The applications where shared nothing most drastically outshines object storage are probably the ones in which data can just be filtered from disk — spinning-rust or solid-state as the case may be — and processed in place.
  • By way of contrast, if data is being redistributed a lot then the shared nothing benefit applies to a much smaller fraction of the overall workload.
  • The latter group of apps are probably the harder ones to optimize for.

But while it makes sense, much of what’s hardest about the ports involves the move to object storage. The status of that is roughly:

  • Cloudera already has a lot of its software running on Amazon S3, with Impala/Parquet in beta.
  • Object storage integration for Windows Azure is “in progress”.
  • Object storage integration for Google GCP is “to be determined”.
  • Security for object storage — e.g. encryption — is a work in progress.
  • Cloudera Navigator for object storage is a roadmap item.

When I asked about particularly hard parts of porting to object storage, I got three specifics. Two of them sounded like challenges around having less detailed control, specifically in the areas of consistency model and capacity planning. The third I frankly didn’t understand,* namely the semantics of move operations, which are constant-time in HDFS but linear in data size on object stores.

*It’s rarely obvious to me why something is O(1) until it is explained to me.
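
For what it’s worth, the likely explanation is that object stores have no true rename: a “move” is a byte-for-byte copy to the new key followed by a delete of the old one, so it costs time proportional to the data, whereas an HDFS rename is a single metadata update at the NameNode. A sketch of the object-store version using boto3, with invented bucket and key names:

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")

    # S3 has no rename: "moving" an object means copying all of its bytes
    # to the new key and then deleting the original; O(size), not O(1).
    s3.copy_object(
        Bucket="my-warehouse",
        Key="tables/events/part-00000.parquet",
        CopySource={"Bucket": "my-warehouse", "Key": "tmp/part-00000.parquet"},
    )
    s3.delete_object(Bucket="my-warehouse", Key="tmp/part-00000.parquet")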

Naturally, we talked about competition, differentiation, adoption and all that stuff. Highlights included:

  • In general, Cloudera’s three big marketing messages these days can be summarized as “Fast”, “Easy”, and “Secure”.
  • Notwithstanding the differences as to which parts of the Cloudera stack run on premises, on Amazon AWS, on Microsoft Azure or on Google GCP, Cloudera thinks it’s important that its offering is the “same” on all platforms, which allows “hybrid” deployment.
  • In general, Cloudera still sees Hortonworks as a much bigger competitor than MapR or IBM.
  • Cloudera fondly believes that Cloudera Manager is a significant competitive advantage vs. Ambari. (This would presumably be part of the “Easy” claim.)
  • In particular, Cloudera asserts it has better troubleshooting/monitoring than the cloud alternatives do, because of superior drilldown into details.
  • Cloudera’s big competitor on the Amazon platform is Elastic MapReduce (EMR). Cloudera points out that EMR lacks various capabilities that are in the Cloudera stack. Of course, versions of these capabilities are sometimes found in other Amazon offerings, such as Redshift.
  • Cloudera’s big competitor on Azure is HDInsight. Cloudera sells against that via:
    • General Cloudera vs. Hortonworks distinctions.
    • “Hybrid”/portability.

Cloudera also offered a distinction among three types of workload:

  • ETL (Extract/Transform/Load) and “modeling” (by which Cloudera seems to mean predictive modeling).
    • Cloudera pitches this as batch work.
    • Cloudera tries to deposition competitors as being good mainly at these kinds of jobs.
    • This can be reasonably said to be the original sweet spot of Hadoop and MapReduce — which fits with Cloudera’s attempt to portray competitors as technical laggards. :)
    • Cloudera observes that these workloads tend to call for “transient” jobs. Lazier marketers might trot out the word “elasticity”.
  • BI (Business Intelligence) and “analytics”, by which Cloudera seems to mainly mean Impala and Spark.
  • “Application delivery”, by which Cloudera means operational stuff that can’t be allowed to go down. Presumably, this is a rough match to what I — and by now a lot of other folks as well — call short-request processing.

While I don’t agree with terminology that says modeling is not analytics, the basic distinction being drawn here makes considerable sense.


BI and quasi-DBMS

Thu, 2016-01-14 06:42

I’m on two overlapping posting kicks, namely “lessons from the past” and “stuff I keep saying so might as well also write down”. My recent piece on Oracle as the new IBM is an example of both themes. In this post, another example, I’d like to memorialize some points I keep making about business intelligence and other analytics. In particular:

  • BI relies on strong data access capabilities. This is always true. Duh.
  • Therefore, BI and other analytics vendors commonly reinvent the data management wheel. This trend ebbs and flows with technology cycles.

Similarly, BI has often been tied to data integration/ETL (Extract/Transform/Load) functionality.* But I won’t address that subject further at this time.

*In the Hadoop/Spark era, that’s even truer of other analytics than it is of BI.

My top historical examples include:

  • The 1970s analytic fourth-generation languages (RAMIS, NOMAD, FOCUS, et al.) commonly combined reporting and data management.
  • The best BI visualization technology of the 1980s, Executive Information Systems (EIS), was generally unsuccessful. The core reason was a lack of what we’d now call drilldown. Not coincidentally, EIS vendors — notably leader Comshare — didn’t do well at DBMS-like technology.
  • Business Objects, one of the pioneers of the modern BI product category, rose in large part on the strength of its “semantic layer” technology. (If you don’t know what that is, you can imagine it as a kind of virtual data warehouse modest enough in its ambitions to actually be workable.)
  • Cognos, the other pioneer of modern BI, depended on capabilities for which it needed a bundled MOLAP (Multidimensional OnLine Analytic Processing) engine.
  • But Cognos later stopped needing that engine, which underscores my point about technology ebbing and flowing.

I’m not as familiar with the details for MicroStrategy, but I do know that it generates famously complex SQL to compensate for the inadequacies of some DBMS. Paradoxically, that created performance challenges when MicroStrategy ran over more capable analytic DBMS, which in turn led at least Teradata to do special work to optimize MicroStrategy processing. Again, ebbs and flows.

More recent examples of serious DBMS-like processing in BI offerings may be found in QlikView, Zoomdata, Platfora, ClearStory, Metamarkets and others. That some of those are SaaS (Software as a Service) doesn’t undermine the general point, because in each case they have significant data processing technology that lies strictly between the visualization and data store layers.


Oracle as the new IBM — has a long decline started?

Thu, 2015-12-31 03:15

When I find myself making the same observation fairly frequently, that’s a good impetus to write a post based on it. And so this post is based on the thought that there are many analogies between:

  • Oracle and the Oracle DBMS.
  • IBM and the IBM mainframe.

And when you look at things that way, Oracle seems to be swimming against the tide.

Drilling down, there are basically three things that can seriously threaten Oracle’s market position:

  • Growth in apps of the sort for which Oracle’s RDBMS is not well-suited. Much of “Big Data” fits that description.
  • Outright, widespread replacement of Oracle’s application suites. This is the least of Oracle’s concerns at the moment, but could of course be a disaster in the long term.
  • Transition to “the cloud”. This trend amplifies the other two.

Oracle’s decline, if any, will be slow — but I think it has begun.

 

Oracle/IBM analogies

There’s a clear market lead in the core product category. IBM was dominant in mainframe computing. While not as dominant, Oracle is definitely a strong leader in high-end OLTP (OnLine Transaction Processing)/mixed-use RDBMS.

That market lead is even greater than it looks, because some of the strongest competitors deserve asterisks. Many of IBM’s mainframe competitors were “national champions” — Fujitsu and Hitachi in Japan, Bull in France and so on. Those were probably stronger competitors to IBM than the classic BUNCH companies (Burroughs, Univac, NCR, Control Data, Honeywell).

Similarly, Oracle’s strongest direct competitors are IBM DB2 and Microsoft SQL Server, each of which is sold primarily to customers loyal to the respective vendors’ full stacks. SAP is now trying to play a similar game.

The core product is stable, secure, richly featured, and generally very mature. Duh.

The core product is complicated to administer — which provides great job security for administrators. IBM had JCL (Job Control Language). Oracle has a whole lot of manual work overseeing indexes. In each case, there are many further examples of the point. Edit: A Twitter discussion suggests the specific issue with indexes has been long fixed.

Niche products can actually be more reliable than the big, super-complicated leader. Tandem NonStop computers were super-reliable. Simple, “embeddable” RDBMS — e.g. Progress or SQL Anywhere — in many cases just work. Still, if you want one system to run most of your workload 24×7, it’s natural to choose the category leader.

The category leader has a great “whole product” story. Here I’m using “whole product” in the sense popularized by Geoffrey Moore, to encompass ancillary products, professional services, training, and so on, from the vendor and third parties alike. There was a time when most serious packaged apps ran exclusively on IBM mainframes. Oracle doesn’t have quite the same dominance, but there are plenty of packaged apps for which it is the natural choice of engine.

Notwithstanding all the foregoing, there’s strong vulnerability to alternative product categories. IBM mainframes eventually were surpassed by UNIX boxes, which had grown up from the minicomputer and even workstation categories. Similarly, the Oracle DBMS has trouble against analytic RDBMS specialists, NoSQL, text search engines and more.

 

IBM’s fate, and Oracle’s

Given that background, what does it teach us about possible futures for Oracle? The golden age of the IBM mainframe lasted 25 or 30 years — 1965-1990 is a good way to think about it, although there’s a little wiggle room at both ends of the interval. Since then it’s been a fairly stagnant cash-cow business, in which a large minority or perhaps even small majority of IBM’s customers have remained intensely loyal, while others have aligned with other vendors.

Oracle’s DBMS business seems pretty stagnant now too. There’s no new on-premises challenger to Oracle now as strong as UNIX boxes were to IBM mainframes 20-25 years ago, but as noted above, traditional competitors are stronger in Oracle’s case than they were in IBM’s. Further, the transition to the cloud is a huge deal, currently in its early stages, and there’s no particular reason to think Oracle will hold any more share there than IBM did in the transition to UNIX.

Within its loyal customer base, IBM has been successful at selling a broad variety of new products (typically software) and services, often via acquired firms. Oracle, of course, has also extended its product lines immensely from RDBMS, to encompass “engineered systems” hardware, app server, apps, business intelligence and more. On the whole, this aspect of Oracle’s strategy is working well.

That said, in most respects Oracle is weaker at account control than peak IBM.

  • Oracle’s core competitors, IBM and Microsoft, are stronger than IBM’s were.
  • DB2 and SQL Server are much closer to Oracle compatibility than most mainframes were to IBM. (Amdahl is an obvious exception.) This is especially true as of the past 10-15 years, when it has become increasingly clear that reliance on stored procedures is a questionable programming practice. Edit: But please see the discussion below challenging this claim.
  • Oracle (the company) is widely hated, in a way that IBM generally wasn’t.
  • Oracle doesn’t dominate a data center the way hardware monopolist IBM did in a hardware-first era.

Above all, Oracle doesn’t have the “Trust us; we’ll make sure your IT works” story that IBM did. Appliances, aka “engineered systems”, are a step in that direction, but those are only — or at least mainly — to run Oracle software, which generally isn’t everything a customer has.

 

But think of the apps!

Oracle does have one area in which it has more account control power than IBM ever did — applications. If you run Oracle apps, you probably should be running the Oracle RDBMS and perhaps an Exadata rack as well. And perhaps you’ll use Oracle BI too, at least in use cases where you don’t prefer something that emphasizes a more modern UI.

As a practical matter, most enterprise app rip-and-replace happens in a few scenarios:

  • Merger/acquisition. An enterprise that winds up with different apps for the same functions may consolidate and throw the loser out. I’m sure Oracle loses a few customers this way to SAP every year, and vice-versa.
  • Drastic obsolescence. This can take a few forms, mainly:
    • Been there, done that.
    • Enterprise outgrows the capabilities of the current app suite. Oracle’s not going to lose much business that way.
    • Major platform shift. Going forward, that means SaaS/”cloud” (Software as a Service).

And so the main “opportunity” for Oracle to lose application market share is in the transition to the cloud.

 

Putting this all together …

A typical large-enterprise Oracle customer has 1000s of apps running on Oracle. The majority would be easy to port to some other system, but the exceptions to that rule are numerous enough to matter — a lot. Thus, Oracle has a secure place at that customer until such time as its applications are mainly swept away and replaced with something new.

But what about new apps? In many cases, they’ll arise in areas where Oracle’s position isn’t strong.

  • New third-party apps are likely to come from SaaS vendors. Oracle can reasonably claim to be a major SaaS vendor itself, and salesforce.com has a complex relationship with the Oracle RDBMS. But on the whole, SaaS vendors aren’t enthusiastic Oracle adopters.
  • New internet-oriented apps are likely to focus on customer/prospect interactions (here I’m drawing the (trans)action/interaction distinction) or even more purely machine-generated data (“Internet of Things”). The Oracle RDBMS has few advantages in those realms.
  • Further, new apps — especially those that focus on data external to the company — will in many cases be designed for the cloud. This is not a realm of traditional Oracle strength.

And that is why I think the answer to this post’s title question is probably “Yes”.

 

Related links

A significant fraction of my posts, in this blog and Software Memories alike, are probably at least somewhat relevant to this sweeping discussion. Particularly germane is my 2012 overview of Oracle’s evolution. Other posts to call out are my recent piece on transitioning to the cloud, and my series on enterprise application history.


Readings in Database Systems

Thu, 2015-12-10 06:26

Mike Stonebraker and Larry Ellison have numerous things in common. If nothing else:

  • They’re both titanic figures in the database industry.
  • They both gave me testimonials on the home page of my business website.
  • They both have been known to use the present tense when the future tense would be more accurate. :)

I mention the latter because there’s a new edition of Readings in Database Systems, aka the Red Book, available online, courtesy of Mike, Joe Hellerstein and Peter Bailis. Besides the recommended-reading academic papers themselves, there are 12 survey articles by the editors, and an occasional response where, for example, editors disagree. Whether or not one chooses to tackle the papers themselves — and I in fact have not dived into them — the commentary is of great interest.

But I would not take every word as the gospel truth, especially when academics describe what they see as commercial market realities. In particular, as per my quip in the first paragraph, the data warehouse market has not yet gone to the extremes that Mike suggests,* if indeed it ever will. And while Joe is close to correct when he says that the company Essbase was acquired by Oracle, what actually happened is that Arbor Software, which made Essbase, merged with Hyperion Software, and the latter was eventually indeed bought by the giant of Redwood Shores.**

*When it comes to data warehouse market assessment, Mike seems to often be ahead of the trend.

**Let me interrupt my tweaking of very smart people to confess that my own commentary on the Oracle/Hyperion deal was not, in retrospect, especially prescient.

Mike pretty much opened the discussion with a blistering attack on hierarchical data models such as JSON or XML. To a first approximation, his views might be summarized as:

  • Logical hierarchical models can be OK in certain cases. In particular, JSON could be a somewhat useful datatype in an RDBMS.
  • Physical hierarchical models are horrible.
  • Rather, you should implement the logical hierarchical model over a columnar RDBMS.

My responses start:

  • Nested data structures are more important than Mike’s discussion seems to suggest.
  • Native XML and JSON stores are apt to have an index on every field. If you squint, that index looks a lot like a column store. (There’s a toy illustration after this list.)
  • Even NoSQL stores should and I think in most cases will have some kind of SQL-like DML (Data Manipulation Language). In particular, there should be some ability to do joins, because total denormalization is not always a good choice.
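
Here is the toy illustration promised above of why an index on every field resembles a column store: each field’s inverted index maps values to the documents containing them, so scanning one field’s index touches roughly the data a column scan would.

    from collections import defaultdict

    docs = [
        {"id": 1, "city": "Boston", "amount": 100},
        {"id": 2, "city": "Austin", "amount": 250},
        {"id": 3, "city": "Boston", "amount": 75},
    ]

    # One inverted index per field: value -> ids of documents with that value.
    indexes = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        for field, value in doc.items():
            if field != "id":
                indexes[field][value].append(doc["id"])

    # Reading indexes["city"] visits every (value, ids) pair for that field,
    # i.e. a scan of the "city" column that never touches other fields.
    print(dict(indexes["city"]))  # {'Boston': [1, 3], 'Austin': [2]}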

In no particular order, here are some other thoughts about or inspired by the survey articles in Readings in Database Systems, 5th Edition.

  • I agree that OLTP (OnLine Transaction Processing) is transitioning to main memory.
  • I agree with the emphasis on “data in motion”.
  • While I needle him for overstating the speed of the transition, Mike is right that columnar architectures are winning for analytics. (Or you could say they’ve won, if you recognize that mop-up from the victory will still take 1 or 2 decades.)
  • The guys seem to really hate MapReduce, which is an old story for Mike, but a bit of a reversal for Joe.
  • MapReduce is many things, but it’s not a data model, and it’s also not something that Hadoop 1.0 was an alternative to. Saying each of those things was sloppy writing.
  • The guys characterize consistency/transaction isolation as a rather ghastly mess. That part was an eye-opener.
  • Mike is a big fan of arrays. I suspect he’s right in general, although I also suspect he’s overrating SciDB. I also think he’s somewhat overrating the market penetration of cube stores, aka MOLAP.
  • The point about Hadoop (in particular) and modern technologies in general showing the way to modularization of DBMS is an excellent one.
  • Joe and Mike disagreed about analytics; Joe’s approach rang truer for me. My own opinion is:
    • The challenge of whether anybody wants to do machine learning (or other advanced analytics) over a DBMS is sidestepped in part by the previously mentioned point about the modularization of a DBMS. Hadoop, for example, can be both an OK analytic DBMS (although not fully competitive with mature, dedicated products) and of course also an advanced analytics framework.
    • Similarly, except in the short term I’m not worried about the limitations of Spark’s persistence mechanisms. Almost every commercial distribution of Spark I can think of is part of a package that also contains a more mature data store.
    • Versatile DBMS and analytic frameworks suffer strategic contention for memory, with different parts of the system wanting to use it in different ways. Raising that as a concern about the integration of analytic DBMS with advanced analytic frameworks is valid.
  • I used to overrate the importance of abstract datatypes, in large part due to Mike’s influence. I got over it. He should too. :) They’re useful, to the point of being a checklist item, but not a game-changer. A big part of the problem is what I mentioned in the previous point — different parts of a versatile DBMS would prefer to do different things with memory.
  • I used to overrate the importance of user-defined functions in an analytic RDBMS. Mike had nothing to do with my error. :) I got over it. He should too. They’re useful, to the point of being a checklist item, but not a game-changer. Looser coupling between analytics and data management seems more flexible.
  • Excellent points are made about the difficulties of “First we build the perfect schema” data warehouse projects and, similarly, MDM (Master Data Management).
  • There’s an interesting discussion that helps explain why optimizer progress is so slow (both for the industry in general and for each individual product).

Related links

  • I did a deep dive into MarkLogic’s indexing strategy in 2008, which informed my comment about XML/JSON stores above.
  • Again with MarkLogic as the focus, in 2010 I was skeptical about document stores not offering joins. MarkLogic has since capitulated.
  • I’m not current on SciDB, but I did write a bit about it in 2010.
  • I’m surprised that I can’t find a post to point to about modularization of DBMS. I’ll leave this here as a placeholder until I can.
  • Edit: As promised, I’ve now posted about the object-relational/abstract datatype boom of the 1990s.

Transitioning to the cloud(s)

Mon, 2015-12-07 11:48

There’s a lot of talk these days about transitioning to the cloud, by IT customers and vendors alike. Of course, I have thoughts on the subject, some of which are below.

1. The economies of scale of not running your own data centers are real. That’s the kind of non-core activity almost all enterprises should outsource. Of course, those considerations taken alone argue equally for true cloud, co-location or SaaS (Software as a Service).

2. When the (Amazon) cloud was newer, I used to hear that certain kinds of workloads didn’t map well to the architecture Amazon had chosen. In particular, shared-nothing analytic query processing was necessarily inefficient. But I’m not hearing nearly as much about that any more.

3. Notwithstanding the foregoing, not everybody loves Amazon pricing.

4. Infrastructure vendors such as Oracle would like to also offer their infrastructure to you in the cloud. As per the above, that could work. However:

  • Is all your computing on Oracle’s infrastructure? Probably not.
  • Do you want to move the Oracle part and the non-Oracle part to different clouds? Ideally, no.
  • Do you like the idea of being even more locked in to Oracle than you are now? [Insert BDSM joke here.]
  • Will Oracle do so much better of a job hosting its own infrastructure that you use its cloud anyway? Well, that’s an interesting question.

Actually, if we replace “Oracle” by “Microsoft”, the whole idea sounds better. While Microsoft doesn’t have a proprietary server hardware story like Oracle’s, many folks are content in the Microsoft walled garden. IBM has fiercely loyal customers as well, and so may a couple of Japanese computer manufacturers.

5. Even when running stuff in the cloud is otherwise a bad idea, there’s still:

  • Test and dev(elopment) — usually phrased that way, although the opposite order makes more sense.
  • Short-term projects — the most obvious examples are in investigative analytics.
  • Disaster recovery.

So in many software categories, almost every vendor should have a cloud option of some kind.

6. Reasons for your data to wind up in a plurality of remote data centers include:

  • High availability, and similarly disaster recovery. Duh.
  • Second-source/avoidance of lock-in.
  • Geo-compliance.
  • Particular SaaS offerings being hosted in different places.
  • Use of both true cloud and co-location for different parts of your business.

7. “Mostly compatible” is by no means the same as “compatible”, and confusing the two leads to tears. Even so, “mostly compatible” has stood the IT industry in good stead multiple times. My favorite examples are:

  • SQL
  • UNIX (before Linux).
  • IBM-compatible PCs (or, as Ben Rosen used to joke, Compaq-compatible).
  • Many cases in which vendors upgrade their own products.

I raise this point for two reasons:

  • I think Amazon/OpenStack could be another important example.
  • A vendor offering both cloud and on-premises versions of their offering, with minor incompatibilities between the two, isn’t automatically crazy.

8. SaaS vendors, in many cases, will need to deploy in many different clouds. Reasons include:

That said, there are of course significant differences between, for example:

  • Deploying to Amazon in multiple regions around the world.
  • Deploying to Amazon plus a variety of OpenStack-based cloud providers around the world, e.g. some “national champions” (perhaps subsidiaries of the main telecommunications firms).*
  • Deploying to Amazon, to other OpenStack-based cloud providers, and also to an OpenStack-based system that resides on customer premises (or in their co-location facility).

9. The previous point, and the last bullet of the one before that, are why I wrote in a post about enterprise app history:

There’s a huge difference between designing applications to run on one particular technology stack, vs. needing them to be portable across several. As a general rule, offering an application across several different brands of almost-compatible technology — e.g. market-leading RDBMS or (before the Linux era) proprietary UNIX boxes — commonly works out well. The application vendor just has to confine itself to relying on the intersection of the various brands’ feature sets.*

*The usual term for that is the spectacularly incorrect phrase “lowest common denominator”.

Offering the “same” apps over fundamentally different platform technologies is much harder, and I struggle to think of any cases of great success.

10. Decisions on where to process and store data are of course strongly influenced by where and how the data originates. In broadest terms:

  • Traditional business transaction data at large enterprises is typically managed by on-premises legacy systems. So legacy issues arise in full force.
  • Internet interaction data — e.g. web site clicks — typically originates in systems that are hosted remotely. (Few enterprises run their websites on premises.) It is tempting to manage and analyze that data where it originates. That said:
    • You often want to enhance that data with what you know from your business records …
    • … which is information that you may or may not be willing to send off-premises.
  • “Phone-home” IoT (Internet of Things) data, from devices at — for example — many customer locations, often makes sense to receive in the cloud. Once it’s there, why not process and analyze it there as well?
  • Machine-generated data that originates on your premises may never need to leave them. Even if their origins are as geographically distributed as customer devices are, there’s a good chance that you won’t need other cloud features (e.g. elastic scalability) as much as in customer-device use cases.


Machine learning’s connection to (the rest of) AI

Tue, 2015-12-01 03:28

This is part of a four-post series spanning two blogs.

1. I think the technical essence of AI is usually:

  • Inputs come in.
  • Decisions or actions come out.
  • More precisely — inputs come in, something intermediate is calculated, and the intermediate result is mapped to a decision or action.
  • The intermediate results are commonly either numerical (a scalar or perhaps a vector of scalars) or a classification/partition into finitely many possible intermediate outputs.

Of course, a lot of non-AI software can be described the same way.

To check my claim, please consider:

  • It fits rules engines/expert systems so simply it’s barely worth saying.
  • It fits any kind of natural language processing; the intermediate results might be words or phrases or concepts or whatever.
  • It fits machine vision beautifully.

To see why it’s true from a bottom-up standpoint, please consider the next two points.

2. It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response. Examples of what I mean include: 

  • Think of what’s on an IQ test, or a commonly accepted substitute for same. (The SAT sometimes substitutes.) A lot of that is pattern recognition.
  • When the “multiple intelligences” or just “emotional intelligence” concepts gained currency, the core idea was the recognition of various different kinds of pattern. (E.g., reading somebody else’s emotions, something that I’m not nearly as good at as I am at the skills measured by standard IQ tests.)
  • The central mechanism of neurotransmission is a neuron recognizing that its membrane potential has crossed a certain threshold, and firing as a result.
  • Traditional areas of AI include natural language recognition, machine vision, and so on.
  • Another traditional area of AI is rules-based processing — conditions in, decision out.
  • Back in the 1980s (less so today), it was thought that a core underpinning for AI technology was knowledge representation. That said, as much as I like interesting data structures, I have my doubts.
    • The Semantic Web grew out of this idea.
    • Also, the single most enduring proponent of the centrality of knowledge representation was probably Doug Lenat, who gave his name to a famed unit of bogosity.
    • While the previous two points are probably just coincidence, the juxtaposition is suggestive. :)

3. In most computational cases, pattern recognition and response boil down to scoring and/or classification (whether in a narrow machine learning sense of “classification” or otherwise). What I mean by this is (a toy sketch follows the list):

  • I’m thinking of scoring as a function that maps inputs into scalar values. (Or a vector of scalars.)
  • I’m thinking of classification as a function that maps inputs into a finite range of possible values. (Note that this is mathematically equivalent to a finite partition on the set of inputs.)
  • I’m also assuming that the system maps each possible score or classification to a decision or response (deterministically or probabilistically as the case may be).
  • Then if you compose the two maps, you wind up with a function from {possible input patterns} to {possible responses}.
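
Here is the toy sketch promised above, with made-up weights and thresholds; the point is just the composition of maps.

    def score(features):
        # Map inputs to a scalar; the weights are invented.
        return 0.8 * features["clicks"] + 0.2 * features["dwell_seconds"]

    def classify(s):
        # Map the scalar into a finite set of classes (equivalently, a
        # finite partition of the input space).
        return "engaged" if s >= 10.0 else "indifferent"

    RESPONSES = {"engaged": "show_offer", "indifferent": "do_nothing"}

    def respond(features):
        # The composed map: {possible input patterns} -> {possible responses}.
        return RESPONSES[classify(score(features))]

    print(respond({"clicks": 12, "dwell_seconds": 30}))  # show_offer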

4. If you want a good algorithm for classification, of course, it’s natural to pursue it via machine learning. And the same is true of scoring, at least if we recall that the domains of machine learning and statistics have essentially merged.

5. It took people remarkably long to figure out the previous point. Through at least the end of the previous century, it was generally assumed that the way to come up with clever algorithms for, for example, text analytics or machine vision was — well, to think them up.

6. As spelled out in my overview of present-day commercial AI, there’s a somewhat paradoxical industry structure, in that:

  • Even though machine learning is a sine qua non of many businesses, tech and non-tech alike …
  • … the rest of AI is largely concentrated at a few behemoth technology companies.

Of course, there are plenty of startups hoping to change that structure. I hope some of them succeed.


What is AI, and who has it?

Tue, 2015-12-01 03:25

This is part of a four-post series spanning two blogs.

1. “Artificial intelligence” is a term that usually means one or more of:

  • “Smart things that computers can’t do yet.”
  • “Smart things that computers couldn’t do until recently.”
  • “Technology that has emerged from the work of computer scientists who said they were doing AI.”
  • “Underpinnings for other things that might be called AI.”

But that covers a lot of ground, especially since reasonable people might disagree as to what constitutes “smart”.

2. Examples of what has been called “AI” include:

  • Rule-based processing, especially if it is referred to as “expert systems”.
  • Machine learning.
  • Many aspects of “natural language processing” — a term almost as overloaded as “artificial intelligence” — including but not limited to:
    • Text search.
    • Speech recognition, especially but not only if it seems somewhat lifelike.
    • Automated language translation.
    • Natural language database query.
  • Machine vision.
  • Autonomous vehicles.
  • Robots, especially but not only ones that seem somewhat lifelike.
  • Automated theorem proving.
  • Playing chess at an Elo rating of 1600 or better.
  • Beating the world champion at chess.
  • Beating the world champion at Jeopardy.
  • Anything that IBM brands or rebrands as “Watson”.

That last bit is awkward, as IBM is doing the industry a major disservice via its recklessly confusing Watson marketing, which is instantiating Monash’s First Law of Commercial Semantics — Bad jargon drowns out good. I suspect there’s an interesting debate under it all, in which IBM stands almost alone against the whole rest of the industry by sticking to the old academic belief that sophisticated knowledge representation is the key to AI. But it’s hard to be sure, because IBM’s Watson marketing is so full of smoke that reality, if any, doesn’t show through.

3. When I think of present-day AI commercialization, what comes to mind is mainly:

  • Multiple efforts in speech recognition, from Google, Microsoft, Apple, and Nuance Communications. (I’m not sure whether Apple’s is mainly in-house or mainly outsourced.)
  • Other natural language efforts, such as Google’s in machine translation.
  • Technology related to robots and autonomous vehicles, specifically in machine vision, other senses (e.g. touch), and reactions (e.g. driving decisions).
    • Google is the most visible player here. It’s gotten a lot of press for driverless automobiles, and it bought up a lot of robotics companies when they were hurting due to a hiatus in DARPA funding.
    • Large auto companies will surely compete.
  • Gesture interpretation and similar kinds of recognition.
    • Microsoft has the most visibility here, due to Kinect, and is trying to bring similar technology to general computing.
    • Facebook, Google et al. are making major investments into the closely related area of virtual reality. Facebook is also building an AI team.
  • Machine learning.
    • Machine learning in general can be regarded as part of AI, at least historically.
    • Machine learning is a key component of many AI efforts. Google in particular has made a big fuss about it, suggesting that data is generally more important than algorithms.
  • Whatever parts of the IBM story, if any, are actually real.

So with one big exception, commercial AI seems to be concentrated at a small number of behemoth companies. The exception is machine learning itself, which is being adopted and developed on a much broader basis.

4. AngelList seems to say I’m wrong, citing 576 different AI startups. CrunchBase offers 436 AI startups. So maybe some of those startups will succeed. We’ll see.

5. Some of the reasons for AI’s concentrated industry structure lie in general business and economics.

  • A large company can risk research with unclear payoffs a lot more easily than a small one can.
  • AI is prestigious and/or cool. Some large companies like to indulge in stuff like that.

Yes, those reasons are somewhat counteracted by the facts that:

  • VCs know they’re investing in companies whose eventual exit will likely be an acquisition.
  • Some of those acquisitions are for a LOT of money.

But I think they apply even so. And by the way — to date, most AI companies have not been acquired for very high prices.

6. Some of the reasons for AI industry concentration are more specifically technological.

  • Some AI — e.g. speech recognition or autonomous vehicle navigation — could be the “sizzle” that differentiates offerings in huge business sectors. Thus, a “win” in AI could have more value to an already-large electronics, search or automobile company than to a startup.
  • The largest companies in those huge sectors can afford huge amounts of training data, or may even get it as a byproduct of their other activities. Hence they can more easily afford massive exercises in the relevant machine learning.

My paradigmatic example for the latter point is Google with anything connected to search, such as translation (which it does of search results) or natural language recognition (which it does of search queries).

If you want to do an AI startup, those are some of the competitive factors that you need to beat.

Related links

  • An earlier version of some of this material was in my January, 2014 post on The games of Watson.
  • Earlier this year, I posted about robotics.
  • There is quite a bit of AI humor.

Splunk engages in stupid lawyer tricks

Wed, 2015-11-25 08:14

Using legal threats as an extension of your marketing is a bad idea. At least, it’s a bad idea in the United States, where such tactics are unlikely to succeed, and are apt to backfire instead. Splunk seems to actually have had some limited success intimidating Sumo Logic. But it tried something similar against Rocana, and I was set up to potentially be collateral damage. I don’t think that’s working out very well for Splunk.

Specifically, Splunk sent a lawyer letter to Rocana, complaining about a couple of pieces of Rocana marketing collateral. Rocana responded publicly, and posted both the Splunk letter and Rocana’s lawyer response. The Rocana letter eviscerated Splunk’s lawyers on matters of law, clobbered them on the facts as well, exposed Splunk’s similar behavior in the past, and threw in a bit of snark at the end.

Now I’ll pile on too. In particular, I’ll note that, while Splunk wants to impose a duty of strict accuracy upon those it disagrees with, it has fewer compunctions about knowingly communicating falsehoods itself.

1. Splunk’s letter insinuates that Rocana might have paid me to say what I blogged about them. Those insinuations are of course false.

Splunk was my client for a lot longer, and at a higher level of annual retainer, than Rocana so far has been. Splunk never made similar claims about my posts about them. Indeed, Splunk complained that I did not write about them often or favorably enough, and on at least one occasion seemed to delay renewing my services for that reason.

2. Similarly, Splunk’s letter makes insinuations about quotes I gave Rocana. But I also gave at least one quote to Splunk when they were my client. As part of the process — and as is often needed — I had a frank and open discussion with them about my quote policies. So Splunk should know that their insinuations are incorrect.

3. Splunk’s letter actually included the sentences 

Splunk can store data in, and analyze data across, Splunk, SQL, NoSQL, and Hadoop data repositories. Accordingly, the implication that Splunk cannot scale like Hadoop is misleading and inaccurate.

I won’t waste the time of this blog’s readers by explaining how stupid that is, except to point out that I don’t think Splunk executes queries entirely in Hadoop. If you want to consider the matter further, you might consult my posts regarding Splunk HPAS and Splunk Hunk.

4. I and many other people have heard concerns about the cost of running Splunk for high volumes of data ingest. Splunk’s letter suggests we’re all making this up. This post suggests that Splunk’s lawyers can’t have been serious.


The questionably named Cloudera Navigator Optimizer

Thu, 2015-11-19 05:55

I only have mixed success at getting my clients to reach out to me for messaging advice when they’re introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I hate the name. I hate some of the slides I saw. But I do like one part of the messaging, namely the statement that this is about “refactoring” queries.

All messaging quibbles aside, I think the Cloudera Navigator Optimizer story is actually pretty interesting, and perhaps not just to users of SQL-on-Hadoop technologies such as Hive (which I guess I’d put in that category for simplicity) or Impala. As I understand Cloudera Navigator Optimizer:

  • It’s all about analytic SQL queries.
  • Specifically, it’s about reducing duplicated work.
  • It is not an “optimizer” in the ordinary RDBMS sense of the word.
  • It’s delivered via SaaS (Software as a Service).
  • Conceptually, it’s not really tied to SQL-on-Hadoop. However, …
  • … in practice it likely will be used by customers who want to optimize performance of Cloudera’s preferred styles of SQL-on-Hadoop, either because they’re already using SQL-on-Hadoop or in connection with an initial migration.

It grows out of Xplain.io, which started with the intention of being a general workload optimizer for Hadoop and wound up with this beta announcement of a tuning adviser for analytic SQL.

Right now, the Cloudera Navigator Optimizer service is:

  • Query code in.
  • Information and advice out.

Naturally, Cloudera’s intention — perhaps as early as at first general availability — is for the output to start including something that’s more like automation, e.g. hints for the Impala optimizer.

As Anupam Singh describes it, there are basically four kinds of problems that Cloudera Navigator Optimizer can help with:

  • ETL (Extract/Transform/Load) might repeat the same operation over and over again, e.g. joining to a reference table to help with data cleaning. It can be an optimization to consolidate some of that work. (The same would surely also be true in cases where the workload is more properly described as ELT.)
  • For business intelligence it is often helpful to materialize aggregates or result sets. (This is, of course, why materialized views were invented in the first place.)
  • Queries-from-hell — perhaps thousands of lines of SQL long — can perhaps be usefully rewritten into a sequence of much shorter queries.
  • Ad-hoc query workloads can have enough repetition that there’s opportunity for similar optimizations. Anupam thinks his technology has enough intelligence to detect some of these patterns.

Actually, all four of these cases can involve materializing tables so that they don’t need to keep being in part or whole recreated.

In essence, then, this is a way to add in more query pipelining than the underlying data store automagically provides on its own. And that seems to me like a very good idea to try. The whole thing might be worth trying out at least once, even if your analytic RDBMS installation has nothing to do with Hadoop at all.
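
To make the consolidation idea concrete, here is a hand-done example of the kind of refactoring described above. The SQL is carried in Python strings purely for illustration, and all table and column names are invented; this is my rendering of the idea, not the tool’s actual output.

    # Before: two queries each redo the same join against a reference table.
    q1 = """SELECT c.region, SUM(o.amount)
            FROM orders o JOIN customers c ON o.cust_id = c.id
            GROUP BY c.region"""
    q2 = """SELECT c.segment, COUNT(*)
            FROM orders o JOIN customers c ON o.cust_id = c.id
            GROUP BY c.segment"""

    # After: materialize the shared join once, then aggregate the result.
    materialize = """CREATE TABLE enriched_orders AS
                     SELECT o.amount, c.region, c.segment
                     FROM orders o JOIN customers c ON o.cust_id = c.id"""
    q1_new = "SELECT region, SUM(amount) FROM enriched_orders GROUP BY region"
    q2_new = "SELECT segment, COUNT(*) FROM enriched_orders GROUP BY segment"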


CDH 5.5

Thu, 2015-11-19 05:52

I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:

  • Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
  • Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
    • The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need. (Illustrated in the sketch after this list.)
    • From a feature standpoint, we’re definitely still in the early days.
      • When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
      • Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
    • This is for Parquet first, Avro next, and presumably eventually native JSON as well.
    • This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
  • Cloudera is increasing its coverage of Spark in several ways.
    • Cloudera is adding support for MLlib.
    • Cloudera is adding support for SparkSQL. More on that below.
    • Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
      • More “platform” stuff from the Hadoop stack (e.g. for data ingest).
      • Less in the way of specific Spark usability stuff.
  • Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
  • Impala and Hive are getting column-level security via Apache Sentry.
  • There are other security enhancements.
  • Some policy-based information lifecycle management is being added as well.
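
As promised above, an illustration of the dot notation. The schema is invented, and this is my approximate rendering of the syntax, so treat the details with caution.

    # A STRUCT column's fields are reached with dots.
    q_struct = "SELECT id, address.city, address.zip FROM customers"

    # An ARRAY of nested records is exposed as something join-able,
    # so that individual elements become queryable rows.
    q_array = """SELECT c.id, o.order_id, o.amount
                 FROM customers c, c.orders o
                 WHERE o.amount > 100"""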

While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:

  • Petabyte scale databases — at least one clear case for Impala/business intelligence only, and the likelihood that the Impala/BI part of other bigger installations was also in that range.
  • Hundreds of nodes.
  • 10s of simultaneous queries in dashboard use cases.
  • 1 – 3 million queries/month as a common figure.

Cloudera also expressed the opinions that:

  • An “overwhelming majority” of Cloudera customers have adopted Impala. (I imagine there’s a bit of hyperbole in that — for one thing, Cloudera has a pricing option in which Impala is not included.)
  • It is common for Impala customers to use Hive for “data preparation”.
  • SparkSQL has “order of magnitude” less performance than Impala, but a little more performance than Hive running over either Spark or Tez.
  • SparkSQL’s main use cases are (and these overlap heavily):
    • As part of an analytic process (as opposed to straightforwardly DBMS-like use).
    • To persist data outside the confines of a single Spark job.

 


Issues in enterprise application software

Wed, 2015-11-11 07:39

1. I think the next decade or so will see much more change in enterprise applications than the last one. Why? Because the unresolved issues are piling up, and something has to give. I intend this post to be a starting point for a lot of interesting discussions ahead.

2. The more technical issues I’m thinking of include:

  • How will app vendors handle analytics?
  • How will app vendors handle machine-generated data?
  • How will app vendors handle dynamic schemas?
  • How far will app vendors get with social features?
  • What kind of underlying technology stacks will app vendors drag along?

We also always have the usual set of enterprise app business issues, including:

  • Will the current leaders — SAP, Oracle and whoever else you want to include — continue to dominate the large-enterprise application market?
  • Will the leaders in the large-enterprise market succeed in selling to smaller markets?
  • Which new categories of application will be important?
  • Which kinds of vendors and distribution channels will succeed in serving small enterprises?

And perhaps the biggest issue of all, intertwined with most of the others, is:

  • How will the move to SaaS (Software as a Service) play out?

3. I’m not ready to answer those questions yet, but at least I’ve been laying some groundwork.

Along with this post, I’m putting up a three post series on the history of enterprise apps. Takeaways include but are not limited to:

  • Application software is a very diverse area. Different generalities apply to different parts of it.
  • A considerable fraction of application software has always been sold with the technology stack being under vendor control. Examples include most app software sold to small and medium enterprises, and much of the application software that Oracle sells.
  • Apps that are essentially distributed have often relied on different stacks than single-site apps. (Duh.)

4. Reasons I see for the enterprise apps area having been a bit dull in recent years include:

5. But I did do some work in the area even so. :) Besides posts linked above, other things I wrote relevant to the present discussion include:

 

Categories: Other

Differentiation in business intelligence

Mon, 2015-10-26 13:34

Parts of the business intelligence differentiation story resemble the one I just posted for data management. After all:

  • Both kinds of products query and aggregate data.
  • Both are offered by big “enterprise standard” behemoth companies and also by younger, nimbler specialists.
  • You really, really, really don’t want your customer data to leak via a security breach in either kind of product.

That said, insofar as BI’s competitive issues resemble those of DBMS, they are those of DBMS-lite. For example:

  • BI is less mission-critical than some other database uses.
  • BI has done a lot less than DBMS to deal with multi-structured data.
  • Scalability demands on BI are less than those on DBMS — indeed, BI's demands are whatever is left over after the DBMS has done its data crunching.

And full-stack analytic systems — perhaps delivered via SaaS (Software as a Service) — can moot the BI/data management distinction anyway.

Of course, there are major differences between how DBMS and BI are differentiated. The biggest are in user experience. I’d say:

  • For many people, BI is the user experience over the underlying data store(s).
  • Two crucial aspects of user experience are navigational power and speed of response.
    • At one extreme, people hated the old green paper reports.
    • At the other, BI in the QlikView/Tableau era is one of the few kinds of enterprise software that competes on the basis of being fun to use.
    • This is also somewhat true with respect to snazzy BI demos, such as interactive maps or way-before-their-day touch screens.*
  • Features like collaboration and mobile UIs also matter.
  • Since BI is commonly adopted via quick departmental projects — at least as the hoped-for first-step of a “land-and-expand” campaign — administrative usability is at a premium as well.

* Computer Pictures and thus Cullinet used a touch screen over 30 years ago. Great demo, but not so useful as an actual product, due to the limitations on data structure.

Where things get tricky is in my category of accuracy. In the early 2000s, I pitched and wrote a white paper arguing that BI helps bring “integrity” to an enterprise in various ways. But I don’t think BI vendors have done a good job of living up to that promise.

  • They’ve moved slowly in accuracy-intensive areas such as alerting or predictive modeling.
  • “Single source of truth” and similar protestations turned out to be much oversold.

Indeed, it’s tempting to say that business intelligence has been much too stupid. :) I really like some attempts to make BI sharper, e.g. at Rocana or ClearStory, but it remains to be seen whether many customers care about their business intelligence actually being smart.

So how does all this fit into my differentiation taxonomy/framework? Referring liberally to what has already been written above, we get:

  • Scope:
    • For traditional tabular analysis, BI products compete on a bunch of UI features.
    • Non-tabular analysis is much more primitive. Event series interfaces may be the closest thing to an exception.
    • Collaboration is in the mix as well.
  • Accuracy: I discussed this one above.
  • Other trustworthiness:
    • Security is a big deal.
    • Mission-critical robustness is usually, in truth, just a nice-to-have. But some (self-)important executives may disagree. :)
  • Speed:
    • For some functionality — e.g. cross-database joins — BI tools almost have to rely on their own DBMS-like engines for performance.
    • For other functionality it’s more optional. You can run single-RDBMS queries straight against the underlying system, or you can pre-position some of the data in memory.
    • Please also see the adoption and administration section below.
  • User experience: I discussed this one above.
  • Adoption and administration:
    • When BI is “owned” by a department, especially one that also doesn’t manage the underlying data, set-up and administration need to be super-easy.
    • Sometimes, departmental BI is used as an excuse to pressure central IT into making data available.
    • Much like analytic DBMS, BI adoption can sometimes be tied to huge first-time-data-warehouse building projects.
    • Administration of big enterprise-standard BI is, to re-use a term, much like DBMS-lite.
  • Cost: The true cost of BI usage is commonly governed more by the underlying data management (and data acquisition) than by the BI software (and supporting servers) itself. That said:
    • BI “hard” costs — licenses, servers, cloud fees, whatever — commonly have to fit into departmental budgets.
    • So do BI people costs.
    • BI people requirements also often have to fit into departmental skill sets.

Categories: Other

Differentiation in data management

Mon, 2015-10-26 13:32

In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:

  • Scope
  • Accuracy
  • (Other) trustworthiness
  • Speed
  • User experience
  • Cost

and sometimes also issues in adoption and administration.

Now let’s use this framework to examine two market categories I cover — data management and, in a separate post, business intelligence.

Applying this taxonomy to data management:

  • Scope: Different subcategories of data management technology are suitable for different kinds of data, different scale of data, etc. To a lesser extent that may be true within a subcategory as well.
  • Scope: Further, products may differ in what you can do with the data, especially analytically.
  • Accuracy: Don’t … lose … data.
  • Other trustworthiness:
    • Uptime, availability and so on are big deals in many data management sectors.
    • Security is hugely important for data that both belongs to other people — usually your customers — and is accessible via the internet. It’s important in numerous other database use cases as well.
    • Awkwardly, the CAP Theorem teaches us that there can be a bit of a trade-off between availability and (temporary) accuracy.
  • Speed:
    • Different kinds of data management products perform differently in different use cases.
    • If your use case is down the middle of what a mature data management subsector focuses on, performance may not vary much among individual leading products.
    • Even then, tuning effort may be quite different for different products.
  • User experience:
    • Users rarely interact directly with database management products.
    • There can be clear differentiation in database administration UIs. (The most dramatic example was perhaps the rise of Microsoft SQL Server.)
    • Data manipulation languages (DMLs) can make a huge difference in programmers’ lives.
  • Cost:
    • License and maintenance costs can be a huge issue, especially if you’re buying from traditional vendors.
    • Performance affects cost in a few ways: hardware costs for sure, tuning effort in some cases, and occasionally even vendor license/maintenance fees.
    • Ongoing operations costs can vary greatly by database product in general, and by your pre-existing in-house expertise in particular.
    • Ease of programming can sometimes lead to significant programming cost differences as well.
  • Adoption: This one is often misunderstood.
    • The effort of adopting new database technology for new applications is often overrated. When projects are huge, it’s often because of what you’re doing with the technology, not because of the technology itself.
    • Migration, however, is usually a bitch.

For reasons of length, I’m doing a separate post on differentiation in business intelligence.

Categories: Other

Sources of differentiation

Mon, 2015-10-26 13:31

Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.

Many buying and design considerations for IT fall into six interrelated areas: 

  • Scope: What does the technology even purport to do? This consideration applies to pretty much everything. :)
    • Usually, this means something like features.
    • However, there’s an important special case in which the most important “feature” is the information content itself. (Examples: Arguably Google, and the Bloomberg service for sure.)
  • Accuracy: How correctly does the technology do it? This can take multiple forms.
    • Sometimes, a binary right/wrong distinction pretty much suffices, with an acceptable error rate of zero. If you’re writing data, it shouldn’t get lost. If you’re doing arithmetic, it should be correct. Etc.
    • Sometimes, there’s a clear right/wrong distinction, but error rates are necessarily non-zero, often with a trade-off between the rates for false positives and false negatives. (In text search and similar areas, those rates are measured respectively as precision and recall; see the worked example just after this list.) Security is a classic example. Many other cases arise when trying to identify problems of various kinds.
    • Sometimes accuracy is on a scale. Predictive modeling results are commonly of that kind. So are text search, voice recognition and so on.
  • Other trustworthiness.
    • Reliability, availability and security are considerations in almost any IT scenario.
    • Also crucial are any factors that are perceived as affecting the risk of project failure. Sometimes, these are lumped together as (part of) maturity.
  • Speed. There’s a great real and/or perceived “need for speed”.
    • On the user level:
      • There are many advantages to quick results, “real time” or otherwise.
      • In particular, analysis is often more accurate if you have time for more iterations or intermediate steps.
      • Please recall that speed can actually have multiple kinds of benefit. For example, it can reduce costs, it can improve accuracy, it can improve user experience, or it can enable capabilities that would otherwise be wholly impractical.
    • There can also be considerations of time to (initial) value, although people sometimes overrate how often this is a function of the technology itself.
    • Consistency of performance can be an important aspect of product maturity.
  • User experience. Ideally, using a system is easy and pleasurable, or at least not unpleasant.
    • Ease of use often equates to ease of (re)learning …
    • … but there are exceptions, generally for what might be considered “power users”.
    • Speed and performance can avoid a lot of unpleasant frustration.
    • In some cases you can compel somebody — usually an employee — to use your interface. Often, however, you can’t, and that’s when user experience may matter most.
    • An important category of user experience that doesn’t directly equate to ease or speed is the quality of recommendations. Of course, the more accurate the recommendations are, the better.
    • Most systems have at least two categories of user experience — one for the true users, and one for the IT folks who manage it. The IT folks’ experience often depends not just on true UI features, but on how hard or difficult the underlying system is to deal with in the first place.
  • Cost, or more precisely TCO (Total Cost of Ownership). Cost is always important, and especially so if there are numerous viable alternatives.
    • Sometimes money paid to the vendor really is the largest component of TCO.
    • Often, however, hardware or IT personnel expenditures are the lion’s share of overall cost.
    • Administrators’ user experience can affect a large chunk of TCO.
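
As promised above, here is a quick worked example of the precision/recall point, with made-up numbers. Suppose a system flags 10 items as positives, of which 7 really are, while missing 3 genuine positives elsewhere:

    # Worked precision/recall example (made-up numbers).
    tp, fp, fn = 7, 3, 3  # true positives, false positives, false negatives

    precision = tp / (tp + fp)  # Of what was flagged, how much was right? -> 0.7
    recall = tp / (tp + fn)     # Of what was really there, how much was found? -> 0.7

Tightening the flagging criteria would typically raise precision while lowering recall, and vice versa; that is the trade-off.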

Related links

Categories: Other

Cassandra and privacy requirements

Thu, 2015-10-15 09:18

For starters:

But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.

The story starts:

  • Cassandra groups nodes into logical “data centers” (i.e. token rings).
  • As a best practice, each physical data center can contain one or more logical data centers, but not vice versa.
  • There are two levels of replication — within a single logical data center, and between logical data centers.
  • Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
  • However, copies of the database held elsewhere may have different replication factors …
  • … and can indeed have different replication factors for different parts of the database.

In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):

  • General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
  • Instead, you get that effect by tweaking your specific replication settings.
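
For concreteness, here is a minimal sketch of such settings in CQL, issued via the DataStax Python driver. The contact point, keyspace name and logical data center names are all hypothetical; the point is simply that NetworkTopologyStrategy takes a replication factor per logical data center, and a data center set to 0 (or omitted) holds no copies at all.

    # Minimal sketch; host, keyspace and data center names are hypothetical.
    # NetworkTopologyStrategy takes a replication factor per logical data
    # center; giving a data center 0 (or omitting it) keeps data out of it.
    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()

    # EU citizens' data: 3 replicas in the EU logical data center, zero in the US.
    session.execute("""
        CREATE KEYSPACE eu_customer_data
        WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'eu_west': 3,
            'us_east': 0
        }
    """)

Different keyspaces can carry different per-data-center factors, which is how different parts of the database get different replication.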

The most visible DataStax client using this strategy is apparently ING Bank.

If you have a geo-compliance issue, you’re probably also concerned about security. After all, the whole reason the issue arises is because one country’s government might want to look at another country’s citizens’ or businesses’ data. The DataStax security story is approximately:

  • Encryption in flight, for any Cassandra.
  • Encryption at rest, specifically with DataStax Enterprise.
  • No cell-level or row-level security until Cassandra 3.0 is introduced and established. (I didn’t actually ask whether something similar to HBase coprocessors is coming for Cassandra, but that would be my first guess.)
  • Various roles and permissions stuff.

While flexible, Cassandra’s multi-data-center features do add some complexity. Tunable-consistency choices are baked into Cassandra programs at each point data is accessed, and more data centers make for more choices. (Default best practice = write if you get a local quorum, running the slight risk of logical data centers being out of sync with each other.)
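
To make the point about choices concrete, here is a sketch of that default best practice in the Python driver. Consistency is chosen statement by statement, so every data access point is another place to make (or botch) the choice; the table and values are made up.

    # Sketch of per-statement tunable consistency (table and values made up).
    # LOCAL_QUORUM waits only for a quorum within the local logical data
    # center, accepting the slight risk that remote data centers lag behind.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["10.0.0.1"]).connect("eu_customer_data")

    write = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    session.execute(write, (42, "Maria"))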

One way in which the whole thing does seem nice and simple is that you can have different logical data centers running on different kinds of platforms — cloud, colocation, in-house, whatever — without Cassandra caring.

I’m not going to call the DataStax Enterprise approach to geo-compliance the “gold standard”, because some of it seems pretty clunky or otherwise feature-light. On the other hand, I’m not aware of competitors who exceed it, in features or track record, so “silver standard” seems defensible.

Categories: Other

Basho and Riak

Thu, 2015-10-15 09:18

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

  • Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
  • Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
  • Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
  • Basho’s revenue is ~90% subscription.
  • Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
  • Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually) comes from 2 particular deals of >$4 million each.

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

  • There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
  • Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
  • Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
  • Riak TS is for time series, and just coming out now.
  • Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
  • There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include: 

  • Riak Core doesn’t do data management. It just manages distributed operation of — well, whatever you want to operate. In part, Basho sees Riak Core as a better Apache ZooKeeper.
    • That is the essence of the Riak/Spark pitch — something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as well.
    • The Riak/Redis pitch is even simpler — cluster management for Redis, and persistent backing as well.
    • Basho’s criticisms of ZooKeeper start with “Cluster manager, manage thyself” claims about ZooKeeper availability, as in the PagerDuty ZooKeeper critique.
  • Riak KV has secondary indexing. Performance is somewhat questionable. It also has Solr indexing, which is fast.
  • At least in its 1.0 form, Riak TS assumes:
    • There’s some kind of schema or record structure.
    • There are explicit or else easily-inferred timestamps.
    • Microsecond accuracy, perfect ordering and so on are not essential.
  • Thus, Riak TS 1.0 is not ideal for the classic Splunk use case where you text index/search on a lot of log emissions. It also is not ideal for financial tick storage.
  • Riak TS has range-based partitioning, where the range is in terms of time. Basho refers to this as “locality”. (A generic sketch of the idea follows this list.)
  • Riak TS has a SQL subset. Evidently there’s decent flexibility as to which part of the database carries which schema.
  • Riak has a nice feature of allowing you to stage a change to network topology before you push it live.
  • Riak’s vector clock approach to wide-area synchronization is more controversial.
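
As flagged a few bullets up, here is a generic sketch of time-based range partitioning. It illustrates the idea, not Basho's actual implementation: rows get grouped into fixed time buckets, so a time-range query touches only the buckets that overlap the range.

    # Generic sketch of time-bucketed range partitioning (not Basho's code).
    QUANTUM = 15 * 60  # 15-minute buckets, in seconds; an arbitrary choice

    def bucket_for(ts: int) -> int:
        """Map a Unix timestamp to the start of its time bucket."""
        return ts - (ts % QUANTUM)

    def buckets_for_range(start: int, end: int):
        """Yield every bucket a query over [start, end) must visit; assumes start < end."""
        b = bucket_for(start)
        while b < end:
            yield b
            b += QUANTUM

Partitions whose buckets fall outside the queried range never see the query at all, which is presumably much of what Basho means by locality.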

Finally, notes on what Basho sees as use cases and competition include:

  • Riak KV is generally used to store usual-suspect stuff — log data, user/profile data and so on.
  • Basho thinks NoSQL is a 4-horse race — Basho/Riak KV, DataStax/Cassandra, MongoDB, Couchbase. (I would be surprised if there was much agreement with that view from, for example, MongoDB, DataStax, Aerospike, MapR or the HBase community.)
  • Basho competes on availability, scalability (including across geography) and so on, or in simplest terms:
    • “Availability and correctness”
    • Simple operation
  • Unsurprisingly, Basho thinks its closest competitor is DataStax. (However, DataStax tells me they don’t compete much with Basho.)
  • Basho thinks Riak KV has ease-of-operation advantages vs. Cassandra.
  • An example of a mission-critical Riak app is the UK National Health Service storing prescription information.
  • An example of Riak S2 and Riak KV being used together is Turner Broadcasting storing video in the former and associated metadata in the latter.
  • Riak TS is initially pointed at two use cases:
    • “Internet of Things”
    • “Metrics”, which seems to mean monitoring of system metrics.
  • Basho sees the competition for Riak TS as starting with HBase, Cassandra, and InfluxDB.

Categories: Other

Couchbase 4.0 and related subjects

Thu, 2015-10-15 09:17

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of some aspects of the Couchbase timeline.

  • 2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
  • Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
  • A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
  • By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
  • Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs :) — start:

  • There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
  • “Index”, “data” and “query” are three different services/tiers.
    • You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
    • I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
  • To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
  • There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
  • ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
  • In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.
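
For the curious, here is a toy sketch of that trick in generic code (not Couchbase's). Each partition keeps a small filter over its keys; a lookup probes the filters and skips any partition that definitely lacks the key. False positives merely cost a wasted read, while false negatives cannot happen.

    # Toy Bloom filter for partition pruning (generic sketch, not Couchbase's).
    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1024, num_hashes=3):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = 0  # one big integer used as a bit array

        def _positions(self, key):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, key):
            for pos in self._positions(key):
                self.bits |= 1 << pos

        def might_contain(self, key):
            return all(self.bits & (1 << pos) for pos in self._positions(key))

    def partitions_to_read(key, partition_filters):
        """Only fetch partitions whose filter says the key *might* be there."""
        return [pid for pid, f in partition_filters.items() if f.might_contain(key)]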

Up to a point, SQL-on-NoSQL stories can be fairly straightforward.

  • You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
  • SELECT, FROM and WHERE clauses work in the usual way.
  • Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.

For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.

*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.
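
Here is a minimal sketch of that straightforward story in generic code, with no particular vendor's API depicted: declare which JSON fields act as “columns” of the table (or view), then SELECT/WHERE against them.

    # Generic sketch of the basic SQL-on-NoSQL idea; no vendor API depicted.
    docs = [
        {"id": 1, "name": "Alice", "city": "Boston"},
        {"id": 2, "name": "Bob", "city": "Austin"},
    ]

    def select(docs, columns, where=lambda doc: True):
        """SELECT <columns> FROM docs WHERE <predicate> -- each mapped
        field behaves like a table column."""
        return [{c: d.get(c) for c in columns} for d in docs if where(d)]

    # SELECT name FROM docs WHERE city = 'Boston'
    rows = select(docs, ["name"], where=lambda d: d.get("city") == "Boston")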

JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier and does nested-loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.
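
In generic code, that execution strategy is about as simple as joins get, which is presumably its appeal:

    # Generic nested-loop join sketch -- O(len(left) * len(right)) comparisons,
    # which is tolerable once both sides have been pulled into a single tier.
    def nested_loop_join(left, right, left_key, right_key):
        for l in left:
            for r in right:
                if l.get(left_key) == r.get(right_key):
                    yield {**l, **r}

    orders = [{"order_id": 7, "cust_id": 1}]
    customers = [{"cust_id": 1, "name": "Alice"}]
    joined = list(nested_loop_join(orders, customers, "cust_id", "cust_id"))

A hash join, by contrast, would first build a lookup table on one input, a win when both inputs are large.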

But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.

Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.
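
Still, to illustrate what any UNNEST-like operator has to accomplish, here is a generic sketch: one output row per element of the nested array, so that ordinary SQL clauses can finally reach the scalars. This shows the general idea only, not N1QL's precise semantics.

    # Generic UNNEST-style flattening (illustrative; not N1QL's exact semantics).
    def unnest(docs, array_field):
        for doc in docs:
            for element in doc.get(array_field, []):
                row = {k: v for k, v in doc.items() if k != array_field}
                row[array_field] = element  # now one element, not an array
                yield row

    doc = {"name": "Alice", "orders": [{"id": 7}, {"id": 9}]}
    rows = list(unnest([doc], "orders"))
    # -> [{'name': 'Alice', 'orders': {'id': 7}},
    #     {'name': 'Alice', 'orders': {'id': 9}}]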

Categories: Other

Notes on privacy and surveillance, October 11, 2015

Sun, 2015-10-11 04:44

1. European Union data sovereignty laws have long had a “Safe Harbour” rule stating it was OK to ship data to the US. Per the case Maximilian Schrems v Data Protection Commissioner, this rule is now held to be invalid. Angst has ensued, and rightly so.

The core technical issues are roughly:

  • Data is usually in one logical database. Data may be replicated locally, for availability and performance. It may be replicated remotely, for availability, disaster recovery, and performance. But it’s still usually logically in one database.
  • Now remote geographic partitioning may be required by law. Some technologies (e.g. Cassandra) support that for a single logical database. Some don’t.
  • Even under best circumstances, hosting and administrative costs are likely to be higher when a database is split across more geographies (especially when the count is increased from 1 to 2).

Facebook’s estimate of billions of dollars in added costs is not easy to refute.

My next set of technical thoughts starts:

  • This is about data storage, not data use; for example, you can analyze Austrian data in the US, but you can’t store it there.
  • Of course, that can be a tricky distinction to draw. We can only hope that intermediate data stores, caches and so on can be allowed to use data from other geographies.
  • Assuming the law is generous in this regard, scan-heavy analytics are more problematic than other kinds.
  • But if there are any problems in those respects — well, if analytics can be parallelized in general, then in particular one should be able to parallelize across geographies. (Of course, this could require replicating one’s whole analytic stack across geographies.)
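
To sketch that last bullet: if each geography computes partial aggregates locally, and only those partials cross borders, then the raw data never leaves its region. Region names and fields below are made up; the pattern is ordinary two-phase aggregation.

    # Sketch: cross-geography average where only (sum, count) partials travel.
    def local_partial(rows):
        """Runs inside one region's data center; raw rows stay put."""
        values = [r["amount"] for r in rows]
        return sum(values), len(values)

    def global_average(partials):
        """Combines per-region (sum, count) pairs; can run anywhere."""
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return total / count if count else None

    eu = local_partial([{"amount": 10.0}, {"amount": 20.0}])
    us = local_partial([{"amount": 30.0}])
    print(global_average([eu, us]))  # 20.0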

2. US law enforcement is at loggerheads with major US tech companies, because it wants the right to subpoena data stored overseas. The central case here is a request to get at Microsoft’s customer data stored in Ireland. A government victory would be catastrophic for the US tech industry, but I’m hopeful that sense will — at least to some extent — prevail.

3. Ed Snowden, Glenn Greenwald and numerous other luminaries are pushing something called the Snowden Treaty, as a model for how privacy laws should be set up. I’m a huge fan of what Snowden and Greenwald have done in general, but this particular project has not started well. First, they’ve rolled the thing out while giving almost no details, so they haven’t really contributed anything except a bit of PR. Second, one of the few details they did provide contains a horrific error.

Specifically, they “demand”

freedom from damaging publicity, public scrutiny …

To that I can only say: “Have you guys lost your minds???????” As written, that’s a demand that can only be met by censorship laws. I’m sure this error is unintentional, because Greenwald is in fact a stunningly impassioned and articulate opponent of censorship. Even so, that’s an appallingly careless mistake, which for me casts the whole publicity campaign into serious doubt.

4. As a general rule — although the details of course depend upon where you live — it is no longer possible to move around and be confident that you won’t be tracked. This is true even if you’re not a specific target of surveillance. Ways of tracking your movements include but are not limited to:

  • Electronic records of you paying public transit fares or tolls, as relevant. (Ditto rental car fees, train or airplane tickets, etc.)
  • License plate cameras, which in the US already have billions of records on file.
  • Anything that may be inferred from your mobile phone.

5. The previous point illustrates that the strong form of the Snowden Treaty is a pipe dream — it calls for a prohibition on mass surveillance, and that will never happen, because:

  • Governments will insist on trying to prevent “terrorism” before the fact. That mass surveillance is generally lousy at doing so won’t keep them from trying.
  • Governments will insist on being able to do general criminal forensics after the fact. So they’ll want mass surveillance data sitting around just in case they find that they need it.
  • Businesses share consumers’ transaction and interaction data, and such sharing is central to the current structure of the internet industry. That genie isn’t going back into the bottle. Besides, if it did, a few large internet companies would have even more of an oligopolistic advantage vs. the others than they now do.

The huge problem with these truisms, of course, is scope creep. Once the data exists, it can be used for many more purposes than the few we’d all agree are actually OK.

6. That, in turn, leads me back to two privacy posts that I like to keep reminding people of, because they make points that aren’t commonly found elsewhere:

Whether or not you basically agree with me about privacy and surveillance, those two posts may help flesh out whatever your views on the subject actually are.

Categories: Other