Skip navigation.


Syndicate content
Choices in data management and analysis
Updated: 5 hours 19 min ago

Consumer data management

Mon, 2015-10-05 00:27

Don’t plan to fish in your personal data lake.

Perhaps the biggest mess in all of IT is the management of individual consumers’ data. Our electronic data is thoroughly scattered. Most individual portions are poorly managed. There’s no integration. The data that’s on paper is even worse. For example:

  • Do you have access to your medical records? Do you even know when you were last vaccinated for what?
  • Several enterprises have comprehensive records of all your credit card purchases, in easy-to-analyze form. Do you have such records too?
  • How easily can you find old emails? How about old paper correspondence?

For the most part, the technology community is barely trying to solve those problems. But even when it does try, success is mixed at best. For example:

And those are some of the most successful names.

There are numerous reasons for this dismal state of affairs. 

  • The problem is generically hard. There are many types of data, in both the narrow and broad senses of “type”. There are many use cases. There are many possible devices that would need to be supported. There even are a bunch of different regulatory implications.
  • Consumers aren’t going to organize data themselves. A solution that actually worked would need really great usability and automation.
  • Companies see their data about customers as an asset. They don’t want to share — even with the customers themselves.

The toughest problem, I think, is in my middle bullet point — people hate organizing their own information. That’s true, by the way, of consumers and individual employees alike. Canonical examples on the enterprise side include knowledge management, taxonomy building,* or getting salespeople to properly fill in sales force automation software forms. On the consumer side, personal computers were pitched in their very early days as a way to store recipes; how did that ever work out? Thus, the standard for usability for people to actually like personal data management technology is very high, and very difficult to meet.

*Well, canonical at least among text search geeks. :)

Despite all this negativity, I think there are two areas in which it is inevitable that consumers will wind up with access to well-organized online data stores — health and money. The first reason is simply perceived value. Health and money are both important, and people know it, and so those have always been the two areas in which consumers have willingly paid quite a bit for information and advice.

I happen to have picked up that truism in the 1990s, when I published a subscription newsletter, and the only categories in which consumer newsletters sold well were health and money. But if you don’t believe me, you could note:

  • Consumers pay a lot for money management/wealth management.
  • Consumers pay a lot to physicians who — surgeons and so on aside — strictly speaking aren’t providing anything except information and advice.

My more precise reasons for believing consumer financial data management will eventually be straightened out start:

  • Nothing is easier to organize than financial records.
  • Both your tax authority and credit card data resellers try to pull all your financial information together.

As for health care:

  • Integrated health records make lots of sense for your health care provider.
  • Ultimately, they won’t be able to deny you access to them.
  • Besides, they don’t really have much reason to deny such access.

But that’s most of the good news. Oh, I do think Apple will one of these decades come up with a decent way to manage what’s on your Apple devices. A few other point solutions will be similarly competent. But personal data lakes, or anything like that? I don’t think those are going to happen in any kind of foreseeable time frame.

Categories: Other

The potential significance of Cloudera Kudu

Mon, 2015-09-28 01:54

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Combined with Impala, Kudu is (among other things) an attempt to build a no-apologies analytic DBMS (DataBase Management System) into Hadoop. My reactions to that start:

  • It’s plausible; just not soon. What I mean by that is:
    • Success will, at best, be years away. Please keep that in mind as you read this otherwise optimistic post.
    • Nothing jumps out at me to say “This will never work!”
    • Unlike when it introduced Impala — or when I used to argue with Jeff Hammerbacher pre-Impala :) — this time Cloudera seems to have reasonable expectations as to how hard the project is.
  • There’s huge opportunity if it works.
    • The analytic RDBMS vendors are beatable. Teradata has a great track record of keeping its product state-of-the-art, but it likes high prices. Most other strong analytic RDBMS products were sold to (or originated by) behemoth companies that seem confused about how to proceed.
    • RDBMS-first analytic platforms didn’t do as well as I hoped. That leaves a big gap for Hadoop.

I’ll expand on that last point. Analytics is no longer just about fast queries on raw or simply-aggregated data. Data transformation is getting ever more complex — that’s true in general, and it’s specifically true in the case of transformations that need to happen in human real time. Predictive models now often get rescored on every click. Sometimes, they even get retrained at short intervals. And while data reduction in the sense of “event extraction from high-volume streams” isn’t that a big deal yet in commercial apps featuring machine-generated data — if growth trends continue as much of us expect, it’s only a matter of time before that changes.

Of course, this is all a bullish argument for Spark (or Flink, if I’m wrong to dismiss its chances as a Spark competitor). But it also all requires strong low-latency analytic data underpinnings, and I suspect that several kinds of data subsystem will prosper. I expect Kudu-supported Hadoop/Spark to be a strong contender for that role, along with the best of the old-school analytic RDBMS, Tachyon-supported Spark, one or more contenders from the Hana/MemSQL crowd (i.e., memory-centric RDBMS that purport to be good at analytics and transactions alike), and of course also whatever Cloudera’s strongest competitor(s) choose to back.

Categories: Other

Cloudera Kudu deep dive

Mon, 2015-09-28 01:52

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Let’s talk in more detail about how Kudu stores data.

  • As previously noted, inserts land in an in-memory row store, which is periodically flushed to the column store on disk. Queries are federated between these two stores. Vertica taught us to call these the WOS (Write-Optimized Store) and ROS (Read-Optimized Store) respectively, and I’ll use that terminology here.
  • Part of the ROS is actually another in-memory store, aka the DeltaMemStore, where updates and deletes land before being applied to the DiskRowSets. These stores are managed separately for each DiskRowSet. DeltaMemStores are checked at query time to confirm whether what’s in the persistent store is actually up to date.
  • A major design goal for Kudu is that compaction should never block – nor greatly slow — other work. In support of that:
    • Compaction is done, server-by-server, via a low-priority but otherwise always-on background process.
    • There is a configurable maximum to how big a compaction process can be — more precisely, the limit is to how much data the process can work on at once. The current default figure = 128 MB, which is 4X the size of a DiskRowSet.
    • When done, Kudu runs a little optimization to figure out which 128 MB to compact next.
  • Every tablet has its own write-ahead log.
    • This creates a practical limitation on the number of tablets …
    • … because each tablet is causing its own stream of writes to “disk” …
    • … but it’s only a limitation if your “disk” really is all spinning disk …
    • … because multiple simultaneous streams work great with solid-state memory.
  • Log retention is configurable, typically the greater of 5 minutes or 128 MB.
  • Metadata is cached in RAM. Therefore:
    • ALTER TABLE kinds of operations that can be done by metadata changes only — i.e. adding/dropping/renaming columns — can be instantaneous.
    • To keep from being screwed up by this, the WOS maintains a column that labels rows by which schema version they were created under. I immediately called this MSCC — Multi-Schema Concurrency Control :) — and Todd Lipcon agreed.
  • Durability, as usual, boils down to “Wait until a quorum has done the writes”, with a configurable option as to what constitutes a “write”.
    • Servers write to their respective write-ahead logs, then acknowledge having done so.
    • If it isn’t too much of a potential bottleneck — e.g. if persistence is on flash — the acknowledgements may wait until the log has been fsynced to persistent storage.
  • There’s a “thick” client library which, among other things, knows enough about the partitioning scheme to go straight to the correct node(s) on a cluster.

Leaving aside the ever-popular possibilities of:

  • Cluster-wide (or larger) equipment outages
  • Bugs

the main failure scenario for Kudu is:

  • The leader version of a tablet (within its replica) set goes down.
  • A new leader is elected.
  • The workload is such that the client didn’t notice and adapt to the error on its own.

Todd says that Kudu’s MTTR (Mean Time To Recovery) for write availability tests internally at 1-2 seconds in such cases, and shouldn’t really depend upon cluster size.

Beyond that, I had some difficulties understanding details of the Kudu write path(s). An email exchange ensued, and Todd kindly permitted me to post some of his own words (edited by me for clarification and format).

Every tablet has its own in-memory store for inserts (MemRowSet). From a read/write path perspective, every tablet is an entirely independent entity, with its own MemRowSet, rowsets, etc. Basically the flow is:

  • The client wants to make a write (i.e. an insert/update/delete), which has a primary key.
    • The client applies the partitioning algorithm to determine which tablet that key belongs in.
    • The information about which tablets cover which key ranges (or hash buckets) is held in the master. (But since it is cached by the clients, this is usually a local operation.)
    • It sends the operation to the “leader” replica of the correct tablet (batched along with any other writes that are targeted to the same tablet).
  • Once the write reaches the tablet leader:
    • The leader enqueues the write to its own WAL (Write-Ahead Log) and also enqueues it to be sent to the “follower” replicas.
    • Once it has reached a majority of the WALs (i.e. 2/3 when the replication factor = 3), the write is considered “replicated”. That is to say, it’s durable and would always be rolled forward, even if the leader crashed at this point.
    • Only now do we enter the “storage” part of the system, where we start worrying about MemRowSets vs DeltaMemStores, etc.

Put another way, there is a fairly clean architectural separation into three main subsystems:

  • Metadata and partitioning (map from a primary key to a tablet, figure out which servers host that tablet).
  • Consensus replication (given a write operation, ensure that it is durably logged and replicated to a majority of nodes, so that even if we crash, everyone will agree whether it should be applied or not).
  • Tablet storage (now that we’ve decided a write is agreed upon across replicas, actually apply it to the database storage).

These three areas of the code are separated as much as possible — for example, once we’re in the “tablet storage” code, it has no idea that there might be other tablets. Similarly, the replication and partitioning code don’t know much anything about MemRowSets, etc – that’s entirely within the tablet layer.

As for reading — the challenge isn’t in the actual retrieval of the data so much as in figuring out where to retrieve it from. What I mean by that is:

  • Data will always be either in memory or in a persistent column store. So I/O speed will rarely be a problem.
  • Rather, the challenge to Kudu’s data retrieval architecture is finding the relevant record(s) in the first place, which is slightly more complicated than in some other systems. For upon being told the requested primary key, Kudu still has to:
    • Find the correct tablet(s).
    • Find the record(s) on the (rather large) tablet(s).
    • Check various in-memory stores as well.

The “check in multiple places” problem doesn’t seem to be of much concern, because:

  • All that needs to be checked is the primary key column.
  • The on-disk data is front-ended by Bloom filters.
  • The cases in which a Bloom filter returns a false positive are generally the same busy ones where the key column is likely to be cached in RAM.
  • Cloudera just assumes that checking a few different stores in RAM isn’t going to be a major performance issue.

When it comes to searching the tablets themselves:

  • Kudu tablets feature data skipping among DiskRowSets, based on value ranges for the primary key.
  • The whole point of compaction is to make the data skipping effective.

Finally, Kudu pays a write-time (or compaction-time) cost to boost retrieval speeds from inside a particular DiskRowSet, by creating something that Todd called an “ordinal index” but agreed with me would be better called something like “ordinal offset” or “offset index”. Whatever it’s called, it’s an index that tells you the number of rows you would need to scan before getting the one you want, thus allowing you to retrieve (except for the cost of an index probe) at array speeds.

Categories: Other

Introduction to Cloudera Kudu

Mon, 2015-09-28 01:50

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.

*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes. :)

For starters:

  • Kudu is an alternative to HDFS (Hadoop Distributed File System), or to HBase.
  • Kudu is meant to be the underpinning for Impala, Spark and other analytic frameworks or engines.
  • Kudu is not meant for OLTP (OnLine Transaction Processing), at least in any foreseeable release. For example:
    • Kudu doesn’t support multi-row transactions.
    • There are no active efforts to front-end Kudu with an engine that is fast at single-row queries.
    • Kudu is rather columnar, except for transitory in-memory stores.
  • Kudu’s core design points are that it should:
    • Accept data very quickly.
    • Immediately make that data available for analytics.
  • More specifically, Kudu is meant to accept, along with slower forms of input:
    • Lots of fast random writes, e.g. of web interactions.
    • Streams, viewed as a succession of inserts.
    • Updates and inserts alike.
  • The core “real-time” use cases for which Kudu is designed are, unsurprisingly:
    • Low-latency business intelligence.
    • Predictive model scoring.
  • Kudu is designed to work fine with spinning disk, and indeed has been tested to date mainly on disk-only nodes. Even so, Kudu’s architecture is optimized for the assumption that there will be at least some flash on the node.
  • Kudu is designed primarily to support relational/SQL processing. However, Kudu also has a nested-data roadmap, which of course starts with supporting the analogous capabilities in Impala.

Also, it might help clarify Kudu’s status and positioning if I add:

  • Kudu is in its early days — heading out to open source and beta now, with maturity still quite a way off. Many obviously important features haven’t been added yet.
  • Kudu is expected to be run with a replication factor (tunable, usually =3). Replication is via the Raft protocol.
  • Kudu and HDFS can run on the same nodes. If they do, they are almost entirely separate from each other, with the main exception being some primitive workload management to help them share resources.
  • Permanent advantages of older alternatives over Kudu are expected to include:
    • Legacy. Older, tuned systems may work better over some HDFS formats than over Kudu.
    • Pure batch updates. Preparing data for immediate access has overhead.
    • Ultra-high update volumes. Kudu doesn’t have a roadmap to completely catch up in write speeds with NoSQL or in-memory SQL DBMS.

Kudu’s data organization story starts:

  • Storage is right on the server (this is of course also the usual case for HDFS).
  • On any one server, Kudu data is broken up into a number of “tablets”, typically 10-100 tablets per node.
  • Inserts arrive into something called a MemRowSet and are soon flushed to something called a DiskRowSet. Much as in Vertica:
    • MemRowSets are managed by an in-memory row store.
    • DiskRowSets are managed by a persistent column store.*
    • In essence, queries are internally federated between the in-memory and persistent stores.
  • Each DiskRowSet contains a separate file for each column in the table.
  • DiskRowSets are tunable in size. 32 MB currently seems like the optimal figure.
  • Page size default is 256K, but can be dropped as low as 4K.
  • DiskRowSets feature columnar compression, with a variety of standard techniques.
    • All compression choices are specific to a particular DiskRowSet.
    • So, in the case of dictionary/token compression, is the dictionary.
    • Thus, data is decompressed before being operated on by a query processor.
    • Also, selected columns or an entire DiskRowSet can be block-compressed.
  • Tables and DiskRowSets do not expose any kind of RowID. Rather, tables have primary keys in the usual RDBMS way.
  • Kudu can partition data in the three usual ways: randomly, by range or by hash.
  • Kudu does not (yet) have a slick and well-tested way to broadcast-replicated a small table across all nodes.

*I presume there are a few ways in which Kudu’s efficiency or overhead seem more row-store-like than columnar. Still, Kudu seems to meet the basic requirements to be called a columnar system.

Categories: Other

Rocana’s world

Thu, 2015-09-17 05:49

For starters:

  • My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
  • Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
  • … cofounder Eric Sammer.
  • Rocana recently told me it had 35 people.
  • Rocana has a very small number of quite large customers.

Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:

  • Actual operations — figuring out exactly what isn’t working, ASAP.
  • Security.

Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:

  • The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
  • Firewall output.
  • Database server logs.
  • Point-of-sale data (at a retailer).
  • “Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)

In line with segment leader Splunk’s pricing, data volumes in this area tend to be described in terms of new data/day. Rocana seems to start around 3 TB/day, which not coincidentally is a range that would generally be thought of as:

  • Challenging for Splunk, and for the budgets of Splunk customers.
  • Not a big problem for well-implemented Hadoop.

And so part of Rocana’s pitch, familiar to followers of analytic RDBMS and Hadoop alike, is “We keep and use all your data, unlike the legacy guys who make you throw some of it away up front.”

Since Rocana wants you to keep all your data, 3 TB/day is about 1 PB/year.

But really, that’s just saying that Rocana is an analytic stack built on Hadoop, using Hadoop for what people correctly think it’s well-suited for, done by guys who know a lot about Hadoop.

The cooler side of Rocana, to my tastes, is the actual analytics. Truth be told, I find almost any well thought out event-series analytics story cool. It’s an area much less mature than relational business intelligence, and accordingly with much more scope for innovation. On the visualization side, crucial aspects start:

  • Charting over time (duh).
  • Comparing widely disparate time intervals (e.g., current vs. historical/baseline).
  • Whichever good features from relational BI apply to your use case as well.

Other important elements may be more data- or application-specific — and the fact that I don’t have a long list of particulars illustrates just how immature the area really is.

Even cooler is Rocana’s integration of predictive modeling and BI, about which I previously remarked:

The idea goes something like this:

  • Suppose we have lots of logs about lots of things. Machine learning can help:
    • Notice what’s an anomaly.
    • Group together things that seem to be experiencing similar anomalies.
  • That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me.

So far as I can tell, predictive modeling is used to notice aberrant data (raw or derived). This is quickly used to define a subset of data to drill down to (e.g., certain kinds of information from certain machines in a certain period of time). Event-series BI/visualization then lets you see the flows that led to the aberrant result, which was any luck will allow you to find the exact place where the data first goes wrong. And that, one hopes, is something that the ops guys can quickly fix.

I think similar approaches could make sense in numerous application segments.

Related links

Categories: Other

DataStax and Cassandra update

Mon, 2015-09-14 00:02

MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at a user and vendor both. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.

It seems fair to say that in most cases:

  • Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
  • Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.

Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:

  • DataStax trumpets British Gas‘ plans collecting a lot of sensor data and immediately offering it up for analysis.*
  • Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
  • A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.

*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data. :)

While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind:

  • You can get any kind of data into them very fast; indeed, that’s a central part of what they were designed for.
  • In the general case, getting it back out for low-latency analytics is problematic …
  • … but there’s an increasing list of exceptions.

For DataStax Enterprise, exceptions start:

  • Formally, you can do almost anything in at least one of Solr or Spark/SparkSQL. So if volumes are low enough, you’re fine. In particular, Spark offers the potential to do many things at in-memory speeds.
  • Between Spark, the new functions, and general scripting, there are several ways to do low-latency aggregations. This can lead to “twinkling dashboards”.*
  • DataStax is alert to the need to stream data into Cassandra.
    • That’s central to the NoSQL expectation of ingesting internet data very quickly.
    • Kafka, Storm and Spark Streaming all seem to be in the mix.
  • Solr over Cassandra has a searchable RAM buffer, which can give the effect of real-time text indexing within a second or so of ingest.

*As much as I love the “twinkling dashboard” term — it reminds me of my stock analyst days — it does raise some concerns. In many use cases, human real-time BI should be closely integrated with the more historical kind.

DataStax Enterprise:

  • Is based on Cassandra 2.1.
  • Will probably never include Cassandra 2.2, waiting instead for …
  • ….Cassandra 3.0, which will feature a storage engine rewrite …
  • … and will surely include Cassandra 2.2 features of note.

This connects to what I said previously in that Cassandra 2.2 adds some analytic features, specifically in the area of user-defined functions. Notes on Cassandra 2.2 UDFs include:

  • These are functions — not libraries, a programming language, or anything like that.
  • The “user-defined” moniker notwithstanding, the capability has been used to implement COUNT, SUM, AVG, MAX and so on.
  • You are meant to run user-defined functions on data in a single Cassandra partition; run them across partitions at your own performance risk.

And finally, some general tidbits:

  • A while ago, Apple said it had >75,000 Cassandra nodes. The figure is surely bigger now.
  • There are at least several other petabyte range Cassandra installations, and several more half-petabyte ones.
  • Netflix is not one of those. Instead, it has many 10s of smaller Cassandra clusters.
  • There are Cassandra users with >1 million reads+writes per second.

Finally a couple of random notes:

  • One of the text search use cases for Solr/Cassandra is to — in one query — get at information that originated in multiple places, e.g. for reasons of time period or geography. (I hear this about text search across lots of database technologies, relational and non-relational alike.)
  • As big a change as Cassandra 3.0 will be, it will not require that you take down your applications for an upgrade. That hasn’t been necessary since Cassandra 0.7.
Categories: Other

MongoDB update

Thu, 2015-09-10 04:33

One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:

  • >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
  • ~75,000 users of MongoDB Cloud Manager.
  • Estimated ~1/4 million production users of MongoDB total.

Also >530 staff, and I think that number is a little out of date.

MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:

  • Some JOIN capabilities.
    • Specifically, these are left outer joins, so they’re for lookup but not for filtering.
    • JOINs are not restricted to specific shards of data …
    • … but do benefit from data co-location when it occurs.
  • A BI connector. Think of this as a MongoDB-to- SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
    • Basic SQL comes in.
    • Filters and GroupBys are pushed down to MongoDB. A result set … well, it results. :)
    • The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
  • Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
    • This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
    • MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
    • MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
    • MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.

There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. 

  • The name will change, in part because if you try to search on that name you’ll probably find an unrelated Scout. :)
  • Scout samples data, runs stats, and all that stuff.
  • Scout is referred to as a “schema introspection” tool, but I’m not sure why; schema introspection sounds more like a feature or architectural necessity than an actual product.

As for storage engines:

  • WiredTiger, which was the biggest deal in MongoDB 3.0, will become the default in 3.2. I continue to think analogies to InnoDB are reasonably appropriate.
  • An in-memory storage engine option was also announced with MongoDB 3.0. Now there’s a totally different in-memory option. However, details were not available at posting time. Stay tuned.
  • Yet another MongoDB storage engine, based on or akin to WiredTiger, will do encryption. Presumably, overhead will be acceptably low. Key management and all that will be handled by usual-suspect third parties.

Finally — most data management vendors brag to me about how important their text search option is, although I’m not necessarily persuaded. :) MongoDB does have built-in text search, of course, of which I can say:

  • It’s a good old-fashioned TF/IDF algorithm. (Text Frequency/Inverse Document Frequency.)
  • About the fanciest stuff they do is tokenization and stemming. (In a text search context, tokenization amounts to the identification of word boundaries and the like. Stemming is noticing that alternate forms of the same word really are the same thing.)

This level of technology was easy to get in the 1990s. One thing that’s changed in the intervening decades, however, is that text search commonly supports more languages. MongoDB offers stemming in 8 or 9 languages for free, plus a paid option via Basis for other languages yet.

Related links

Categories: Other

Multi-model database managers

Mon, 2015-08-24 02:07

I’d say:

  • Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
  • Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
  • Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that Intersystems Cache’ was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMS date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, when I argued:

Developments since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

  • Operational applications have always needed to accept immediate writes. (Losing data is bad.)
  • Operational applications have always needed to serve small query result sets based on the freshest data.(If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
  • It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
  • The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
  • … most such analysis should look at historical data as well.
  • Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.

Related links

  • The two-policemen joke seems ever more relevant.
  • My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
  • Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.
Categories: Other

Data messes

Mon, 2015-08-03 03:58

A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.

To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — :) — mine.)

Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:

  • Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
  • Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)

Inconsistency can take multiple forms, including: 

  • Variant names.
  • Variant spellings.
  • Variant data structures (not to mention datatypes, formats, etc.).

Addressing the first two is the province of master data management (MDM), and also of the same data cleaning technologies that might help with outright errors. Addressing the third is the province of other data integration technology, which also may be what’s needed to break down the barriers between data silos.

So far I’ve been assuming that data is neatly arranged in fields in some kind of database. But suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be present.

All of the above issues occur with analytic data too. In some cases it probably makes sense not to fix them until the data is shipped over for analysis. In other cases, it should be fixed earlier, but isn’t. And in hybrid cases, data is explicitly shipped to an operational data warehouse where the problems are presumably fixed.

Further, some problems are much greater in their analytic guise. Harmonization and integration among data silos are likely to be much more intense. (What is one table for analytic purposes might be many different ones operationally, for reasons that might span geography, time period, or application legacy.) Addressing those issues is the province of data integration technologies old and new. Also, data transformation and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those needs.

Let’s now consider missing data. In operational cases, there are three main kinds of missing data:

  • Missing values, as a special case of inaccuracy.
  • Data that was only collected over certain time periods, as a special case of changing data structure.
  • Data that hasn’t been derived yet, as the main case of a need for data enhancement.

All of those cases can ripple through to cause analytic headaches. But for certain inherently analytic data sets — e.g. a weblog or similar stream — the problem can be even worse. The data source might stop functioning, or might change the format in which it transmits; but with no immediate operations compromised, it might take a while to even notice. I don’t know of any technology that does a good, simple job of addressing these problems, but I am advising one startup that plans to try.

Further analytics-mainly data messes can be found in three broad areas:

  • Problems caused by new or changing data sources hit much faster in analytics than in operations, because analytics draws on a greater variety of data.
  • Event recognition, in which most of a super-high-volume stream is discarded while the “good stuff” is kept, is more commonly a problem in analytics than in pure operations. (That said, it may arise on the boundary of operations and analytics, namely in “real-time” monitoring.
  • Analytics has major problems with data scavenger hunts, in which business analysts and data scientists don’t know what data is available for them to examine.

That last area is the domain of a lot of analytics innovation. In particular:

  • It’s central to the dubious Gartner concept of a Logical Data Warehouse, and to the more modest logical data layers I advocate as alternative.
  • It’s been part of BI since the introduction of Business Objects’ “semantic layer”. (See, for example, my recent post on Zoomdata.)
  • It’s a big part of the story of startups such as Alation or Tamr.
  • In a failed effort, it was part of Greenplum’s pitch some years back, as an aspect of the “enterprise data cloud”.
  • It led to some of the earliest differentiated features at Gooddata.
  • It’s implicit in the some BI collaboration stories, in some BI/search integration, and in ClearStory’s “Data You May Like”.

Finally, suppose we return to the case of operational data, assumed to be accurately stored in fielded databases, with sufficient data integration technologies in place. There’s still a whole other kind of possible mess than those I cited above — applications may not be doing a good job of understanding and using it. I could write a whole series of posts on that subject alone … but it’s going slowly. :) So I’ll leave that subject area for another time.

Categories: Other

SaaS and traditional software from the same vendor?

Mon, 2015-07-20 03:09

It is extremely difficult to succeed with SaaS (Software as a Service) and packaged software in the same company. There were a few vendors who seemed to pull it off in the 1970s and 1980s, generally industry-specific application suite vendors. But it’s hard to think of more recent examples — unless you have more confidence than I do in what behemoth software vendors say about their SaaS/”cloud” businesses.

Despite the cautionary evidence, I’m going to argue that SaaS and software can and often should be combined. The “should” part is pretty obvious, with reasons that start:

  • Some customers are clearly better off with SaaS. (E.g., for simplicity.)
  • Some customers are clearly better off with on-premises software. (E.g., to protect data privacy.)
  • On-premises customers want to know they have a path to the cloud.
  • Off-premises customers want the possibility of leaving their SaaS vendor’s servers.
  • SaaS can be great for testing, learning or otherwise adopting software that will eventually be operated in-house.
  • Marketing and sales efforts for SaaS and packaged versions can be synergistic.
    • The basic value proposition, competitive differentiation, etc. should be the same, irrespective of delivery details.
    • In some cases, SaaS can be the lower cost/lower commitment option, while packaged product can be the high end or upsell.
    • An ideal sales force has both inside/low-end and bag-carrying/high-end components.

But the “how” of combining SaaS and traditional software is harder. Let’s review why. 

Why it is hard for one vendor to succeed at both packaged software and SaaS?

SaaS and packaged software have quite different development priorities and processes. SaaS vendors deliver and support software that:

  • Runs on a single technology stack.
  • Is run only at one or a small number of physical locations.
  • Is run only in one or a small number of historical versions.
  • May be upgraded multiple times per month.
  • Can be assumed to be operated by employees of the SaaS company.
  • Needs, for customer acquisition and retention reasons, to be very easy for users to learn.

But traditional packaged software:

  • Runs on technology the customer provides and supports, at the location of the customer’s choice.
  • Runs in whichever versions customers have not yet upgraded from.
  • Should — to preserve the sanity of all concerned — have only have a few releases per year.
  • Is likely to be operated by less knowledgeable or focused staff than a SaaS vendor enjoys.
  • Can sometimes afford more of an end-user learning curve than SaaS.

Thus, in most cases:

  • Traditional software creates greater support and compatibility burdens than SaaS does.
  • SaaS and on-premises software have very different release cycles.
  • SaaS should be easier for end-users than most traditional software, but …
  • … traditional software should be easier to administer than SaaS.

Further — although this is one difference that I think has at times been overemphasized — SaaS vendors would prefer to operate truly multi-tenant versions of their software, while enterprises less often have that need.

How this hard thing could be done

Most of the major problems with combining SaaS and packaged software efforts can be summarized in two words — defocused development. Even if the features are substantially identical, SaaS is developed on different schedules and for different platform stacks than packaged software is.

So can we design an approach to minimize that problem? I think yes. In simplest terms, I suggest:

  • A main development organization focused almost purely on SaaS.
  • A separate unit adapting the SaaS code for on-premises customers, with changes to the SaaS offering being concentrated in three aspects:
    • Release cadence.
    • Platform support.
    • Administration features, which are returned to the SaaS group for its own optional use.

Certain restrictions would need to be placed on the main development unit. Above all, because the SaaS version will be continually “thrown over the wall” to the sibling packaged-product group, code must be modular and documentation must be useful. The standard excuses — valid or otherwise — for compromising on these virtues cannot be tolerated.

There is one other potentially annoying gotcha. Hopefully, the SaaS group uses third-party products and lots of them; that’s commonly better than reinventing the wheel. But in this plan they need to use ones that are also available for third-party/OEM kinds of licensing.

My thoughts on release cadence start:

  • There should be a simple, predictable release cycle:
    • N releases per year, for N approximately = 4.
    • Strong efforts to adhere to a predictable release schedule.
  • A reasonable expectation is that what’s shipped and supported for on-premises use is 6-9 months behind what’s running on the SaaS service. 3-6 months would be harder to achieve.

The effect would be that on-premises software would lag SaaS features to a predictable and bounded extent.

As for platform support:

  • You have to stand ready to install and support whatever is needed. (E.g., in the conversation that triggered this post, the list started with Hadoop, Spark, and Tachyon.)
  • You have to adapt to customers’ own reasonably-current installations of needed components (but help them upgrade if they’re way out of date).
  • Writing connectors is OK. Outright porting from your main stack to another may be unwise.
  • Yes, this is all likely to involve significant professional services, at least to start with, because different customers will require different degrees of adaptation.

That last point is key. The primary SaaS offering can be standard, in the usual way. But the secondary business — on-premises software — is inherently services-heavy. Fortunately, packaged software and professional services can be successfully combined.

And with that I’ll just stop and reiterate my conclusion:

It may be advisable to offer both SaaS and services-heavy packaged software as two options for substantially the same product line.

Related link

  • Point #4 of my VC overlord post is relevant  — and Point #3 even more so. :)
Categories: Other

Zoomdata and the Vs

Tue, 2015-07-07 17:23

Let’s start with some terminology biases:

So when my clients at Zoomdata told me that they’re in the business of providing “the fastest visual analytics for big data”, I understood their choice, but rolled my eyes anyway. And then I immediately started to check how their strategy actually plays against the “big data” Vs.

It turns out that:

  • Zoomdata does its processing server-side, which allows for load-balancing and scale-out. Scale-out and claims of great query speed are relevant when data is of high volume.
  • Zoomdata depends heavily on Spark.
  • Zoomdata’s UI assumes data can be a mix of historical and streaming, and that if looking at streaming data you might want to also check history. This addresses velocity.
  • Zoomdata assumes data can be in a variety of data stores, including:
    • Relational (operational RDBMS, analytic RDBMS, or SQL-on-Hadoop).
    • Files (generic HDFS — Hadoop Distributed File System or S3).*
    • NoSQL (MongoDB and HBase were mentioned).
    • Search (Elasticsearch was mentioned among others).
  • Zoomdata also tries to detect data variability.
  • Zoomdata is OEM/embedding-friendly.

*The HDFS/S3 aspect seems to be a major part of Zoomdata’s current story.

Core aspects of Zoomdata’s technical strategy include: 

  • QlikView/Tableau-style navigation, at least up to a point. (I hope that vendors with a much longer track record have more nuances in their UIs.)
  • Suitable UI for wholly or partially “real-time” data. In particular:
    • Time is an easy dimension to get along the X-axis.
    • You can select current or historical regions from the same graph, aka “data rewind”.
  • Federated query with some predicate pushdown, aka “data fusion”.
    • Data filtering and some GroupBys are pushed down to the underlying data stores — SQL or NoSQL — when it makes sense.*
    • Pushing down joins (assuming that both sides of the join are from the same data store) is a roadmap item.
  • Approximate query results, aka “data sharpening”. Zoomdata simulates high-speed query by first serving you approximate query results, ala Datameer.
  • Spark to finish up queries. Anything that isn’t pushed down to the underlying data store is probably happening in Spark DataFrames.
  • Spark for other kinds of calculations.

*Apparently it doesn’t make sense in some major operational/general-purpose — as opposed to analytic — RDBMS. From those systems, Zoomdata may actually extract and pre-cube data.

The technology story for “data sharpening” starts:

  • Zoomdata more-or-less samples the underlying data, and returns a result just for the sample. Since this is a small query, it resolves quickly.
  • More precisely, there’s a sequence of approximations, with results based on ever larger samples, until eventually the whole query is answered.
  • Zoomdata has a couple of roadmap items for making these approximations more accurate:
    • The integration of BlinkDB with Spark will hopefully result in actual error bars for the approximations.
    • Zoomdata is working itself on how to avoid sample skew.

The point of data sharpening, besides simply giving immediate gratification, is that hopefully the results for even a small sample will be enough for the user to determine:

  • Where in particular she wants to drill down.
  • Whether she asked the right query in the first place. :)

I like this early drilldown story for a couple of reasons:

  • I think it matches the way a lot of people work. First you get to the query of the right general structure; then you refine the parameters.
  • It’s good for exact-results performance too. Most of what otherwise might have been a long-running query may not need to happen at all.

Aka “Honey, I shrunk the query!”

Zoomdata’s query execution strategy depends heavily on doing lots of “micro-queries” and unioning their result sets. In particular:

  • Data sharpening relies on a bunch of data-subset queries of increasing size.
  • Streaming/”real-time” BI is built from a bunch of sub-queries restricted to small time slices each.

Even for not-so-micro queries, Zoomdata may find itself doing a lot of unioning, as data from different time periods may be in different stores.

Architectural choices in support of all this include:

  • Zoomdata ships with Spark, but can and probably in most cases should be pointed at an external Spark cluster instead. One point is that Zoomdata itself scales by user count, while the Spark cluster scales by data volume.
  • Zoomdata uses MongoDB off to the side as a metadata store. Except for what’s in that store, Zoomdata seems to be able to load balance rather statelessly. And Zoomdata doesn’t think that the MongoDB store is a bottleneck either.
  • Zoomdata uses Docker.
  • Zoomdata is starting to use Mesos.

When a young company has good ideas, it’s natural to wonder how established or mature this all is. Well:

  • Zoomdata has 86 employees.
  • Zoomdata has (production) customers, success stories, and so on, but can’t yet talk fluently about many production use cases.
  • If we recall that companies don’t always get to do (all) their own positioning, it’s fair to say that Zoomdata started out as “Cloudera’s cheap-option BI buddy”, but I don’t think that’s an accurate characterization as this point.
  • Zoomdata, like almost all young companies in the history of BI, favors a “land-and-expand” adoption strategy. Indeed …
  • … Zoomdata tells prospects it wants to be an additional BI provider to them, rather than rip-and-replacement.

As for technological maturity:

  • Zoomdata’s view of data seems essentially tabular, notwithstanding its facility with streams and NoSQL. It doesn’t seem to have tackled much in the way of event series analytics yet.
  • One of Zoomdata’s success stories is iPad-centric. (Salesperson visits prospect and shows her an informative chart; prospect opens wallet; ka-ching.) So I presume mobile BI is working.
  • Zoomdata is comfortable handling 10s of millions of rows of data, may be strained when handling 100s of millions of rows, and has been tested in-house up to 1 billion rows. But that’s data that lands in Spark. The underlying data being filtered can be much larger, and Zoomdata indeed cites one example of a >40 TB Impala database.
  • When I asked about concurrency, Zoomdata told me of in-house testing, not actual production users.
  • Zoomdata’s list when asked what they don’t do (except through partners, of which they have a bunch) was:
    • Data wrangling.
    • ETL (Extract/Transform/Load).
    • Data transformation. (In a market segment with a lot of Hadoop and Spark, that’s not really redundant with the previous bullet point.)
    • Data cataloguing, ala Alation or Tamr.
    • Machine learning.

Related link

  • I wrote about multiple kinds of approximate query result capabilities, Zoomdata-like or otherwise, back in July, 2012.
Categories: Other