Choices in data management and analysis

The questionably named Cloudera Navigator Optimizer

Thu, 2015-11-19 05:55

I only have mixed success at getting my clients to reach out to me for messaging advice when they’re introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I hate the name. I hate some of the slides I saw. But I do like one part of the messaging, namely the statement that this is about “refactoring” queries.

All messaging quibbles aside, I think the Cloudera Navigator Optimizer story is actually pretty interesting, and perhaps not just to users of SQL-on-Hadoop technologies such as Hive (which I guess I’d put in that category for simplicity) or Impala. As I understand Cloudera Navigator Optimizer:

  • It’s all about analytic SQL queries.
  • Specifically, it’s about reducing duplicated work.
  • It is not an “optimizer” in the ordinary RDBMS sense of the word.
  • It’s delivered via SaaS (Software as a Service).
  • Conceptually, it’s not really tied to SQL-on-Hadoop. However, …
  • … in practice it likely will be used by customers who want to optimize performance of Cloudera’s preferred styles of SQL-on-Hadoop, either because they’re already using SQL-on-Hadoop or in connection with an initial migration.

It grows out of technology that Cloudera got in an acquisition, which started with the intention of being a general workload optimizer for Hadoop and wound up with this beta announcement of a tuning adviser for analytic SQL.

Right now, the Cloudera Navigator Optimizer service is:

  • Query code in.
  • Information and advice out.

Naturally, Cloudera’s intention — perhaps as early as at first general availability — is for the output to start including something that’s more like automation, e.g. hints for the Impala optimizer.

As Anupam Singh describes it, there are basically four kinds of problems that Cloudera Navigator Optimizer can help with:

  • ETL (Extract/Transform/Load) might repeat the same operation over and over again, e.g. joining to a reference table to help with data cleaning. It can be an optimization to consolidate some of that work. (The same would surely also be true in cases where the workload is more properly described as ELT.)
  • For business intelligence it is often helpful to materialize aggregates or result sets. (This is, of course, why materialized views were invented in the first place.)
  • Queries-from-hell — perhaps thousands of lines of SQL long — can perhaps be usefully rewritten into a sequence of much shorter queries.
  • Ad-hoc query workloads can have enough repetition that there’s opportunity for similar optimizations. Anupam thinks his technology has enough intelligence to detect some of these patterns.

Actually, all four of these cases can involve materializing tables so that they don’t need to keep being in part or whole recreated.

In essence, then, this is a way to add in more query pipelining than the underlying data store automagically provides on its own. And that seems to me like a very good idea to try. The whole thing might be worth trying out at least once, even if your analytic RDBMS installation has nothing to do with Hadoop at all.
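
To make the "reducing duplicated work" idea a little more concrete, here is a minimal Python sketch. It is not Cloudera's method; it simply assumes each query in a workload has already been reduced to the set of tables it joins (the table names and the `min_repeats` threshold are hypothetical), and it flags joins that recur often enough to be worth materializing once.

```python
from collections import Counter
from itertools import combinations

# Hypothetical workload: each query is reduced to the set of tables it joins.
# A real tool would have to parse SQL; this sketch assumes that is already done.
WORKLOAD = [
    {"orders", "customers_ref"},
    {"orders", "customers_ref", "regions"},
    {"clicks", "customers_ref"},
    {"orders", "customers_ref"},
]


def materialization_candidates(workload, min_repeats=3):
    """Return table pairs that are joined in at least `min_repeats` queries."""
    pair_counts = Counter()
    for tables in workload:
        # Count each unordered pair of tables once per query.
        for pair in combinations(sorted(tables), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_repeats}


if __name__ == "__main__":
    for (left, right), n in materialization_candidates(WORKLOAD).items():
        print(f"{left} JOIN {right} appears in {n} queries")
```

In this toy workload only the join of `orders` to `customers_ref` recurs three times, so it is the one candidate for being computed once and reused.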

Categories: Other

CDH 5.5

Thu, 2015-11-19 05:52

I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:

  • Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
  • Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
    • The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
    • From a feature standpoint, we’re definitely still in the early days.
      • When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
      • Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
    • This is for Parquet first, Avro next, and presumably eventually native JSON as well.
    • This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
  • Cloudera is increasing its coverage of Spark in several ways.
    • Cloudera is adding support for MLlib.
    • Cloudera is adding support for SparkSQL. More on that below.
    • Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
      • More “platform” stuff from the Hadoop stack (e.g. for data ingest).
      • Less in the way of specific Spark usability stuff.
  • Cloudera is putting into beta what it got in an acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
  • Impala and Hive are getting column-level security via Apache Sentry.
  • There are other security enhancements.
  • Some policy-based information lifecycle management is being added as well.
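
As a rough illustration of the "dot" notation idea in the nested-data bullets above, here is a small Python sketch. It is plain Python over dictionaries, not Impala syntax, and the record layout and field names are hypothetical. The point is only that a dotted path picks one specific "column" out of a nested structure.

```python
def get_path(record, dotted_path):
    """Follow a dotted path such as 'customer.address.city' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # Missing fields behave like SQL NULLs in this sketch.
        value = value[key]
    return value


def project(records, dotted_paths):
    """Turn nested records into flat rows holding only the requested columns."""
    return [{path: get_path(rec, path) for path in dotted_paths} for rec in records]


# Hypothetical nested records.
ORDERS = [
    {"id": 1, "customer": {"name": "Ann", "address": {"city": "Oslo"}}},
    {"id": 2, "customer": {"name": "Bob"}},
]

if __name__ == "__main__":
    for row in project(ORDERS, ["id", "customer.address.city"]):
        print(row)
```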

While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of:

  • Petabyte scale databases — at least one clear case for Impala/business intelligence only, and the likelihood that the Impala/BI part of other bigger installations was also in that range.
  • Hundreds of nodes.
  • 10s of simultaneous queries in dashboard use cases.
  • 1 – 3 million queries/month as a common figure.

Cloudera also expressed the opinions that:

  • An “overwhelming majority” of Cloudera customers have adopted Impala. (I imagine there’s a bit of hyperbole in that — for one thing, Cloudera has a pricing option in which Impala is not included.)
  • It is common for Impala customers to use Hive for “data preparation”.
  • SparkSQL has “order of magnitude” less performance than Impala, but a little more performance than Hive running over either Spark or Tez.
  • SparkSQL’s main use cases are (and these overlap heavily):
    • As part of an analytic process (as opposed to straightforwardly DBMS-like use).
    • To persist data outside the confines of a single Spark job.



Issues in enterprise application software

Wed, 2015-11-11 07:39

1. I think the next decade or so will see much more change in enterprise applications than the last one. Why? Because the unresolved issues are piling up, and something has to give. I intend this post to be a starting point for a lot of interesting discussions ahead.

2. The more technical issues I’m thinking of include:

  • How will app vendors handle analytics?
  • How will app vendors handle machine-generated data?
  • How will app vendors handle dynamic schemas?
  • How far will app vendors get with social features?
  • What kind of underlying technology stacks will app vendors drag along?

We also always have the usual set of enterprise app business issues, including:

  • Will the current leaders — SAP, Oracle and whoever else you want to include — continue to dominate the large-enterprise application market?
  • Will the leaders in the large-enterprise market succeed in selling to smaller markets?
  • Which new categories of application will be important?
  • Which kinds of vendors and distribution channels will succeed in serving small enterprises?

And perhaps the biggest issue of all, intertwined with most of the others, is:

  • How will the move to SaaS (Software as a Service) play out?

3. I’m not ready to answer those questions yet, but at least I’ve been laying some groundwork.

Along with this post, I’m putting up a three-post series on the history of enterprise apps. Takeaways include but are not limited to:

  • Application software is a very diverse area. Different generalities apply to different parts of it.
  • A considerable fraction of application software has always been sold with the technology stack being under vendor control. Examples include most app software sold to small and medium enterprises, and much of the application software that Oracle sells.
  • Apps that are essentially distributed have often relied on different stacks than single-site apps. (Duh.)

4. Reasons I see for the enterprise apps area having been a bit dull in recent years include:

5. But I did do some work in the area even so. :) Besides posts linked above, other things I wrote relevant to the present discussion include:



Differentiation in business intelligence

Mon, 2015-10-26 13:34

Parts of the business intelligence differentiation story resemble the one I just posted for data management. After all:

  • Both kinds of products query and aggregate data.
  • Both are offered by big “enterprise standard” behemoth companies and also by younger, nimbler specialists.
  • You really, really, really don’t want your customer data to leak via a security breach in either kind of product.

That said, insofar as BI’s competitive issues resemble those of DBMS, they are those of DBMS-lite. For example:

  • BI is less mission-critical than some other database uses.
  • BI has done a lot less than DBMS to deal with multi-structured data.
  • Scalability demands on BI are less than those on DBMS — indeed, they’re the ones that are left over after the DBMS has done its data crunching first.

And full-stack analytic systems — perhaps delivered via SaaS (Software as a Service) — can moot the BI/data management distinction anyway.

Of course, there are major differences between how DBMS and BI are differentiated. The biggest are in user experience. I’d say:

  • For many people, BI is the user experience over the underlying data store(s).
  • Two crucial aspects of user experience are navigational power and speed of response.
    • At one extreme, people hated the old green paper reports.
    • At the other, BI in the QlikView/Tableau era is one of the few kinds of enterprise software that competes on the basis of being
    • This is also somewhat true with respect to snazzy BI demos, such as interactive maps or way-before-their-day touch screens.*
  • Features like collaboration and mobile UIs also matter.
  • Since BI is commonly adopted via quick departmental projects — at least as the hoped-for first-step of a “land-and-expand” campaign — administrative usability is at a premium as well.

* Computer Pictures and thus Cullinet used a touch screen over 30 years ago. Great demo, but not so useful as an actual product, due to the limitations on data structure.

Where things get tricky is in my category of accuracy. In the early 2000s, I pitched and wrote a white paper arguing that BI helps bring “integrity” to an enterprise in various ways. But I don’t think BI vendors have done a good job of living up to that promise.

  • They’ve moved slowly in accuracy-intensive areas such as alerting or predictive modeling.
  • “Single source of truth” and similar protestations turned out to be much oversold.

Indeed, it’s tempting to say that business intelligence has been much too stupid. :) I really like some attempts to make BI sharper, e.g. at Rocana or ClearStory, but it remains to be seen whether many customers care about their business intelligence actually being smart.

So how does all this fit into my differentiation taxonomy/framework? Referring liberally to what has already been written above, we get:

  • Scope:
    • For traditional tabular analysis, BI products compete on a bunch of UI features.
    • Non-tabular analysis is much more primitive. Event series interfaces may be the closest thing to an exception.
    • Collaboration is in the mix as well.
  • Accuracy: I discussed this one above.
  • Other trustworthiness:
    • Security is a big deal.
    • Mission-critical robustness is usually, in truth, just a nice-to-have. But some (self-)important executives may disagree. :)
  • Speed:
    • For some functionality — e.g. cross-database joins — BI tools almost have to rely on their own DBMS-like engines for performance.
    • For other functionality it’s more optional. You can run single-RDBMS queries straight against the underlying system, or you can pre-position some of the data in memory.
    • Please also see the adoption and administration section below.
  • User experience: I discussed this one above.
  • Adoption and administration:
    • When BI is “owned” by a department, especially one that also doesn’t manage the underlying data, set-up and administration need to be super-easy.
    • Sometimes, departmental BI is used as an excuse to pressure central IT into making data available.
    • Much like analytic DBMS, BI adoption can sometimes be tied to huge first-time-data-warehouse building projects.
    • Administration of big enterprise-standard BI is, to re-use a term, much like DBMS-lite.
  • Cost: The true cost of BI usage is commonly governed more by the underlying data management (and data acquisition) than by the BI software (and supporting servers) itself. That said:
    • BI “hard” costs — licenses, servers, cloud fees, whatever — commonly have to fit into departmental budgets.
    • So do BI people costs.
    • BI people requirements also often have to fit into departmental skill sets.

Differentiation in data management

Mon, 2015-10-26 13:32

In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:

  • Scope
  • Accuracy
  • (Other) trustworthiness
  • Speed
  • User experience
  • Cost

and sometimes also issues in adoption and administration.

Now let’s use this framework to examine two market categories I cover: data management and, in a separate post, business intelligence.

Applying this taxonomy to data management:

  • Scope: Different subcategories of data management technology are suitable for different kinds of data, different scale of data, etc. To a lesser extent that may be true within a subcategory as well.
  • Scope: Further, products may differ in what you can do with the data, especially analytically.
  • Accuracy: Don’t … lose … data.
  • Other trustworthiness:
    • Uptime, availability and so on are big deals in many data management sectors.
    • Security is hugely important for data that both belongs to other people — usually your customers — and is accessible via the internet. It’s important in numerous other database use cases as well.
    • Awkwardly, the CAP Theorem teaches us that there can be a bit of a trade-off between availability and (temporary) accuracy.
  • Speed:
    • Different kinds of data management products perform differently in different use cases.
    • If your use case is down the middle of what a mature data management subsector focuses on, performance may not vary much among individual leading products.
    • Even then, tuning effort may be quite different for different products.
  • User experience:
    • Users rarely interact directly with database management products.
    • There can be clear differentiation in database administration UIs. (The most dramatic example was perhaps the rise of Microsoft SQL Server.)
    • Data manipulation languages (DMLs) can make a huge difference in programmers’ lives.
  • Cost:
    • License and maintenance costs can be a huge issue, especially if you’re buying from traditional vendors.
    • Performance affects cost in a few ways: hardware costs for sure, tuning effort in some cases, and occasionally even vendor license/maintenance fees.
    • Ongoing operations costs can vary greatly by database product in general, and by your pre-existing in-house expertise in particular.
    • Ease of programming can sometimes lead to significant programming cost differences as well.
  • Adoption: This one is often misunderstood.
    • The effort of adopting new database technology for new applications is often overrated. When projects are huge, it’s often because of what you’re doing with the technology, not because of the technology itself.
    • Migration, however, is usually a bitch.

For reasons of length, I’m doing a separate post on differentiation in business intelligence.


Sources of differentiation

Mon, 2015-10-26 13:31

Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.

Many buying and design considerations for IT fall into six interrelated areas: 

  • Scope: What does the technology even purport to do? This consideration applies to pretty much everything. :)
    • Usually, this means something like features.
    • However, there’s an important special case in which the important features are the information content. (Examples: Arguably Google, and the Bloomberg service for sure.)
  • Accuracy: How correctly does the technology do it? This can take multiple forms.
    • Sometimes, a binary right/wrong distinction pretty much suffices, with an acceptable error rate of zero. If you’re writing data, it shouldn’t get lost. If you’re doing arithmetic, it should be correct. Etc.
    • Sometimes, there’s a clear right/wrong distinction, but error rates are necessarily non-zero, often with a trade-off between the rates for false positives and false negatives. (In text search and similar areas, those rates are measured respectively as precision and recall.) Security is a classic example. Many other cases arise when trying to identify problems or
    • Sometimes accuracy is on a scale. Predictive modeling results are commonly of that kind. So are text search, voice recognition and so on.
  • Other trustworthiness.
    • Reliability, availability and security are considerations in almost any IT scenario.
    • Also crucial are any factors that are perceived as affecting the risk of project failure. Sometimes, these are lumped together as (part of) maturity.
  • Speed. There’s a great real and/or perceived “need for speed”.
    • On the user level:
      • There are many advantages to quick results, “real time” or otherwise.
      • In particular, analysis is often more accurate if you have time for more iterations or intermediate steps.
      • Please recall that speed can actually have multiple kinds of benefit. For example, it can reduce costs, it can improve accuracy, it can improve user experience, or it can enable capabilities that would otherwise be wholly impractical.
    • There can also be considerations of time to (initial) value, although people sometimes overrate how often this is a function of the technology itself.
    • Consistency of performance can be an important aspect of product maturity.
  • User experience. Ideally, using a system is easy and pleasurable, or at least not unpleasant.
    • Ease of use often equates to ease of (re)learning …
    • … but there are exceptions, generally for what might be considered “power users”.
    • Speed and performance can avoid a lot of unpleasant frustration.
    • In some cases you can compel somebody — usually an employee — to use your interface. Often, however, you can’t, and that’s when user experience may matter most.
    • An important category of user experience that doesn’t directly equate to ease is recommendation. Of course, the more accurate the recommendations are, the better.
    • Most systems have at least two categories of user experience: one for the true users, and one for the IT folks who manage it. The IT folks’ experience often depends not just on true UI features, but on how easy or difficult the underlying system is to deal with in the first place.
  • Cost, or more precisely TCO (Total Cost of Ownership). Cost is always important, and especially so if there are numerous viable alternatives.
    • Sometimes money paid to the vendor really is the largest component of TCO.
    • Often, however, hardware or IT personnel expenditures are the lion’s share of overall cost.
    • Administrators’ user experience can affect a large chunk of TCO.
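
Since the accuracy discussion above leans on precision and recall, here is a minimal worked example in Python. The counts are made up for illustration. Precision is the share of flagged items that were truly relevant (it drops as false positives rise), while recall is the share of truly relevant items that got flagged (it drops as false negatives rise).

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute (precision, recall) from raw counts, guarding against empty cases."""
    flagged = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical search: 10 results returned, 8 of them relevant,
    # while 8 other relevant documents were missed.
    precision, recall = precision_recall(8, 2, 8)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

The trade-off mentioned above shows up directly: returning more results tends to raise recall while lowering precision, and vice versa.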


Cassandra and privacy requirements

Thu, 2015-10-15 09:18

For starters:

But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.

The story starts:

  • Cassandra groups nodes into logical “data centers” (i.e. token rings).
  • As a best practice, each physical data center can contain one or more logical data centers, but not vice-versa.
  • There are two levels of replication — within a single logical data center, and between logical data centers.
  • Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
  • However, copies of the database held elsewhere may have different replication factors …
  • … and can indeed have different replication factors for different parts of the database.

In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):

  • General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
  • Instead, you get that effect by tweaking your specific replication settings.
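
A sketch of what "tweaking your specific replication settings" can look like: the Python below builds a CQL `CREATE KEYSPACE` statement using `NetworkTopologyStrategy`, which takes a replication factor per data center. The keyspace and data center names are hypothetical, and in this sketch a factor of 0 is expressed by leaving that data center out of the map, so no replicas are placed there. Real data center names would have to match whatever the cluster reports.

```python
def create_keyspace_cql(keyspace, replication_by_dc):
    """Build a CREATE KEYSPACE statement with per-data-center replication factors."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    for dc, factor in replication_by_dc.items():
        # Data centers with a factor of 0 are omitted, so they hold no copies.
        if factor > 0:
            parts.append(f"'{dc}': {factor}")
    body = ", ".join(parts)
    return f"CREATE KEYSPACE {keyspace} WITH replication = {{{body}}};"


if __name__ == "__main__":
    # Hypothetical: customer data stays in Frankfurt and is never copied to Virginia.
    print(create_keyspace_cql("eu_customers", {"dc_frankfurt": 3, "dc_virginia": 0}))
```

Because replication is set per keyspace, different parts of a database can carry different settings, which is what makes the geo-compliance story possible.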

The most visible DataStax client using this strategy is apparently ING Bank.

If you have a geo-compliance issue, you’re probably also concerned about security. After all, the whole reason the issue arises is because one country’s government might want to look at another country’s citizens’ or businesses’ data. The DataStax security story is approximately:

  • Encryption in flight, for any Cassandra.
  • Encryption at rest, specifically with DataStax Enterprise.
  • No cell-level or row-level security until Cassandra 3.0 is introduced and established. (I didn’t actually ask whether something similar to HBase coprocessors is coming for Cassandra, but that would be my first guess.)
  • Various roles and permissions stuff.

While flexible, Cassandra’s multi-data-center features do add some complexity. Tunable-consistency choices are baked into Cassandra programs at each point data is accessed, and more data centers make for more choices. (Default best practice = write if you get a local quorum, running the slight risk of logical data centers being out of sync with each other.)
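
The "local quorum" default mentioned above comes down to simple arithmetic: a quorum is a majority of the replicas, and a local quorum counts only the replicas in the coordinator's own logical data center. A minimal Python sketch of that rule follows; the function names are mine, not Cassandra's.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def local_quorum_met(acks_in_local_dc, local_replication_factor):
    """True if enough replicas in the local data center acknowledged the write."""
    return acks_in_local_dc >= quorum(local_replication_factor)


if __name__ == "__main__":
    # With a local replication factor of 3, two local acknowledgements suffice,
    # regardless of whether remote data centers have caught up yet.
    print(quorum(3), local_quorum_met(2, 3), local_quorum_met(1, 3))
```

That last point is the "slight risk" in question: the write is reported as successful before other data centers necessarily have it.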

One way in which the whole thing does seem nice and simple is that you can have different logical data centers running on different kinds of platforms — cloud, colocation, in-house, whatever — without Cassandra caring.

I’m not going to call the DataStax Enterprise approach to geo-compliance the “gold standard”, because some of it seems pretty clunky or otherwise feature-light. On the other hand, I’m not aware of competitors who exceed it, in features or track record, so “silver standard” seems defensible.


Basho and Riak

Thu, 2015-10-15 09:18

Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.

For starters:

  • Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
  • Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
  • Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
  • Basho’s revenue is ~90% subscription.
  • Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
  • Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually) comes from 2 particular deals of >$4 million each.

Basho’s product line has gotten a bit confusing, but as best I understand things the story is:

  • There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
  • Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
  • Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
  • Riak TS is for time series, and just coming out now.
  • Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
  • There’s an umbrella marketing term of “Basho Data Platform”.

Technical notes on some of that include: 

  • Riak Core doesn’t do data management. It just manages distributed operation of — well, whatever you want to operate. In part, Basho sees Riak Core as a better Apache ZooKeeper.
    • That is the essence of the Riak/Spark pitch — something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as well.
    • The Riak/Redis pitch is even simpler — cluster management for Redis, and persistent backing as well.
    • Basho’s criticisms of ZooKeeper start with “Cluster manager, manage thyself” claims about ZooKeeper availability, as in the PagerDuty ZooKeeper critique.
  • Riak KV has secondary indexing. Performance is somewhat questionable. It also has Solr indexing, which is fast.
  • At least in its 1.0 form, Riak TS assumes:
    • There’s some kind of schema or record structure.
    • There are explicit or else easily-inferred timestamps.
    • Microsecond accuracy, perfect ordering and so on are not essential.
  • Thus, Riak TS 1.0 is not ideal for the classic Splunk use case where you text index/search on a lot of log emissions. It also is not ideal for financial tick storage.
  • Riak TS has range-based partitioning, where the range is in terms of time. Basho refers to this as “locality”.
  • Riak TS has a SQL subset. Evidently there’s decent flexibility as to which part of the database carries which schema.
  • Riak has a nice feature of allowing you to stage a change to network topology before you push it live.
  • Riak’s vector clock approach to wide-area synchronization is more controversial.
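
To illustrate the time-range partitioning ("locality") point in the list above, here is a generic Python sketch, not Riak TS code. It assumes a hypothetical 15-minute range width and groups readings so that everything for one series within one time range lands in the same partition.

```python
from collections import defaultdict


def partition_key(series, timestamp_s, range_width_s=900):
    """Map a reading to (series, start of the time range it falls in)."""
    return (series, timestamp_s - timestamp_s % range_width_s)


def group_by_partition(readings, range_width_s=900):
    """Group (series, timestamp, value) readings by their time-range partition."""
    partitions = defaultdict(list)
    for series, timestamp_s, value in readings:
        key = partition_key(series, timestamp_s, range_width_s)
        partitions[key].append((timestamp_s, value))
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical sensor readings: (series, seconds since some epoch, value).
    readings = [("sensor-1", 1000, 20.5), ("sensor-1", 1799, 20.7), ("sensor-1", 1800, 21.0)]
    for key, rows in group_by_partition(readings).items():
        print(key, rows)
```

A query over a narrow time window then only needs to touch the few partitions whose ranges overlap that window.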

Finally, notes on what Basho sees as use cases and competition include:

  • Riak KV is generally used to store usual-suspect stuff — log data, user/profile data and so on.
  • Basho thinks NoSQL is a 4-horse race — Basho/Riak KV, DataStax/Cassandra, MongoDB, Couchbase. (I would be surprised if there was much agreement with that view from, for example, MongoDB, DataStax, Aerospike, MapR or the HBase community.)
  • Basho competes on availability, scalability (including across geography) and so on, or in simplest terms:
    • “Availability and correctness”
    • Simple operation
  • Unsurprisingly, Basho thinks its closest competitor is DataStax. (However, DataStax tells me they don’t compete much with Basho.)
  • Basho thinks Riak KV has ease-of-operation advantages vs. Cassandra.
  • An example of a mission-critical Riak app is the UK National Health Service storing prescription information.
  • An example of Riak S2 and Riak KV being used together is Turner Broadcasting storing video in the former and associated metadata in the latter.
  • Riak TS is initially pointed at two use cases:
    • “Internet of Things”
    • “Metrics”, which seems to mean monitoring of system metrics.
  • Basho sees the competition for Riak TS as starting with HBase, Cassandra, and InfluxDB.

Couchbase 4.0 and related subjects

Thu, 2015-10-15 09:17

I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.

Now that you’re duly warned, let me remind you of aspects of the Couchbase timeline.

  • 2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
  • Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
  • A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
  • By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
  • Couchbase 4.0 has been in beta for a few months.

Technical notes on Couchbase 4.0 — and related riffs :) — start:

  • There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
  • “Index”, “data” and “query” are three different services/tiers.
    • You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
    • I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
  • To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
  • There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
  • ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
  • In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.
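
For readers who haven't met Bloom filters, here is a small generic Python sketch of how one can help decide which partitions to read. This is not Couchbase's implementation; the sizes, hash scheme and partition names are all assumptions made for illustration. The key property is that a Bloom filter can say "definitely not here" or "maybe here", so partitions whose filter says no can be skipped.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # A Python int used as a bit array.

    def _positions(self, key):
        # Derive several bit positions per key by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def partitions_to_read(filters, key):
    """Return only the partitions whose filter says the key might be present."""
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]


if __name__ == "__main__":
    # Hypothetical partitions and keys.
    filters = {"p0": BloomFilter(), "p1": BloomFilter()}
    filters["p0"].add("user:1")
    filters["p1"].add("user:2")
    print(partitions_to_read(filters, "user:1"))
```

A partition that really holds the key is always included; a partition that doesn't is usually, though not always, skipped.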

Up to a point, SQL-on-NoSQL stories can be fairly straightforward.

  • You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
  • SELECT, FROM and WHERE clauses work in the usual way.
  • Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.

For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.

*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.

JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier, and does nested loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.
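
For readers who want the two join strategies side by side, here is a generic Python sketch. It assumes the data has already been pulled into one place as lists of dicts; the function names and sample rows are invented, and neither function is meant to represent Couchbase’s or SequoiaDB’s actual code.

```python
def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(len(left) * len(right))."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[key] == r[key]
    ]


def hash_join(left, right, key):
    """Build a hash table on one side, then probe it: roughly O(n + m)."""
    buckets = {}
    for r in right:
        buckets.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in buckets.get(l[key], [])]


# Hypothetical sample data.
orders = [
    {"customer_id": 1, "item": "flight"},
    {"customer_id": 2, "item": "hotel"},
    {"customer_id": 1, "item": "car"},
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
```

Both functions return the same three joined rows for this data; the difference is only in how much work it takes to find them as the inputs grow.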

But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.

Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.
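
The general idea of an unnest operation can be sketched in a few lines of Python. This shows the concept as it appears in several SQL dialects (one flat row per array element, with the parent’s fields repeated); it is not N1QL, and the function name, field names and sample documents are assumptions.

```python
def unnest(docs, array_field, alias):
    """Yield one flat row per element of each document's nested array.

    Each output row repeats the parent's other fields and adds the
    array element under `alias`, so ordinary WHERE-style filters apply.
    """
    for doc in docs:
        parent = {k: v for k, v in doc.items() if k != array_field}
        for element in doc.get(array_field, []):
            yield {**parent, alias: element}


# Hypothetical nested documents.
trips = [
    {"traveler": "Ann", "legs": [{"to": "SFO", "fare": 300}, {"to": "JFK", "fare": 450}]},
    {"traveler": "Bob", "legs": [{"to": "ORD", "fare": 200}]},
]

rows = list(unnest(trips, "legs", "leg"))
# A filter on a nested value, now expressed against flat rows.
expensive = [r["traveler"] for r in rows if r["leg"]["fare"] > 250]
```

Here two documents become three rows, and the filter on fare can be written as if `leg` were an ordinary column.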

Categories: Other

Notes on privacy and surveillance, October 11, 2015

Sun, 2015-10-11 04:44

1. European Union data sovereignty laws have long had a “Safe Harbour” rule stating it was OK to ship data to the US. Per the case Maximilian Schrems v Data Protection Commissioner, this rule is now held to be invalid. Angst has ensued, and rightly so.

The core technical issues are roughly:

  • Data is usually in one logical database. Data may be replicated locally, for availability and performance. It may be replicated remotely, for availability, disaster recovery, and performance. But it’s still usually logically in one database.
  • Now remote geographic partitioning may be required by law. Some technologies (e.g. Cassandra) support that for a single logical database. Some don’t.
  • Even under best circumstances, hosting and administrative costs are likely to be higher when a database is split across more geographies (especially when the count is increased from 1 to 2).

Facebook’s estimate of billions of dollars in added costs is not easy to refute.

My next set of technical thoughts starts:

  • This is about data storage, not data use; for example, you can analyze Austrian data in the US, but you can’t store it there.
  • Of course, that can be a tricky distinction to draw. We can only hope that intermediate data stores, caches and so on can be allowed to use data from other geographies.
  • Assuming the law is generous in this regard, scan-heavy analytics are more problematic than other kinds.
  • But if there are any problems in those respects — well, if analytics can be parallelized in general, then in particular one should be able to parallelize across geographies. (Of course, this could require replicating one’s whole analytic stack across geographies.)
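
The last bullet can be illustrated with a minimal Python sketch: each geography computes a small partial aggregate locally, and only those partials are combined elsewhere. The function names and sample rows are invented, and whether aggregates may legally cross borders is exactly the kind of question the sketch does not answer.

```python
def local_partial(rows):
    """Runs inside one geography; only this small summary leaves it."""
    return {"count": len(rows), "total": sum(r["amount"] for r in rows)}


def combine(partials):
    """Runs anywhere; sees aggregates, never the raw rows."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {"count": count, "mean": total / count if count else None}


# Hypothetical per-region data that stays where it is stored.
eu_rows = [{"amount": 10.0}, {"amount": 30.0}]
us_rows = [{"amount": 20.0}]

result = combine([local_partial(eu_rows), local_partial(us_rows)])
```

This works for sums, counts and means; analytics that need row-level joins across regions would not decompose so neatly.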

2. US law enforcement is at loggerheads with major US tech companies, because it wants the right to subpoena data stored overseas. The central case here is a request to get at Microsoft’s customer data stored in Ireland. A government victory would be catastrophic for the US tech industry, but I’m hopeful that sense will — at least to some extent — prevail.

3. Ed Snowden, Glenn Greenwald and numerous other luminaries are pushing something called the Snowden Treaty, as a model for how privacy laws should be set up. I’m a huge fan of what Snowden and Greenwald have done in general, but this particular project has not started well. First, they’ve rolled the thing out while actually giving almost no details, so they haven’t really contributed anything except a bit of PR. Second, one of the few details they did provide contains a horrific error.

Specifically, they “demand”

freedom from damaging publicity, public scrutiny …

To that I can only say: “Have you guys lost your minds???????” As written, that’s a demand that can only be met by censorship laws. I’m sure this error is unintentional, because Greenwald is in fact a stunningly impassioned and articulate opponent of censorship. Even so, that’s an appallingly careless mistake, which for me casts the whole publicity campaign into serious doubt.

4. As a general rule — although the details of course depend upon where you live — it is no longer possible to move around and be confident that you won’t be tracked. This is true even if you’re not a specific target of surveillance. Ways of tracking your movements include but are not limited to:

  • Electronic records of you paying public transit fares or tolls, as relevant. (Ditto rental car fees, train or airplane tickets, etc.)
  • License plate cameras, which in the US already have billions of records on file.
  • Anything that may be inferred from your mobile phone.

5. The previous point illustrates that the strong form of the Snowden Treaty is a pipe dream — it calls for a prohibition on mass surveillance, and that will never happen, because:

  • Governments will insist on trying to prevent “terrorism” before the fact. That mass surveillance is generally lousy at doing so won’t keep them from trying.
  • Governments will insist on being able to do general criminal forensics after the fact. So they’ll want mass surveillance data sitting around just in case they find that they need it.
  • Businesses share consumers’ transaction and interaction data, and such sharing is central to the current structure of the internet industry. That genie isn’t going back into the bottle. Besides, if it did, a few large internet companies would have even more of an oligopolistic advantage vs. the others than they now do.

The huge problem with these truisms, of course, is scope creep. Once the data exists, it can be used for many more purposes than the few we’d all agree are actually OK.

6. That, in turn, leads me back to two privacy posts that I like to keep reminding people of, because they make points that aren’t commonly found elsewhere:

Whether or not you basically agree with me about privacy and surveillance, those two posts may help flesh out whatever your views on the subject actually are.

Categories: Other

Notes on packaged applications (including SaaS)

Wed, 2015-10-07 18:27

1. The rise of SAP (and later Siebel Systems) was greatly helped by Andersen Consulting, even before it was split off from the accounting firm and renamed as Accenture. My main contact in that group was Rob Kelley, but it’s possible that Brian Sommer was even more central to the industry-watching part of the operation. Brian is still around, and he just leveled a blast at the ERP* industry, which I encourage you to read. I agree with most of it.

*Enterprise Resource Planning

Brian’s argument, as I interpret it, boils down mainly to two points:

  • Big ERP companies selling big ERP systems are pathetically slow at adding new functionality. He’s right. My favorite example is the multi-decade slog to integrate useful analytics into operational apps.
  • The world of “Big Data” is fundamentally antithetical to the design of current-generation ERP systems. I think he’s right in that as well.

I’d add that SaaS (Software As A Service)/on-premises tensions aren’t helping incumbent vendors either.

But no article addresses all the subjects it ideally should, and I’d like to call out two omissions. First, what Brian said is in many cases applicable just to large and/or internet-first companies. Plenty of smaller, more traditional businesses could get by just fine with no more functionality than is in “Big ERP” today, if we stipulate that it should be:

  • Delivered via SaaS.
  • Much easier to adopt and use.

Second, even within the huge enterprise/huge app vendor world, it’s not entirely clear how integrated ERP supposedly is or isn’t with CRM (Customer Relationship Management). And a lot of what Brian talks about fits pretty cleanly into the CRM bucket.

2. In any case, there are many application areas that — again assuming that we’re in the large enterprise or large internet company world — fit well neither with classical ERP nor with its CRM sibling. For starters, investigative analytics doesn’t fit well into packaged application suites, for a myriad of reasons, the most basic of which are:

  • The whole point of investigative analytics is to discover things that are new. Therefore, business processes are inherently unpredictable.
  • So are data inputs.

If somebody does claim to be selling an app in investigative analytics, it is usually really an analytic application subsystem or else something very disconnected from other apps. Indeed, in almost all cases it’s both.

3. When it comes to customer-facing websites, I stand by my arguments three years ago in the post just linked above, which boil down to:

  • What I just said above about investigative analytics, plus the observation that …
  • … websites have a strong creative aspect that fits badly with soup-to-nuts packaged applications.

Also, complex websites are likely to rely on dynamic schemas, and packaged apps have trouble adapting to those.

4. This is actually an example of a more general point — packaged or SaaS apps generally assume rather fixed schemas. (The weasel word “rather” is included to allow for customization-through-configuration, but I think the overall point holds.) Indeed, database design is commonly the essence of packaged app technology.

5. However, those schemas do not have to be relational. It would be inaccurate to say that packaged apps always assume tabular data, because of examples such as:

  • SAP has built on top of quasi-objects for a long time, although the underpinnings are technically relational.
  • There are some cases of building entirely on an object-oriented or hierarchical data model, especially in health care.
  • Business has some inherent hierarchies that get reflected in data structures, e.g. in bills of materials or organization charts.

But even non-tabular data structures are, in the minds of app developers, usually assumed to have fixed schemas.

Categories: Other

Consumer data management

Mon, 2015-10-05 00:27

Don’t plan to fish in your personal data lake.

Perhaps the biggest mess in all of IT is the management of individual consumers’ data. Our electronic data is thoroughly scattered. Most individual portions are poorly managed. There’s no integration. The data that’s on paper is even worse. For example:

  • Do you have access to your medical records? Do you even know when you were last vaccinated for what?
  • Several enterprises have comprehensive records of all your credit card purchases, in easy-to-analyze form. Do you have such records too?
  • How easily can you find old emails? How about old paper correspondence?

For the most part, the technology community is barely trying to solve those problems. But even when it does try, success is mixed at best. For example:

And those are some of the most successful names.

There are numerous reasons for this dismal state of affairs. 

  • The problem is generically hard. There are many types of data, in both the narrow and broad senses of “type”. There are many use cases. There are many possible devices that would need to be supported. There even are a bunch of different regulatory implications.
  • Consumers aren’t going to organize data themselves. A solution that actually worked would need really great usability and automation.
  • Companies see their data about customers as an asset. They don’t want to share — even with the customers themselves.

The toughest problem, I think, is in my middle bullet point — people hate organizing their own information. That’s true, by the way, of consumers and individual employees alike. Canonical examples on the enterprise side include knowledge management, taxonomy building,* or getting salespeople to properly fill in sales force automation software forms. On the consumer side, personal computers were pitched in their very early days as a way to store recipes; how did that ever work out? Thus, the standard for usability for people to actually like personal data management technology is very high, and very difficult to meet.

*Well, canonical at least among text search geeks. :)

Despite all this negativity, I think there are two areas in which it is inevitable that consumers will wind up with access to well-organized online data stores — health and money. The first reason is simply perceived value. Health and money are both important, and people know it, and so those have always been the two areas in which consumers have willingly paid quite a bit for information and advice.

I happen to have picked up that truism in the 1990s, when I published a subscription newsletter, and the only categories in which consumer newsletters sold well were health and money. But if you don’t believe me, you could note:

  • Consumers pay a lot for money management/wealth management.
  • Consumers pay a lot to physicians who — surgeons and so on aside — strictly speaking aren’t providing anything except information and advice.

My more precise reasons for believing consumer financial data management will eventually be straightened out start:

  • Nothing is easier to organize than financial records.
  • Both your tax authority and credit card data resellers try to pull all your financial information together.

As for health care:

  • Integrated health records make lots of sense for your health care provider.
  • Ultimately, they won’t be able to deny you access to them.
  • Besides, they don’t really have much reason to deny such access.

But that’s most of the good news. Oh, I do think Apple will one of these decades come up with a decent way to manage what’s on your Apple devices. A few other point solutions will be similarly competent. But personal data lakes, or anything like that? I don’t think those are going to happen in any kind of foreseeable time frame.

Categories: Other

The potential significance of Cloudera Kudu

Mon, 2015-09-28 01:54

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Combined with Impala, Kudu is (among other things) an attempt to build a no-apologies analytic DBMS (DataBase Management System) into Hadoop. My reactions to that start:

  • It’s plausible; just not soon. What I mean by that is:
    • Success will, at best, be years away. Please keep that in mind as you read this otherwise optimistic post.
    • Nothing jumps out at me to say “This will never work!”
    • Unlike when it introduced Impala — or when I used to argue with Jeff Hammerbacher pre-Impala :) — this time Cloudera seems to have reasonable expectations as to how hard the project is.
  • There’s huge opportunity if it works.
    • The analytic RDBMS vendors are beatable. Teradata has a great track record of keeping its product state-of-the-art, but it likes high prices. Most other strong analytic RDBMS products were sold to (or originated by) behemoth companies that seem confused about how to proceed.
    • RDBMS-first analytic platforms didn’t do as well as I hoped. That leaves a big gap for Hadoop.

I’ll expand on that last point. Analytics is no longer just about fast queries on raw or simply-aggregated data. Data transformation is getting ever more complex; that’s true in general, and it’s specifically true in the case of transformations that need to happen in human real time. Predictive models now often get rescored on every click. Sometimes, they even get retrained at short intervals. And while data reduction in the sense of “event extraction from high-volume streams” isn’t that big a deal yet in commercial apps featuring machine-generated data, if growth trends continue as many of us expect, it’s only a matter of time before that changes.

Of course, this is all a bullish argument for Spark (or Flink, if I’m wrong to dismiss its chances as a Spark competitor). But it also all requires strong low-latency analytic data underpinnings, and I suspect that several kinds of data subsystem will prosper. I expect Kudu-supported Hadoop/Spark to be a strong contender for that role, along with the best of the old-school analytic RDBMS, Tachyon-supported Spark, one or more contenders from the Hana/MemSQL crowd (i.e., memory-centric RDBMS that purport to be good at analytics and transactions alike), and of course also whatever Cloudera’s strongest competitor(s) choose to back.

Categories: Other

Cloudera Kudu deep dive

Mon, 2015-09-28 01:52

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Let’s talk in more detail about how Kudu stores data.

  • As previously noted, inserts land in an in-memory row store, which is periodically flushed to the column store on disk. Queries are federated between these two stores. Vertica taught us to call these the WOS (Write-Optimized Store) and ROS (Read-Optimized Store) respectively, and I’ll use that terminology here.
  • Part of the ROS is actually another in-memory store, aka the DeltaMemStore, where updates and deletes land before being applied to the DiskRowSets. These stores are managed separately for each DiskRowSet. DeltaMemStores are checked at query time to confirm whether what’s in the persistent store is actually up to date.
  • A major design goal for Kudu is that compaction should never block, nor greatly slow, other work. In support of that:
    • Compaction is done, server-by-server, via a low-priority but otherwise always-on background process.
    • There is a configurable maximum to how big a compaction process can be — more precisely, the limit is to how much data the process can work on at once. The current default figure = 128 MB, which is 4X the size of a DiskRowSet.
    • When done, Kudu runs a little optimization to figure out which 128 MB to compact next.
  • Every tablet has its own write-ahead log.
    • This creates a practical limitation on the number of tablets …
    • … because each tablet is causing its own stream of writes to “disk” …
    • … but it’s only a limitation if your “disk” really is all spinning disk …
    • … because multiple simultaneous streams work great with solid-state memory.
  • Log retention is configurable, typically the greater of 5 minutes or 128 MB.
  • Metadata is cached in RAM. Therefore:
    • ALTER TABLE kinds of operations that can be done by metadata changes only — i.e. adding/dropping/renaming columns — can be instantaneous.
    • To keep from being screwed up by this, the WOS maintains a column that labels rows by which schema version they were created under. I immediately called this MSCC — Multi-Schema Concurrency Control :) — and Todd Lipcon agreed.
  • Durability, as usual, boils down to “Wait until a quorum has done the writes”, with a configurable option as to what constitutes a “write”.
    • Servers write to their respective write-ahead logs, then acknowledge having done so.
    • If it isn’t too much of a potential bottleneck — e.g. if persistence is on flash — the acknowledgements may wait until the log has been fsynced to persistent storage.
  • There’s a “thick” client library which, among other things, knows enough about the partitioning scheme to go straight to the correct node(s) on a cluster.
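
The DeltaMemStore bullet above can be sketched in Python as follows. This is a toy model under stated assumptions: the class and method names are invented, a dict stands in for the persistent column files, and a single `compact()` call stands in for Kudu’s background process.

```python
DELETED = object()  # sentinel marking a pending delete


class DiskRowSetSketch:
    def __init__(self, base_rows):
        # Stand-in for persistent column files, keyed by primary key.
        self.base = {k: dict(v) for k, v in base_rows.items()}
        # Pending updates and deletes that have not been applied yet.
        self.delta_mem_store = {}

    def update(self, key, changes):
        if key not in self.base or self.delta_mem_store.get(key) is DELETED:
            raise KeyError(key)
        self.delta_mem_store.setdefault(key, {}).update(changes)

    def delete(self, key):
        self.delta_mem_store[key] = DELETED

    def read(self, key):
        """Check the delta store so stale persistent data is never returned."""
        delta = self.delta_mem_store.get(key)
        if delta is DELETED or key not in self.base:
            return None
        return {**self.base[key], **(delta or {})}

    def compact(self):
        """Fold pending deltas into the base data and empty the delta store."""
        for key, delta in self.delta_mem_store.items():
            if delta is DELETED:
                self.base.pop(key, None)
            else:
                self.base[key].update(delta)
        self.delta_mem_store.clear()
```

Reads are correct both before and after compaction; compaction just moves the cost of reconciling deltas out of the query path.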

Leaving aside the ever-popular possibilities of:

  • Cluster-wide (or larger) equipment outages
  • Bugs

the main failure scenario for Kudu is:

  • The leader version of a tablet (within its replica set) goes down.
  • A new leader is elected.
  • The workload is such that the client didn’t notice and adapt to the error on its own.

Todd says that Kudu’s MTTR (Mean Time To Recovery) for write availability tests internally at 1-2 seconds in such cases, and shouldn’t really depend upon cluster size.

Beyond that, I had some difficulties understanding details of the Kudu write path(s). An email exchange ensued, and Todd kindly permitted me to post some of his own words (edited by me for clarification and format).

Every tablet has its own in-memory store for inserts (MemRowSet). From a read/write path perspective, every tablet is an entirely independent entity, with its own MemRowSet, rowsets, etc. Basically the flow is:

  • The client wants to make a write (i.e. an insert/update/delete), which has a primary key.
    • The client applies the partitioning algorithm to determine which tablet that key belongs in.
    • The information about which tablets cover which key ranges (or hash buckets) is held in the master. (But since it is cached by the clients, this is usually a local operation.)
    • It sends the operation to the “leader” replica of the correct tablet (batched along with any other writes that are targeted to the same tablet).
  • Once the write reaches the tablet leader:
    • The leader enqueues the write to its own WAL (Write-Ahead Log) and also enqueues it to be sent to the “follower” replicas.
    • Once it has reached a majority of the WALs (i.e. 2/3 when the replication factor = 3), the write is considered “replicated”. That is to say, it’s durable and would always be rolled forward, even if the leader crashed at this point.
    • Only now do we enter the “storage” part of the system, where we start worrying about MemRowSets vs DeltaMemStores, etc.
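
The majority rule in that flow can be sketched in Python. The names are invented, an in-memory list stands in for each WAL, and a real Raft implementation also handles terms, retries and leader election, none of which appear here.

```python
def majority(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


class ReplicaSketch:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.wal = []  # stand-in for a write-ahead log

    def append(self, op):
        """Return True (an acknowledgement) if the op reached this WAL."""
        if not self.up:
            return False
        self.wal.append(op)
        return True


def replicate(op, leader, followers):
    """Leader logs the op and forwards it; 'replicated' once a majority acks."""
    replicas = [leader] + followers
    acks = sum(1 for r in replicas if r.append(op))
    return acks >= majority(len(replicas))
```

With a replication factor of 3, `majority(3)` is 2, so a write still counts as replicated when one follower is down, but not when both are.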

Put another way, there is a fairly clean architectural separation into three main subsystems:

  • Metadata and partitioning (map from a primary key to a tablet, figure out which servers host that tablet).
  • Consensus replication (given a write operation, ensure that it is durably logged and replicated to a majority of nodes, so that even if we crash, everyone will agree whether it should be applied or not).
  • Tablet storage (now that we’ve decided a write is agreed upon across replicas, actually apply it to the database storage).
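
The first of those subsystems, mapping a primary key to a tablet and then to a server, might look roughly like this in Python. The hash function, split points and server names are assumptions chosen for the example, not details from Kudu.

```python
import bisect
import zlib


def hash_tablet(primary_key, num_tablets):
    """Hash partitioning: a stable hash of the key picks the bucket."""
    return zlib.crc32(str(primary_key).encode()) % num_tablets


def range_tablet(primary_key, split_points):
    """Range partitioning: split_points[i] is the first key of tablet i + 1."""
    return bisect.bisect_right(split_points, primary_key)


# Hypothetical metadata a client might have cached from the master.
tablet_leaders = {0: "server-a", 1: "server-b", 2: "server-c"}


def route(primary_key, split_points):
    """Resolve a key to the server hosting its tablet's leader replica."""
    return tablet_leaders[range_tablet(primary_key, split_points)]
```

Because the lookup uses only cached metadata, routing a write is a local computation in the common case.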

These three areas of the code are separated as much as possible; for example, once we’re in the “tablet storage” code, it has no idea that there might be other tablets. Similarly, the replication and partitioning code don’t know much of anything about MemRowSets, etc.; that’s entirely within the tablet layer.

As for reading — the challenge isn’t in the actual retrieval of the data so much as in figuring out where to retrieve it from. What I mean by that is:

  • Data will always be either in memory or in a persistent column store. So I/O speed will rarely be a problem.
  • Rather, the challenge to Kudu’s data retrieval architecture is finding the relevant record(s) in the first place, which is slightly more complicated than in some other systems. For upon being told the requested primary key, Kudu still has to:
    • Find the correct tablet(s).
    • Find the record(s) on the (rather large) tablet(s).
    • Check various in-memory stores as well.

The “check in multiple places” problem doesn’t seem to be of much concern, because:

  • All that needs to be checked is the primary key column.
  • The on-disk data is front-ended by Bloom filters.
  • The cases in which a Bloom filter returns a false positive are generally the same busy ones where the key column is likely to be cached in RAM.
  • Cloudera just assumes that checking a few different stores in RAM isn’t going to be a major performance issue.

When it comes to searching the tablets themselves:

  • Kudu tablets feature data skipping among DiskRowSets, based on value ranges for the primary key.
  • The whole point of compaction is to make the data skipping effective.
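
Why compaction matters for data skipping can be shown with a small Python sketch. The rowset representation and the naive merge-and-resplit `compact` function are assumptions for illustration; Kudu’s actual compaction policy is more selective than this.

```python
def make_rowset(keys):
    """A rowset summarized by the min and max of its primary keys."""
    keys = sorted(keys)
    return {"min_key": keys[0], "max_key": keys[-1], "keys": keys}


def rowsets_to_probe(rowsets, key):
    """Keep only rowsets whose [min_key, max_key] range could hold the key."""
    return [rs for rs in rowsets if rs["min_key"] <= key <= rs["max_key"]]


def compact(rowsets):
    """Merge all keys and re-split into non-overlapping rowsets."""
    keys = sorted(k for rs in rowsets for k in rs["keys"])
    size = max(1, len(keys) // len(rowsets))
    chunks = [keys[i:i + size] for i in range(0, len(keys), size)]
    return [make_rowset(c) for c in chunks]


# Freshly flushed rowsets tend to overlap, so their ranges exclude little.
overlapping = [
    make_rowset([1, 50, 99]),
    make_rowset([2, 51, 98]),
    make_rowset([3, 52, 97]),
]
compacted = compact(overlapping)
```

Looking up key 51 has to probe all three overlapping rowsets, but only one of the compacted ones.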

Finally, Kudu pays a write-time (or compaction-time) cost to boost retrieval speeds from inside a particular DiskRowSet, by creating something that Todd called an “ordinal index” but agreed with me would be better called something like “ordinal offset” or “offset index”. Whatever it’s called, it’s an index that tells you the number of rows you would need to scan before getting the one you want, thus allowing you to retrieve (except for the cost of an index probe) at array speeds.
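
A rough Python sketch of the offset-index idea follows. A binary search over sorted primary keys stands in for the index probe; the class name and data are invented, and the source does not say how Kudu actually structures this index.

```python
import bisect


class ColumnarRowSetSketch:
    """Rows sorted by primary key; each other column stored as its own array."""

    def __init__(self, rows, key_column):
        rows = sorted(rows, key=lambda r: r[key_column])
        self.keys = [r[key_column] for r in rows]  # sorted primary keys
        self.columns = {
            name: [r[name] for r in rows]
            for name in rows[0]
            if name != key_column
        }

    def ordinal(self, key):
        """Index probe: how many rows precede this key, or None if absent."""
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None

    def get(self, key, column):
        """After the probe, fetching a value is plain array indexing."""
        pos = self.ordinal(key)
        return None if pos is None else self.columns[column][pos]
```

Once the ordinal is known, the same position works in every column array, which is what makes retrieval run at array speeds.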

Categories: Other

Introduction to Cloudera Kudu

Mon, 2015-09-28 01:50

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.

*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes. :)

For starters:

  • Kudu is an alternative to HDFS (Hadoop Distributed File System), or to HBase.
  • Kudu is meant to be the underpinning for Impala, Spark and other analytic frameworks or engines.
  • Kudu is not meant for OLTP (OnLine Transaction Processing), at least in any foreseeable release. For example:
    • Kudu doesn’t support multi-row transactions.
    • There are no active efforts to front-end Kudu with an engine that is fast at single-row queries.
    • Kudu is rather columnar, except for transitory in-memory stores.
  • Kudu’s core design points are that it should:
    • Accept data very quickly.
    • Immediately make that data available for analytics.
  • More specifically, Kudu is meant to accept, along with slower forms of input:
    • Lots of fast random writes, e.g. of web interactions.
    • Streams, viewed as a succession of inserts.
    • Updates and inserts alike.
  • The core “real-time” use cases for which Kudu is designed are, unsurprisingly:
    • Low-latency business intelligence.
    • Predictive model scoring.
  • Kudu is designed to work fine with spinning disk, and indeed has been tested to date mainly on disk-only nodes. Even so, Kudu’s architecture is optimized for the assumption that there will be at least some flash on the node.
  • Kudu is designed primarily to support relational/SQL processing. However, Kudu also has a nested-data roadmap, which of course starts with supporting the analogous capabilities in Impala.

Also, it might help clarify Kudu’s status and positioning if I add:

  • Kudu is in its early days — heading out to open source and beta now, with maturity still quite a way off. Many obviously important features haven’t been added yet.
  • Kudu is expected to be run with a replication factor (tunable, usually =3). Replication is via the Raft protocol.
  • Kudu and HDFS can run on the same nodes. If they do, they are almost entirely separate from each other, with the main exception being some primitive workload management to help them share resources.
  • Permanent advantages of older alternatives over Kudu are expected to include:
    • Legacy. Older, tuned systems may work better over some HDFS formats than over Kudu.
    • Pure batch updates. Preparing data for immediate access has overhead.
    • Ultra-high update volumes. Kudu doesn’t have a roadmap to completely catch up in write speeds with NoSQL or in-memory SQL DBMS.

Kudu’s data organization story starts:

  • Storage is right on the server (this is of course also the usual case for HDFS).
  • On any one server, Kudu data is broken up into a number of “tablets”, typically 10-100 tablets per node.
  • Inserts arrive into something called a MemRowSet and are soon flushed to something called a DiskRowSet. Much as in Vertica:
    • MemRowSets are managed by an in-memory row store.
    • DiskRowSets are managed by a persistent column store.*
    • In essence, queries are internally federated between the in-memory and persistent stores.
  • Each DiskRowSet contains a separate file for each column in the table.
  • DiskRowSets are tunable in size. 32 MB currently seems like the optimal figure.
  • Page size default is 256K, but can be dropped as low as 4K.
  • DiskRowSets feature columnar compression, with a variety of standard techniques.
    • All compression choices are specific to a particular DiskRowSet.
    • In the case of dictionary/token compression, so is the dictionary.
    • Thus, data is decompressed before being operated on by a query processor.
    • Also, selected columns or an entire DiskRowSet can be block-compressed.
  • Tables and DiskRowSets do not expose any kind of RowID. Rather, tables have primary keys in the usual RDBMS way.
  • Kudu can partition data in the three usual ways: randomly, by range or by hash.
  • Kudu does not (yet) have a slick and well-tested way to broadcast-replicate a small table across all nodes.
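
The MemRowSet/DiskRowSet split, with a dictionary that belongs to each DiskRowSet, can be sketched in Python. All names are invented, lists in memory stand in for column files on disk, and real Kudu scans return whole rows rather than one column.

```python
def dictionary_encode(values):
    """Return (dictionary, codes); the dictionary belongs to this rowset only."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


class TabletSketch:
    def __init__(self):
        self.mem_row_set = []    # in-memory row store for fresh inserts
        self.disk_row_sets = []  # each one: {column: (dictionary, codes)}

    def insert(self, row):
        self.mem_row_set.append(dict(row))

    def flush(self):
        """Pivot buffered rows into one columnar, dictionary-encoded rowset."""
        if not self.mem_row_set:
            return
        columns = list(self.mem_row_set[0].keys())
        self.disk_row_sets.append(
            {c: dictionary_encode([r[c] for r in self.mem_row_set]) for c in columns}
        )
        self.mem_row_set = []

    def scan(self, column):
        """Federated read: decode every flushed rowset, then add unflushed rows."""
        for rowset in self.disk_row_sets:
            dictionary, codes = rowset[column]
            yield from (dictionary[code] for code in codes)
        yield from (r[column] for r in self.mem_row_set)
```

Note that `scan` decodes values before handing them back, which matches the point that data is decompressed before a query processor operates on it.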

*I presume there are a few ways in which Kudu’s efficiency or overhead seem more row-store-like than columnar. Still, Kudu seems to meet the basic requirements to be called a columnar system.

Categories: Other

Rocana’s world

Thu, 2015-09-17 05:49

For starters:

  • My client Rocana is the renamed ScalingData, where Rocana is meant to signify ROot Cause ANAlysis.
  • Rocana was founded by Omer Trajman, who I’ve referenced numerous times in the past, and who I gather is a former boss of …
  • … cofounder Eric Sammer.
  • Rocana recently told me it had 35 people.
  • Rocana has a very small number of quite large customers.

Rocana portrays itself as offering next-generation IT operations monitoring software. As you might expect, this has two main use cases:

  • Actual operations — figuring out exactly what isn’t working, ASAP.
  • Security.

Rocana’s differentiation claims boil down to fast and accurate anomaly detection on large amounts of log data, including but not limited to:

  • The sort of network data you’d generally think of — “everything” except packet-inspection stuff.
  • Firewall output.
  • Database server logs.
  • Point-of-sale data (at a retailer).
  • “Application data”, whatever that means. (Edit: See Tom Yates’ clarifying comment below.)

In line with segment leader Splunk’s pricing, data volumes in this area tend to be described in terms of new data/day. Rocana seems to start around 3 TB/day, which not coincidentally is a range that would generally be thought of as:

  • Challenging for Splunk, and for the budgets of Splunk customers.
  • Not a big problem for well-implemented Hadoop.

And so part of Rocana’s pitch, familiar to followers of analytic RDBMS and Hadoop alike, is “We keep and use all your data, unlike the legacy guys who make you throw some of it away up front.”

Since Rocana wants you to keep all your data, 3 TB/day is about 1 PB/year.
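
The arithmetic behind that figure, as a short Python check. It assumes decimal units (1 PB = 1,000 TB) and ignores replication and compression, neither of which the figure above accounts for.

```python
TB_PER_DAY = 3
DAYS_PER_YEAR = 365
TB_PER_PB = 1000  # decimal units, an assumption

tb_per_year = TB_PER_DAY * DAYS_PER_YEAR  # 1095 TB
pb_per_year = tb_per_year / TB_PER_PB     # a little over 1 PB
```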

But really, that’s just saying that Rocana is an analytic stack built on Hadoop, using Hadoop for what people correctly think it’s well-suited for, done by guys who know a lot about Hadoop.

The cooler side of Rocana, to my tastes, is the actual analytics. Truth be told, I find almost any well thought out event-series analytics story cool. It’s an area much less mature than relational business intelligence, and accordingly with much more scope for innovation. On the visualization side, crucial aspects start:

  • Charting over time (duh).
  • Comparing widely disparate time intervals (e.g., current vs. historical/baseline).
  • Whichever good features from relational BI apply to your use case as well.

Other important elements may be more data- or application-specific — and the fact that I don’t have a long list of particulars illustrates just how immature the area really is.

Even cooler is Rocana’s integration of predictive modeling and BI, about which I previously remarked:

The idea goes something like this:

  • Suppose we have lots of logs about lots of things. Machine learning can help:
    • Notice what’s an anomaly.
    • Group together things that seem to be experiencing similar anomalies.
  • That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me.

So far as I can tell, predictive modeling is used to notice aberrant data (raw or derived). This is quickly used to define a subset of data to drill down to (e.g., certain kinds of information from certain machines in a certain period of time). Event-series BI/visualization then lets you see the flows that led to the aberrant result, which with any luck will allow you to find the exact place where the data first goes wrong. And that, one hopes, is something that the ops guys can quickly fix.
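
The flow described above (notice anomalies, group the things that share them, hand the subset to a human) can be sketched roughly as follows. This is not Rocana's method: a simple baseline-versus-current standard-deviation test stands in for real predictive modeling, and the function names, the threshold of 3.0, and the sample counts are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def anomalous_points(baseline, current, threshold=3.0):
    """Return indexes in `current` that sit more than `threshold`
    baseline standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: any departure from it counts as an anomaly.
        return [i for i, v in enumerate(current) if v != mu]
    return [i for i, v in enumerate(current) if abs(v - mu) / sigma > threshold]


def group_by_interval(per_machine):
    """Map each anomalous interval to the machines that misbehaved in it."""
    groups = defaultdict(list)
    for machine, (baseline, current) in per_machine.items():
        for i in anomalous_points(baseline, current):
            groups[i].append(machine)
    # Keep only intervals where several machines look wrong together.
    return {i: sorted(ms) for i, ms in groups.items() if len(ms) > 1}


# Hypothetical errors-per-minute counts: (historical baseline, current window).
logs = {
    "web1": ([10, 12, 11, 9, 10, 11], [10, 11, 50, 10]),
    "web2": ([20, 21, 19, 20, 22, 20], [20, 21, 90, 19]),
    "db1": ([5, 5, 6, 5, 4, 5], [5, 5, 5, 6]),
}

drill_down = group_by_interval(logs)
print(drill_down)
```

The result is the "subset of data to drill down to": a time interval plus the machines that went wrong in it, which is where the event-series visualization would take over.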

I think similar approaches could make sense in numerous application segments.

Categories: Other

DataStax and Cassandra update

Mon, 2015-09-14 00:02

MongoDB isn’t the only company I reached out to recently for an update. Another is DataStax. I chatted mainly with Patrick McFadin, somebody with whom I’ve had strong consulting relationships at both a user and a vendor. But Rachel Pedreschi contributed the marvelous phrase “twinkling dashboard”.

It seems fair to say that in most cases:

  • Cassandra is adopted for operational applications, specifically ones with requirements for extreme uptime and/or extreme write speed. (Of course, it should also be the case that NoSQL data structures are a good fit.)
  • Spark, including SparkSQL, and Solr are seen primarily as ways to navigate or analyze the resulting data.

Those generalities, in my opinion, make good technical sense. Even so, there are some edge cases or counterexamples, such as:

  • DataStax trumpets British Gas’ plans for collecting a lot of sensor data and immediately offering it up for analysis.*
  • Safeway uses Cassandra for a mobile part of its loyalty program, scoring customers and pushing coupons at them.
  • A large title insurance company uses Cassandra-plus-Solr to manage a whole lot of documents.

*And so a gas company is doing lightweight analysis on boiler temperatures, which it regards as hot data. :)

While most of the specifics are different, I’d say similar things about MongoDB, Cassandra, or any other NoSQL DBMS that comes to mind:

  • You can get any kind of data into them very fast; indeed, that’s a central part of what they were designed for.
  • In the general case, getting it back out for low-latency analytics is problematic …
  • … but there’s an increasing list of exceptions.

For DataStax Enterprise, exceptions start:

  • Formally, you can do almost anything in at least one of Solr or Spark/SparkSQL. So if volumes are low enough, you’re fine. In particular, Spark offers the potential to do many things at in-memory speeds.
  • Between Spark, the new functions, and general scripting, there are several ways to do low-latency aggregations. This can lead to “twinkling dashboards”.*
  • DataStax is alert to the need to stream data into Cassandra.
    • That’s central to the NoSQL expectation of ingesting internet data very quickly.
    • Kafka, Storm and Spark Streaming all seem to be in the mix.
  • Solr over Cassandra has a searchable RAM buffer, which can give the effect of real-time text indexing within a second or so of ingest.

*As much as I love the “twinkling dashboard” term — it reminds me of my stock analyst days — it does raise some concerns. In many use cases, human real-time BI should be closely integrated with the more historical kind.

DataStax Enterprise:

  • Is based on Cassandra 2.1.
  • Will probably never include Cassandra 2.2, waiting instead for …
  • … Cassandra 3.0, which will feature a storage engine rewrite …
  • … and will surely include Cassandra 2.2 features of note.

This connects to what I said previously in that Cassandra 2.2 adds some analytic features, specifically in the area of user-defined functions. Notes on Cassandra 2.2 UDFs include:

  • These are functions — not libraries, a programming language, or anything like that.
  • The “user-defined” moniker notwithstanding, the capability has been used to implement COUNT, SUM, AVG, MAX and so on.
  • You are meant to run user-defined functions on data in a single Cassandra partition; run them across partitions at your own performance risk.
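
To see how built-in aggregates such as AVG can fall out of plain functions, here is a sketch in Python rather than CQL. It mirrors the general shape of a Cassandra user-defined aggregate (a state function folded over the rows, plus an optional final function), but the names `avg_state`, `avg_final` and `aggregate`, and the sample partition, are hypothetical.

```python
from functools import reduce


def avg_state(state, value):
    """State function: fold one value into a running (count, total)."""
    count, total = state
    if value is None:
        # Skip missing values, as SQL-style aggregates usually do.
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn the accumulated state into the answer."""
    count, total = state
    return total / count if count else None


def aggregate(rows, state_fn, final_fn, init):
    """Run the fold over the rows of one (hypothetical) partition."""
    return final_fn(reduce(state_fn, rows, init))


# Hypothetical readings stored in a single partition.
partition = [21.5, 22.0, None, 23.5]
average = aggregate(partition, avg_state, avg_final, (0, 0.0))
print(average)
```

The fold touches every row it is given, which is one way to see why running such a function across many partitions carries a performance risk.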

And finally, some general tidbits:

  • A while ago, Apple said it had >75,000 Cassandra nodes. The figure is surely bigger now.
  • There are at least several other petabyte range Cassandra installations, and several more half-petabyte ones.
  • Netflix is not one of those. Instead, it has many 10s of smaller Cassandra clusters.
  • There are Cassandra users with >1 million reads+writes per second.

Plus a couple of random notes:

  • One of the text search use cases for Solr/Cassandra is to — in one query — get at information that originated in multiple places, e.g. for reasons of time period or geography. (I hear this about text search across lots of database technologies, relational and non-relational alike.)
  • As big a change as Cassandra 3.0 will be, it will not require that you take down your applications for an upgrade. That hasn’t been necessary since Cassandra 0.7.

Categories: Other

MongoDB update

Thu, 2015-09-10 04:33

One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:

  • >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
  • ~75,000 users of MongoDB Cloud Manager.
  • Estimated ~1/4 million production users of MongoDB total.

Also >530 staff, and I think that number is a little out of date.

MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:

  • Some JOIN capabilities.
    • Specifically, these are left outer joins, so they’re for lookup but not for filtering.
    • JOINs are not restricted to specific shards of data …
    • … but do benefit from data co-location when it occurs.
  • A BI connector. Think of this as a MongoDB-to-SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
    • Basic SQL comes in.
    • Filters and GroupBys are pushed down to MongoDB. A result set … well, it results. :)
    • The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
  • Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
    • This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
    • MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
    • MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
    • MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
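
The strict-versus-relaxed distinction in the validation bullets can be illustrated with a small sketch. This is plain Python, not MongoDB's API: the rule table, the `insert` function, and the in-memory list standing in for a collection are all hypothetical.

```python
import logging

logger = logging.getLogger("validation_sketch")

# Hypothetical field-specific rules; each one checks a single field in isolation.
RULES = {
    "email": lambda v: isinstance(v, str) and "@" in v,
    "age": lambda v: isinstance(v, int) and 0 <= v < 150,
}


class InvalidDocument(ValueError):
    """Raised when a document fails validation in strict mode."""


def is_valid(doc, rules=RULES):
    """Combine the per-field rules into one yes/no check on the document."""
    return all(field in doc and check(doc[field]) for field, check in rules.items())


def insert(collection, doc, strict=True):
    """Append doc to an in-memory list standing in for a collection."""
    if not is_valid(doc):
        if strict:
            raise InvalidDocument(f"rejected: {doc!r}")
        # Relaxed mode: note the problem and store the document anyway.
        logger.warning("invalid document stored anyway: %r", doc)
    collection.append(doc)


people = []
insert(people, {"email": "a@example.com", "age": 41})
insert(people, {"email": "not-an-address", "age": 41}, strict=False)
try:
    insert(people, {"email": "b@example.com", "age": -5})
except InvalidDocument as err:
    print(err)
print(len(people))
```

Relaxed mode is what lets the check be switched on over existing data: nothing that used to succeed starts failing, and the log shows what would have been rejected.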

There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. 

  • The name will change, in part because if you try to search on that name you’ll probably find an unrelated Scout. :)
  • Scout samples data, runs stats, and all that stuff.
  • Scout is referred to as a “schema introspection” tool, but I’m not sure why; schema introspection sounds more like a feature or architectural necessity than an actual product.

As for storage engines:

  • WiredTiger, which was the biggest deal in MongoDB 3.0, will become the default in 3.2. I continue to think analogies to InnoDB are reasonably appropriate.
  • An in-memory storage engine option was also announced with MongoDB 3.0. Now there’s a totally different in-memory option. However, details were not available at posting time. Stay tuned.
  • Yet another MongoDB storage engine, based on or akin to WiredTiger, will do encryption. Presumably, overhead will be acceptably low. Key management and all that will be handled by usual-suspect third parties.

Finally — most data management vendors brag to me about how important their text search option is, although I’m not necessarily persuaded. :) MongoDB does have built-in text search, of course, of which I can say:

  • It’s a good old-fashioned TF/IDF algorithm. (Term Frequency/Inverse Document Frequency.)
  • About the fanciest stuff they do is tokenization and stemming. (In a text search context, tokenization amounts to the identification of word boundaries and the like. Stemming is noticing that alternate forms of the same word really are the same thing.)
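
For readers who haven't met the technique, a bare-bones TF/IDF ranking with tokenization and stemming might look like the sketch below. It is not MongoDB's implementation: the suffix-stripping `stem` function is a deliberately crude stand-in for a real stemmer, and the documents are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Find word boundaries: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Crude suffix stripping; a real stemmer is far more careful."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def terms(text):
    return [stem(t) for t in tokenize(text)]


def tf_idf_scores(query, docs):
    """Score each document against the query: sum of tf * idf per query term."""
    doc_terms = [Counter(terms(d)) for d in docs]
    n = len(docs)
    scores = []
    for counts in doc_terms:
        length = sum(counts.values()) or 1
        score = 0.0
        for term in set(terms(query)):
            df = sum(1 for c in doc_terms if term in c)
            if df == 0:
                continue  # term appears nowhere, so it contributes nothing
            tf = counts[term] / length
            idf = math.log(n / df)
            score += tf * idf
        scores.append(score)
    return scores


docs = ["indexing strategies", "indexes speed up queries", "cooking with gas"]
scores = tf_idf_scores("index", docs)
print(scores)
```

Stemming is what lets the query "index" match both "indexing" and "indexes"; the shorter first document scores higher because the term makes up a larger share of it.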

This level of technology was easy to get in the 1990s. One thing that’s changed in the intervening decades, however, is that text search commonly supports more languages. MongoDB offers stemming in 8 or 9 languages for free, plus a paid option via Basis for other languages yet.

Categories: Other

Multi-model database managers

Mon, 2015-08-24 02:07

I’d say:

  • Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
  • Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
  • Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that InterSystems Caché was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification (although the buzzwords of course aren’t from me), namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMS date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, when I argued:

Developments since then have been in line with my thoughts. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

  • Operational applications have always needed to accept immediate writes. (Losing data is bad.)
  • Operational applications have always needed to serve small query result sets based on the freshest data. (If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
  • It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
  • The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
  • … most such analysis should look at historical data as well.
  • Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.
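
The write-one-way, read-another tension can be made concrete with a toy example: events are appended as lightly structured JSON, then copied into a relational table before being queried. This is a sketch only; the `orders` table, its columns, and the sample events are hypothetical, and SQLite merely stands in for the "read" side.

```python
import json
import sqlite3

write_log = []  # append-only list standing in for a document store or log


def write_event(doc):
    """Writing is cheap: no schema, just append the serialized document."""
    write_log.append(json.dumps(doc))


def load_into_table(conn):
    """The data-movement step: project documents onto fixed columns."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL)")
    rows = []
    for line in write_log:
        doc = json.loads(line)
        # Fields outside the relational schema (e.g. "note") are dropped here.
        rows.append((doc.get("customer"), doc.get("amount")))
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)


write_event({"customer": "acme", "amount": 10.0, "note": "rush"})
write_event({"customer": "acme", "amount": 5.5})
write_event({"customer": "zenith", "amount": 7.0})

conn = sqlite3.connect(":memory:")
load_into_table(conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The `load_into_table` step is exactly the data movement that takes time; a store that handles both roles has to make that step unnecessary or invisible.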

Related links

  • The two-policemen joke seems ever more relevant.
  • My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
  • Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.

Categories: Other

Data messes

Mon, 2015-08-03 03:58

A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.

To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — :) — mine.)

Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:

  • Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
  • Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)

Inconsistency can take multiple forms, including: 

  • Variant names.
  • Variant spellings.
  • Variant data structures (not to mention datatypes, formats, etc.).

Addressing the first two is the province of master data management (MDM), and also of the same data cleaning technologies that might help with outright errors. Addressing the third is the province of other data integration technology, which also may be what’s needed to break down the barriers between data silos.
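
As a toy illustration of the variant-spelling problem, the sketch below maps messy names onto a canonical list using normalization plus fuzzy matching from Python's standard `difflib` module. Real MDM and data cleaning tools do far more; the canonical list, the 0.8 cutoff, and the function names are assumptions made for the example.

```python
import difflib
import re

# Hypothetical master list of canonical entity names.
CANONICAL = ["International Business Machines", "Hewlett-Packard", "Oracle"]


def normalize(name):
    """Lowercase and collapse punctuation so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def resolve(name, canonical=CANONICAL, cutoff=0.8):
    """Return the canonical name closest to `name`, or None if nothing is close."""
    by_key = {normalize(c): c for c in canonical}
    matches = difflib.get_close_matches(
        normalize(name), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[matches[0]] if matches else None


print(resolve("hewlet packard"))  # misspelling plus missing hyphen
print(resolve("ORACLE"))          # case variant
print(resolve("Zzz Unknown"))     # no plausible match
```

Note what this does not handle: variant names proper (say, an abbreviation versus the full company name) need a lookup table of aliases, not string similarity.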

So far I’ve been assuming that data is neatly arranged in fields in some kind of database. But suppose it’s in documents or videos or something? Well, then there’s a needed step of data enhancement; even when that’s done, further data integration issues are likely to be present.

All of the above issues occur with analytic data too. In some cases it probably makes sense not to fix them until the data is shipped over for analysis. In other cases, it should be fixed earlier, but isn’t. And in hybrid cases, data is explicitly shipped to an operational data warehouse where the problems are presumably fixed.

Further, some problems are much greater in their analytic guise. Harmonization and integration among data silos are likely to be much more intense. (What is one table for analytic purposes might be many different ones operationally, for reasons that might span geography, time period, or application legacy.) Addressing those issues is the province of data integration technologies old and new. Also, data transformation and enhancement are likely to be much bigger deals in the analytic sphere, in part because of poly-structured internet data. Many Hadoop and now Spark use cases address exactly those needs.

Let’s now consider missing data. In operational cases, there are three main kinds of missing data:

  • Missing values, as a special case of inaccuracy.
  • Data that was only collected over certain time periods, as a special case of changing data structure.
  • Data that hasn’t been derived yet, as the main case of a need for data enhancement.

All of those cases can ripple through to cause analytic headaches. But for certain inherently analytic data sets — e.g. a weblog or similar stream — the problem can be even worse. The data source might stop functioning, or might change the format in which it transmits; but with no immediate operations compromised, it might take a while to even notice. I don’t know of any technology that does a good, simple job of addressing these problems, but I am advising one startup that plans to try.
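
The two failure modes just mentioned (a source that goes quiet, and a source that changes its format) are at least easy to state in code, even if doing this well at scale is the hard part. The sketch below is a minimal illustration under assumed inputs: events are dicts with a `ts` timestamp, and the expected field set and the five-minute gap limit are invented.

```python
from datetime import datetime, timedelta


def check_feed(events, expected_fields, max_gap):
    """Report silences longer than max_gap and events whose fields changed."""
    problems = []
    previous = None
    for event in events:
        ts = event["ts"]
        if previous is not None and ts - previous > max_gap:
            problems.append(("gap", previous, ts))
        fields = set(event) - {"ts"}
        if fields != expected_fields:
            # Record which fields were added or went missing.
            problems.append(("format", ts, sorted(fields ^ expected_fields)))
        previous = ts
    return problems


t0 = datetime(2015, 8, 3, 4, 0)
events = [
    {"ts": t0, "url": "/a", "status": 200},
    {"ts": t0 + timedelta(minutes=1), "url": "/b", "status": 200},
    # After a long silence the source comes back with "url" renamed to "uri".
    {"ts": t0 + timedelta(minutes=30), "uri": "/c", "status": 200},
]

problems = check_feed(events, {"url", "status"}, timedelta(minutes=5))
for problem in problems:
    print(problem)
```

A check like this only fires when the next event arrives, so a feed that stops for good also needs a wall-clock timer, which is part of why the problem is less simple than it looks.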

Further analytics-mainly data messes can be found in three broad areas:

  • Problems caused by new or changing data sources hit much faster in analytics than in operations, because analytics draws on a greater variety of data.
  • Event recognition, in which most of a super-high-volume stream is discarded while the “good stuff” is kept, is more commonly a problem in analytics than in pure operations. (That said, it may arise on the boundary of operations and analytics, namely in “real-time” monitoring.)
  • Analytics has major problems with data scavenger hunts, in which business analysts and data scientists don’t know what data is available for them to examine.

That last area is the domain of a lot of analytics innovation. In particular:

  • It’s central to the dubious Gartner concept of a Logical Data Warehouse, and to the more modest logical data layers I advocate as an alternative.
  • It’s been part of BI since the introduction of Business Objects’ “semantic layer”. (See, for example, my recent post on Zoomdata.)
  • It’s a big part of the story of startups such as Alation or Tamr.
  • In a failed effort, it was part of Greenplum’s pitch some years back, as an aspect of the “enterprise data cloud”.
  • It led to some of the earliest differentiated features at Gooddata.
  • It’s implicit in some BI collaboration stories, in some BI/search integration, and in ClearStory’s “Data You May Like”.

Finally, suppose we return to the case of operational data, assumed to be accurately stored in fielded databases, with sufficient data integration technologies in place. There’s still a whole other kind of possible mess than those I cited above — applications may not be doing a good job of understanding and using it. I could write a whole series of posts on that subject alone … but it’s going slowly. :) So I’ll leave that subject area for another time.

Categories: Other