- I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
- The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should at least be thinking about geo-compliance.
- Cassandra is an established leader in multi-data-center operation.
But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.
The story starts:
- Cassandra groups nodes into logical “data centers” (i.e. token rings).
- As a best practice, each physical data center can contain one or more logical data centers, but not vice-versa.
- There are two levels of replication — within a single logical data center, and between logical data centers.
- Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
- However, copies of the database held elsewhere may have different replication factors …
- … and can indeed have different replication factors for different parts of the database.
In particular, a remote replication factor for Cassandra can be 0. When that happens, you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):
- General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
- Instead, you get that effect by tweaking your specific replication settings.
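To make the replication-settings tweak concrete, here’s a minimal Python sketch of the kind of keyspace DDL involved. Cassandra keyspaces declare per-data-center replication via NetworkTopologyStrategy; leaving a logical data center out of the map (the moral equivalent of a replication factor of 0) keeps that keyspace’s data from ever being replicated there. The keyspace and data center names are made up for illustration.

```python
def keyspace_ddl(name, rf_by_dc):
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy.

    Logical data centers with a replication factor of 0 are simply
    omitted, so no replica of this keyspace's data lands there.
    """
    entries = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(rf_by_dc.items()) if rf > 0
    )
    return (
        f"CREATE KEYSPACE {name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {entries}}};"
    )

# EU-only keyspace: replicated 3x in Frankfurt, absent from the US ring.
ddl = keyspace_ddl("eu_customers", {"frankfurt": 3, "virginia": 0})
```

Other keyspaces in the same cluster can replicate everywhere, which is what makes this per-keyspace rather than cluster-wide.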
The most visible DataStax client using this strategy is apparently ING Bank.
If you have a geo-compliance issue, you’re probably also concerned about security. After all, the whole reason the issue arises is because one country’s government might want to look at another country’s citizens’ or businesses’ data. The DataStax security story is approximately:
- Encryption in flight, for any Cassandra.
- Encryption at rest, specifically with DataStax Enterprise.
- No cell-level or row-level security until Cassandra 3.0 is introduced and established. (I didn’t actually ask whether something similar to HBase coprocessors is coming for Cassandra, but that would be my first guess.)
- Various roles and permissions stuff.
While flexible, Cassandra’s multi-data-center features do add some complexity. Tunable-consistency choices are baked into Cassandra programs at each point data is accessed, and more data centers make for more choices. (Default best practice = write if you get a local quorum, running the slight risk of logical data centers being out of sync with each other.)
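The quorum arithmetic behind those choices is simple enough to sketch in Python. The point is that LOCAL_QUORUM counts only the replicas in the coordinator’s own logical data center, which is why it’s fast but risks cross-data-center divergence. Data center names are illustrative.

```python
def quorum(replicas):
    # A quorum is a majority of replicas: floor(n/2) + 1.
    return replicas // 2 + 1

def local_quorum(rf_by_dc, local_dc):
    # LOCAL_QUORUM only counts replicas in the coordinator's own
    # logical data center, so a write can succeed before remote
    # data centers have caught up.
    return quorum(rf_by_dc[local_dc])

rf = {"dc_east": 3, "dc_west": 3}
assert quorum(sum(rf.values())) == 4      # cluster-wide QUORUM: 4 of 6
assert local_quorum(rf, "dc_east") == 2   # LOCAL_QUORUM: 2 of 3
```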
One way in which the whole thing does seem nice and simple is that you can have different logical data centers running on different kinds of platforms — cloud, colocation, in-house, whatever — without Cassandra caring.
I’m not going to call the DataStax Enterprise approach to geo-compliance the “gold standard”, because some of it seems pretty clunky or otherwise feature-light. On the other hand, I’m not aware of competitors who exceed it, in features or track record, so “silver standard” seems defensible.
Basho was on my (very short) blacklist of companies with whom I refuse to speak, because they have lied about the contents of previous conversations. But Tony Falco et al. are long gone from the company. So when Basho’s new management team reached out, I took the meeting.
- Basho management turned over significantly 1-2 years ago. The main survivors from the old team are 1 each in engineering, sales, and services.
- Basho moved its headquarters to Bellevue, WA. (You get one guess as to where the new CEO lives.) Engineering operations are very distributed geographically.
- Basho claims that it is much better at timely product shipments than it used to be. Its newest product has a planned (or at least hoped-for) 8-week cadence for point releases.
- Basho’s revenue is ~90% subscription.
- Basho claims >200 enterprise clients, vs. 100-120 when new management came in. Unfortunately, I forgot to ask the usual questions about divisions vs. whole organizations, OEM sell-through vs. direct, etc.
- Basho claims an average contract value of >$100K, typically over 2-3 years. $9 million of that (which would be close to half the total, actually), comes from 2 particular deals of >$4 million each.
Basho’s product line has gotten a bit confusing, but as best I understand things the story is:
- There’s something called Riak Core, which isn’t even a revenue-generating product. However, it’s an open source project with some big users (e.g. Goldman Sachs, Visa), and included in pretty much everything else Basho promotes.
- Riak KV is the key-value store previously known as Riak. It generates the lion’s share of Basho’s revenue.
- Riak S2 is an emulation of Amazon S3. Basho thinks that Riak KV loses efficiency when objects get bigger than 1 MB or so, and that’s when you might want to use Riak S2 in addition or instead.
- Riak TS is for time series, and just coming out now.
- Also in the mix are some (extra charge) connectors for Redis and Spark. Presumably, there are more of these to come.
- There’s an umbrella marketing term of “Basho Data Platform”.
Technical notes on some of that include:
- Riak Core doesn’t do data management. It just manages distributed operation of — well, whatever you want to operate. In part, Basho sees Riak Core as a better Apache ZooKeeper.
- That is the essence of the Riak/Spark pitch — something better than ZooKeeper for cluster management, and I presume some help in persisting Spark RDDs as well.
- The Riak/Redis pitch is even simpler — cluster management for Redis, and persistent backing as well.
- Basho’s criticisms of ZooKeeper start with “Cluster manager, manage thyself” claims about ZooKeeper availability, as in the PagerDuty ZooKeeper critique.
- Riak KV has secondary indexing. Performance is somewhat questionable. It also has Solr indexing, which is fast.
- At least in its 1.0 form, Riak TS assumes:
- There’s some kind of schema or record structure.
- There are explicit or else easily-inferred timestamps.
- Microsecond accuracy, perfect ordering and so on are not essential.
- Thus, Riak TS 1.0 is not ideal for the classic Splunk use case where you text index/search on a lot of log emissions. It also is not ideal for financial tick storage.
- Riak TS has range-based partitioning, where the range is in terms of time. Basho refers to this as “locality”.
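A minimal Python sketch of that locality idea: rows from the same series and the same time window share a partition key, so a time-range query touches few partitions. The quantum size and key shape here are illustrative, not Riak TS’s actual on-disk format.

```python
def quantum_key(series, ts_ms, quantum_ms):
    """Map a timestamp onto a time-quantum partition.

    Rows from the same series and the same time window share a
    partition, which is what gives range queries their locality.
    """
    return (series, ts_ms // quantum_ms)

QUANTUM = 15 * 60 * 1000  # 15-minute buckets, in milliseconds
a = quantum_key("sensor-7", 1_000_000, QUANTUM)
b = quantum_key("sensor-7", 1_400_000, QUANTUM)
assert a == b  # both timestamps fall in the same 15-minute window
```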
- Riak TS has a SQL subset. Evidently there’s decent flexibility as to which part of the database carries which schema.
- Riak has a nice feature of allowing you to stage a change to network topology before you push it live.
- Riak’s vector clock approach to wide-area synchronization is more controversial.
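For the uninitiated, here’s a minimal Python sketch of the textbook vector-clock rules; Riak’s production implementation is more sophisticated, but the idea is the same. Two clocks that don’t descend from each other represent concurrent updates, i.e. a conflict that somebody has to resolve.

```python
def merge(vc_a, vc_b):
    # Element-wise max: the merged clock dominates both inputs.
    keys = set(vc_a) | set(vc_b)
    return {k: max(vc_a.get(k, 0), vc_b.get(k, 0)) for k in keys}

def descends(vc_a, vc_b):
    # True if vc_a has seen everything vc_b has (no conflict).
    return all(vc_a.get(k, 0) >= v for k, v in vc_b.items())

a = {"node1": 2, "node2": 1}
b = {"node1": 1, "node2": 3}
assert not descends(a, b) and not descends(b, a)  # concurrent: a conflict
assert merge(a, b) == {"node1": 2, "node2": 3}
```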
Finally, notes on what Basho sees as use cases and competition include:
- Riak KV is generally used to store usual-suspect stuff — log data, user/profile data and so on.
- Basho thinks NoSQL is a 4-horse race — Basho/Riak KV, DataStax/Cassandra, MongoDB, Couchbase. (I would be surprised if there was much agreement with that view from, for example, MongoDB, DataStax, Aerospike, MapR or the HBase community.)
- Basho competes on availability, scalability (including across geography) and so on, or in simplest terms:
- “Availability and correctness”
- Simple operation
- Unsurprisingly, Basho thinks its closest competitor is DataStax. (However, DataStax tells me they don’t compete much with Basho.)
- Basho thinks Riak KV has ease-of-operation advantages vs. Cassandra.
- An example of a mission-critical Riak app is the UK National Health Service storing prescription information.
- An example of Riak S2 and Riak KV being used together is Turner Broadcasting storing video in the former and associated metadata in the latter.
- Riak TS is initially pointed at two use cases:
- “Internet of Things”
- “Metrics”, which seems to mean monitoring of system metrics.
- Basho sees the competition for Riak TS as starting with HBase, Cassandra, and InfluxDB.
I last wrote about Couchbase in November, 2012, around the time of Couchbase 2.0. One of the many new features I mentioned then was secondary indexing. Ravi Mayuram just checked in to tell me about Couchbase 4.0. One of the important new features he mentioned was what I think he said was Couchbase’s “first version” of secondary indexing. Obviously, I’m confused.
Now that you’re duly warned, let me remind you of aspects of Couchbase timeline.
- 2 corporate name changes ago, Couchbase was organized to commercialize memcached. memcached, of course, was internet companies’ default way to scale out short-request processing before the rise of NoSQL, typically backed by manually sharded MySQL.
- Couchbase’s original value proposition, under the name Membase, was to provide persistence and of course support for memcached. This later grew into a caching-oriented pitch even to customers who weren’t already memcached users.
- A merger with the makers of CouchDB ensued, with the intention of replacing Membase’s SQLite back end with CouchDB at the same time as JSON support was introduced. This went badly.
- By now, however, Couchbase sells for more than distributed cache use cases. Ravi rattled off a variety of big-name customer examples for system-of-record kinds of use cases, especially in session logging (duh) and also in travel reservations.
- Couchbase 4.0 has been in beta for a few months.
Technical notes on Couchbase 4.0 — and related riffs — start:
- There’s a new SQL-like language called N1QL (pronounced like “nickel”). I’m hearing a lot about SQL-on-NoSQL these days. More on that below.
- “Index”, “data” and “query” are three different services/tiers.
- You can run them all on the same nodes or separately. Couchbase doesn’t have enough experience yet with the technology to know which choice will wind up as a best practice.
- I’m hearing a lot about heterogeneous-node/multi-tier DBMS architectures these days, and would no longer stand by my 2009 statement that they are unusual. Other examples include Oracle Exadata, MySQL, MongoDB (now that it has pluggable storage engines), MarkLogic, and of course the whole worlds of Hadoop and Spark.
- To be clear — the secondary indexes are global, and not tied to the same nodes as the data they index.
- There’s a new back end called ForestDB, but if I understood correctly, it’s used just for the indexes, not for the underlying data.
- ForestDB represents Couchbase indexes in something that resembles b-trees, but also relies on tries. Indeed, if I’m reading the relevant poster correctly, it’s based on a trie of b-trees.
- In another increasingly common trend, Couchbase uses Bloom filters to help decide which partitions to retrieve for any particular query.
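The partition-pruning idea is easy to sketch. This toy Python Bloom filter is nothing like Couchbase’s actual implementation, but it shows the contract that makes pruning safe: the filter may say “maybe present” when a key is absent, but it never says “absent” when the key is present, so skipping partitions whose filters say “no” never loses data.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may say "maybe present", never misses a member."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

# One filter per partition: a query skips partitions whose filter says "no".
partition_filter = BloomFilter()
partition_filter.add("user:42")
assert partition_filter.might_contain("user:42")  # no false negatives
```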
Up to a point, SQL-on-NoSQL stories can be fairly straightforward.
- You define some kind of a table,* perhaps in a SQL-like DDL (Data Description Language).
- SELECT, FROM and WHERE clauses work in the usual way.
- Hopefully, if a column is going to have a lot of WHERE clauses on it, it also has an index.
For example, I think that’s the idea behind most ODBC/JDBC drivers for NoSQL systems. I think it’s also the idea behind most “SQL-like” languages that NoSQL vendors ship.
*Nobody I talk to about this ever wants to call it a “view”, but it sure sounds like a view to me — not a materialized view, of course, but a view nonetheless.
JOIN syntax can actually be straightforward as well under these assumptions. As for JOIN execution, Couchbase pulls all the data into the relevant tier, and does nested-loop execution there. My new clients at SequoiaDB have a similar strategy, by the way, although in their case there’s a hash join option as well.
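The nested-loop strategy is simple enough to sketch in a few lines of Python, here over JSON-ish dicts; the field names are invented. Comparing every left row against every right row is O(n × m), which is exactly why a hash-join option is worth having.

```python
def nested_loop_join(left, right, on):
    """Join two lists of dicts the simple way: compare every pair.

    This is O(len(left) * len(right)); a hash join would instead
    build a lookup table on the join key for one side.
    """
    return [
        {**l, **r}
        for l in left
        for r in right
        if l.get(on) is not None and l.get(on) == r.get(on)
    ]

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 2, "item": "book"}]
joined = nested_loop_join(users, orders, on="id")
assert joined == [{"id": 2, "name": "bob", "item": "book"}]
```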
But if things stopped there, they would miss an important complication: NoSQL has nested data. I.e., a value can actually be an array, whose entries are arrays themselves, and so on. That said, the “turtles all the way down” joke doesn’t quite apply, because at some point there are actual scalar or string values, and those are the ones SQL wants to actually operate on.
Most approaches I know of to that problem boil down to identifying particular fields as table columns, with or without aliases/renaming; I think that’s the old Hadapt/Vertica strategy, for example. Couchbase claims to be doing something a little different however, with a SQL-extending operator called UNNEST. Truth be told, I’m finding the N1QL language reference a bit terse, and haven’t figured out what the practical differences vs. the usual approach are, if any. But it sounds like there may be some interesting ideas in there somewhere.
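As a guess at the flavor of what an UNNEST-style operator does — and I’m not claiming this matches N1QL’s exact semantics — here’s a Python sketch that flattens one output row per element of a nested array, turning nested data into something joinable. The document shape is invented.

```python
def unnest(docs, array_field):
    """Emit one row per element of a nested array, the way an
    UNNEST-style operator turns nested data into joinable rows."""
    rows = []
    for doc in docs:
        for element in doc.get(array_field, []):
            row = {k: v for k, v in doc.items() if k != array_field}
            row[array_field] = element
            rows.append(row)
    return rows

orders = [{"order": 1, "items": [{"sku": "a"}, {"sku": "b"}]}]
rows = unnest(orders, "items")
assert rows == [
    {"order": 1, "items": {"sku": "a"}},
    {"order": 1, "items": {"sku": "b"}},
]
```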
1. European Union data sovereignty laws have long had a “Safe Harbour” rule stating it was OK to ship data to the US. Per the case Maximilian Schrems v Data Protection Commissioner, this rule is now held to be invalid. Angst has ensued, and rightly so.
The core technical issues are roughly:
- Data is usually in one logical database. Data may be replicated locally, for availability and performance. It may be replicated remotely, for availability, disaster recovery, and performance. But it’s still usually logically in one database.
- Now remote geographic partitioning may be required by law. Some technologies (e.g. Cassandra) support that for a single logical database. Some don’t.
- Even under best circumstances, hosting and administrative costs are likely to be higher when a database is split across more geographies (especially when the count is increased from 1 to 2).
Facebook’s estimate of billions of dollars in added costs is not easy to refute.
My next set of technical thoughts starts:
- This is about data storage, not data use; for example, you can analyze Austrian data in the US, but you can’t store it there.
- Of course, that can be a tricky distinction to draw. We can only hope that intermediate data stores, caches and so on can be allowed to use data from other geographies.
- Assuming the law is generous in this regard, scan-heavy analytics are more problematic than other kinds.
- But if there are any problems in those respects — well, if analytics can be parallelized in general, then in particular one should be able to parallelize across geographies. (Of course, this could require replicating one’s whole analytic stack across geographies.)
2. US law enforcement is at loggerheads with major US tech companies, because it wants the right to subpoena data stored overseas. The central case here is a request to get at Microsoft’s customer data stored in Ireland. A government victory would be catastrophic for the US tech industry, but I’m hopeful that sense will — at least to some extent — prevail.
3. Ed Snowden, Glenn Greenwald and numerous other luminaries are pushing something called the Snowden Treaty, as a model for how privacy laws should be set up. I’m a huge fan of what Snowden and Greenwald have done in general, but this particular project has not started well. First, they’ve rolled the thing out while actually giving almost no details, so they haven’t really contributed anything except a bit of PR. Second, one of the few details they did provide contains a horrific error.
Specifically, they “demand”
freedom from damaging publicity, public scrutiny …
To that I can only say: “Have you guys lost your minds???????” As written, that’s a demand that can only be met by censorship laws. I’m sure this error is unintentional, because Greenwald is in fact a stunningly impassioned and articulate opponent of censorship. Even so, that’s an appallingly careless mistake, which for me casts the whole publicity campaign into serious doubt.
4. As a general rule — although the details of course depend upon where you live — it is no longer possible to move around and be confident that you won’t be tracked. This is true even if you’re not a specific target of surveillance. Ways of tracking your movements include but are not limited to:
- Electronic records of you paying public transit fares or tolls, as relevant. (Ditto rental car fees, train or airplane tickets, etc.)
- License plate cameras, which in the US already have billions of records on file.
- Anything that may be inferred from your mobile phone.
5. The previous point illustrates that the strong form of the Snowden Treaty is a pipe dream — it calls for a prohibition on mass surveillance, and that will never happen, because:
- Governments will insist on trying to prevent “terrorism” before the fact. That mass surveillance is generally lousy at doing so won’t keep them from trying.
- Governments will insist on being able to do general criminal forensics after the fact. So they’ll want mass surveillance data sitting around just in case they find that they need it.
- Businesses share consumers’ transaction and interaction data, and such sharing is central to the current structure of the internet industry. That genie isn’t going back into the bottle. Besides, if it did, a few large internet companies would have even more of an oligopolistic advantage vs. the others than they now do.
The huge problem with these truisms, of course, is scope creep. Once the data exists, it can be used for many more purposes than the few we’d all agree are actually OK.
6. That, in turn, leads me back to two privacy posts that I like to keep reminding people of, because they make points that aren’t commonly found elsewhere:
- The essential questions of fair data use, in which I point out such a long list of legal issues that almost everybody has overlooked some of them.
- Very chilling effects, in which I point out how damaging surveillance can be when there’s even a possibility of adverse consequence.
Whether or not you basically agree with me about privacy and surveillance, those two posts may help flesh out whatever your views on the subject actually are.
1. The rise of SAP (and later Siebel Systems) was greatly helped by Andersen Consulting, even before it was split off from the accounting firm and renamed as Accenture. My main contact in that group was Rob Kelley, but it’s possible that Brian Sommer was even more central to the industry-watching part of the operation. Brian is still around, and he just leveled a blast at the ERP* industry, which I encourage you to read. I agree with most of it.
*Enterprise Resource Planning
Brian’s argument, as I interpret it, boils down mainly to two points:
- Big ERP companies selling big ERP systems are pathetically slow at adding new functionality. He’s right. My favorite example is the multi-decade slog to integrate useful analytics into operational apps.
- The world of “Big Data” is fundamentally antithetical to the design of current-generation ERP systems. I think he’s right in that as well.
I’d add that SaaS (Software As A Service)/on-premises tensions aren’t helping incumbent vendors either.
But no article addresses all the subjects it ideally should, and I’d like to call out two omissions. First, what Brian said is in many cases applicable just to large and/or internet-first companies. Plenty of smaller, more traditional businesses could get by just fine with no more functionality than is in “Big ERP” today, if we stipulate that it should be:
- Delivered via SaaS.
- Much easier to adopt and use.
Second, even within the huge enterprise/huge app vendor world, it’s not entirely clear how integrated ERP supposedly is or isn’t with CRM (Customer Relationship Management). And a lot of what Brian talks about fits pretty cleanly into the CRM bucket.
2. In any case, there are many application areas that — again assuming that we’re in the large enterprise or large internet company world — fit well neither with classical ERP nor with its CRM sibling. For starters, investigative analytics doesn’t fit well into packaged application suites, for a myriad of reasons, the most basic of which are:
- The whole point of investigative analytics is to discover things that are new. Therefore, business processes are inherently unpredictable.
- So are data inputs.
If somebody does claim to be selling an app in investigative analytics, it is usually really an analytic application subsystem or else something very disconnected from other apps. Indeed, in almost all cases it’s both.
3. When it comes to customer-facing websites, I stand by my arguments three years ago in the post just linked above, which boil down to:
- What I just said above about investigative analytics, plus the observation that …
- … websites have a strong creative aspect that fits badly with soup-to-nuts packaged applications.
Also, complex websites are likely to rely on dynamic schemas, and packaged apps have trouble adapting to those.
4. This is actually an example of a more general point — packaged or SaaS apps generally assume rather fixed schemas. (The weasel word “rather” is included to allow for customization-through-configuration, but I think the overall point holds.) Indeed, database design is commonly the essence of packaged app technology.
5. However, those schemas do not have to be relational. It would be inaccurate to say that packaged apps always assume tabular data, because of examples such as:
- SAP has built on top of quasi-objects for a long time, although the underpinnings are technically relational.
- There are some cases of building entirely on an object-oriented or hierarchical data model, especially in health care.
- Business has some inherent hierarchies that get reflected in data structures, e.g. in bills of materials or organization charts.
But even non-tabular data structures are, in the minds of app developers, usually assumed to have fixed schemas.