
MongoDB update

Thu, 2015-09-10 04:33

One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:

  • >2,000 named customers, the vast majority of which are unique organizations that do business with MongoDB directly.
  • ~75,000 users of MongoDB Cloud Manager.
  • Estimated ~1/4 million production users of MongoDB total.

Also >530 staff, and I think that number is a little out of date.

MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:

  • Some JOIN capabilities (see the first sketch after this list).
    • Specifically, these are left outer joins, so they’re for lookup but not for filtering.
    • JOINs are not restricted to specific shards of data …
    • … but do benefit from data co-location when it occurs.
  • A BI connector. Think of this as a MongoDB-to-SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is (illustrated in the second sketch after this list):
    • Basic SQL comes in.
    • Filters and GroupBys are pushed down to MongoDB. A result set … well, it results. :)
    • The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
  • Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document (see the third sketch after this list).
    • This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
    • MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
    • MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
    • MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
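
To make the JOIN point concrete, here is a minimal sketch of the new $lookup aggregation stage, called from Python via pymongo. The "orders" and "customers" collections and their fields are my own hypothetical examples, not MongoDB's.

```python
# A minimal sketch of MongoDB 3.2's left outer join via the $lookup
# aggregation stage, assuming pymongo and a local mongod. The "orders"
# and "customers" collections and their fields are hypothetical.
from pymongo import MongoClient

db = MongoClient().test

pipeline = [
    {"$lookup": {
        "from": "customers",          # collection to join against
        "localField": "customer_id",  # field in the "orders" documents
        "foreignField": "_id",        # field in the "customers" documents
        "as": "customer",             # matching docs land in this array field
    }}
]

# Orders with no matching customer still come back, just with an empty
# "customer" array; that is what makes this a left outer join, useful
# for lookup but not for filtering.
for order in db.orders.aggregate(pipeline):
    print(order)
```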
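
The BI connector itself was not available to inspect, but the pushdown idea is easy to illustrate. The following is a hypothetical sketch of how a simple SQL query could map onto an aggregation pipeline: the WHERE clause becomes a $match stage and the GROUP BY becomes a $group stage.

```python
# Purely illustrative: this is not the connector itself, just the shape of
# the translation it performs. Assuming pymongo and a hypothetical "sales"
# collection, a query like
#
#   SELECT region, COUNT(*) FROM sales WHERE year = 2015 GROUP BY region
#
# could run inside MongoDB as:
from pymongo import MongoClient

db = MongoClient().test

pipeline = [
    {"$match": {"year": 2015}},                         # the pushed-down filter
    {"$group": {"_id": "$region", "ct": {"$sum": 1}}},  # the pushed-down GroupBy
]

# The result set is then formatted into rows and handed back to whatever
# sent the SQL, e.g. a business intelligence tool.
rows = list(db.sales.aggregate(pipeline))
```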

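And here is a minimal sketch of what setting up document validation looks like, under the same pymongo assumptions; the "contacts" collection and its rules are made up. Note the validationAction option, which is where the strict/relaxed choice lives.

```python
# A minimal sketch of 3.2's document validation, again assuming pymongo.
# The "contacts" collection and its rules are hypothetical. The validator
# is field-specific rules combined into one expression, with no
# dependencies among fields.
from pymongo import MongoClient

db = MongoClient().test

db.create_collection(
    "contacts",
    validator={
        "email": {"$exists": True},       # a rule on one field ...
        "age": {"$gte": 0, "$lte": 150},  # ... and on another
    },
    # "error" = strict enforcement (bad writes are rejected);
    # "warn" = relaxed enforcement (bad writes are just logged), which is
    # what lets you turn this on without breaking a running system.
    validationAction="warn",
)
```
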
There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. 

  • The name will change, in part because if you try to search on that name you’ll probably find an unrelated Scout. :)
  • Scout samples data, runs stats, and all that stuff.
  • Scout is referred to as a “schema introspection” tool, but I’m not sure why; schema introspection sounds more like a feature or architectural necessity than an actual product.

As for storage engines:

  • WiredTiger, which was the biggest deal in MongoDB 3.0, will become the default in 3.2. I continue to think analogies to InnoDB are reasonably appropriate.
  • An in-memory storage engine option was also announced with MongoDB 3.0. Now there’s a totally different in-memory option. However, details were not available at posting time. Stay tuned.
  • Yet another MongoDB storage engine, based on or akin to WiredTiger, will do encryption. Presumably, overhead will be acceptably low. Key management and all that will be handled by usual-suspect third parties.

Finally — most data management vendors brag to me about how important their text search option is, although I’m not necessarily persuaded. :) MongoDB does have built-in text search, of course, of which I can say:

  • It’s a good old-fashioned TF/IDF algorithm. (Term Frequency/Inverse Document Frequency.)
  • About the fanciest stuff they do is tokenization and stemming. (In a text search context, tokenization amounts to the identification of word boundaries and the like. Stemming is noticing that alternate forms of the same word really are the same thing.)

This level of technology was easy to get in the 1990s. One thing that’s changed in the intervening decades, however, is that text search commonly supports more languages. MongoDB offers stemming in 8 or 9 languages for free, plus a paid option via Basis for still more languages.
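
For the curious, here is roughly what using the built-in text search looks like from pymongo; the "articles" collection, its "body" field, and the choice of Spanish stemming are all made-up examples.

```python
# A small sketch of MongoDB's built-in text search, assuming pymongo.
from pymongo import MongoClient

coll = MongoClient().test.articles

# One text index per collection; the language option controls stemming.
coll.create_index([("body", "text")], default_language="spanish")

# $text tokenizes and stems the search terms; the $meta projection exposes
# the TF/IDF-style relevance score so results can be ranked by it.
for doc in coll.find(
    {"$text": {"$search": "gato"}},
    {"score": {"$meta": "textScore"}},
).sort([("score", {"$meta": "textScore"})]):
    print(doc)
```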


Multi-model database managers

Mon, 2015-08-24 02:07

I’d say:

  • Multi-model database management has been around for decades. Marketers who say otherwise are being ridiculous.
  • Thus, “multi-model”-centric marketing is the last refuge of the incompetent. Vendors who say “We have a great DBMS, and by the way it’s multi-model (now/too)” are being smart. Vendors who say “You need a multi-model DBMS, and that’s the reason you should buy from us” are being pathetic.
  • Multi-logical-model data management and multi-latency-assumption data management are greatly intertwined.

Before supporting my claims directly, let me note that this is one of those posts that grew out of a Twitter conversation. The first round went:

Merv Adrian: 2 kinds of multimodel from DBMS vendors: multi-model DBMSs and multimodel portfolios. The latter create more complexity, not less.

Me: “Owned by the same vendor” does not imply “well integrated”. Indeed, not a single example is coming to mind.

Merv: We are clearly in violent agreement on that one.

Around the same time I suggested that InterSystems Caché was the last significant object-oriented DBMS, only to get the pushback that they were “multi-model” as well. That led to some reasonable-sounding justification — although the buzzwords of course aren’t from me — namely:

Caché supports #SQL, #NoSQL. Interchange across tables, hierarchical, document storage.

Along the way, I was reminded that some of the marketing claims around “multi-model” are absurd. For example, at the time I am writing this, the Wikipedia article on “multi-model database” claims that “The first multi-model database was OrientDB, created in 2010…” In fact, however, by the definitions used in that article, multi-model DBMSs date back to the 1980s, when relational functionality was grafted onto pre-relational systems such as TOTAL and IDMS.

What’s more, since the 1990s, multi-model functionality has been downright common, specifically in major products such as Oracle, DB2 and Informix, not to mention PostgreSQL. (But not so much Microsoft or Sybase.) Indeed, there was significant SQL standards work done around datatype extensions, especially in the contexts of SQL/MM and SQL3.

I tackled this all in 2013, and developments since then have been in line with what I argued there. For example, Spark added DataFrames, which promise substantial data model flexibility for Spark use cases, but more mature products have progressed in a more deliberate way.

What’s new in all this is a growing desire to re-integrate short-request and analytic processing — hence Gartner’s new-ish buzzword of HTAP (Hybrid Transactional/Analytic Processing). The more sensible reasons for this trend are:

  • Operational applications have always needed to accept immediate writes. (Losing data is bad.)
  • Operational applications have always needed to serve small query result sets based on the freshest data. (If you write something into a database, you might need to immediately retrieve it to finish the business operation.)
  • It is increasingly common for predictive decisions to be made at similar speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
  • The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
  • … most such analysis should look at historical data as well.
  • Streaming technology is supplying ever more fresh data.

But here’s the catch — the best models for writing data are the worst for reading it, and vice-versa, because you want to write data as a lightly-structured document or log, but read it from a Ted-Codd-approved RDBMS or MOLAP system. And if you don’t have the time to move data among multiple stores, then you want one store to do a decent job of imitating both kinds of architecture. The interesting new developments in multi-model data management will largely be focused on that need.
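
To make that tension concrete, here is a toy sketch, not modeled on any particular product: writes go to an append-only log of schema-light documents, while reads are served from an aggregate rebuilt out of that log. All names here are made up.

```python
# A toy illustration of the write-vs-read tension: the write side appends
# loosely structured documents to a log, and a separate step re-shapes the
# log into a Codd-friendly grouped table for reading.
from collections import defaultdict

log = []  # the write-optimized side: an append-only list of documents

def write(event: dict) -> None:
    log.append(event)  # cheap and immediate, because losing data is bad

def rebuild_rollup() -> dict:
    """The read-optimized side: regroup the log, as a table would."""
    rollup = defaultdict(float)
    for e in log:
        rollup[e.get("region", "unknown")] += e.get("amount", 0.0)
    return dict(rollup)

write({"region": "EMEA", "amount": 12.5})
write({"region": "APAC", "amount": 7.0, "note": "fields can vary per doc"})
print(rebuild_rollup())  # {'EMEA': 12.5, 'APAC': 7.0}
```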

Related links

  • The two-policemen joke seems ever more relevant.
  • My April, 2015 post on indexing technology reminds us that one DBMS can do multiple things.
  • Back in 2009 integrating OLTP and data warehousing was clearly a bad idea.