
Curt Monash

Choices in data management and analysis

Teradata bought Hadapt and Revelytix

Wed, 2014-07-23 02:29

My client Teradata bought my (former) clients Revelytix and Hadapt.* Obviously, I’m in confidentiality up to my eyeballs. That said — Teradata truly doesn’t know what it’s going to do with those acquisitions yet. Indeed, the acquisitions are too new for Teradata to have fully reviewed the code and so on, let alone made strategic decisions informed by that review. So while this is just a guess, I conjecture Teradata won’t say anything concrete until at least September, although I do expect some kind of stated direction in time for its October user conference.

*I love my business, but it does have one distressing aspect, namely the combination of subscription pricing and customer churn. When your customers transform really quickly, or even go out of existence, their reliance on you can sometimes vanish just as fast.

I’ve written extensively about Hadapt, but to review:

  • The HadoopDB project was started by Dan Abadi and two grad students.
  • HadoopDB tied a bunch of PostgreSQL instances together with Hadoop MapReduce. Lab benchmarks suggested it was more performant than the coyly named DBx (where x=2), but not necessarily competitive with top analytic RDBMS.
  • Hadapt was formed to commercialize HadoopDB.
  • After some fits and starts, Hadapt settled in as a Cambridge-based company. Former Vertica CEO Chris Lynch invested even before he was a VC, and became an active chairman. Not coincidentally, Hadapt had a bunch of Vertica folks.
  • Hadapt decided to stick with row-based PostgreSQL, Dan Abadi’s previous columnar enthusiasm notwithstanding. Not coincidentally, Hadapt’s performance never blew anyone away.
  • Especially after the announcement of Cloudera Impala, Hadapt’s SQL-on-Hadoop positioning didn’t work out. Indeed, Hadapt laid off most or all of its sales and marketing folks. Hadapt pivoted to emphasize its schema-on-need story.
  • Chris Lynch, who generally seems to think that IT vendors are created to be sold, shopped Hadapt aggressively.

As for what Teradata should do with Hadapt:

  • My initial thought about Hadapt was to just double down, pushing the technology forward, presumably including a columnar option such as the one Citus Data developed.
  • But upon reflection, if it made technical sense to merge the Aster and Hadapt products, that would be better yet.

I herewith apologize to Aster co-founder and Hadapt skeptic Tasso Argyros (who by the way has moved on from Teradata) for even suggesting such heresy. :)

Complicating the story further:

  • Impala lets you treat data in HDFS (Hadoop Distributed File System) as if it were in a SQL DBMS. So does Teradata SQL-H. But Hadapt makes you decide whether the data is in HDFS or the SQL DBMS, and it can’t be in both at once. Edit: Actually, see Dan Abadi’s comments below.
  • Impala and Oracle’s new SQL-H competitor have daemons running on every data node. So does one option in Hadapt. But I don’t think SQL-H does that yet.

I was less involved with Revelytix than with Hadapt (although I’m told I served as the “catalyst” for the original Teradata/Revelytix partnership). That said, Teradata — like Oracle — is always building out a data integration suite to cover a limited universe of data stores. And Revelytix’ dataset management technology is a nice piece toward an integrated data catalog.


The point of predicate pushdown

Tue, 2014-07-15 07:52

Oracle is announcing today what it’s calling “Oracle Big Data SQL”. As usual, I haven’t been briefed, but highlights seem to include:

  • Oracle Big Data SQL is basically data federation using the External Tables capability of the Oracle DBMS.
  • Unlike independent products — e.g. Cirro — Oracle Big Data SQL federates SQL queries only across Oracle offerings, such as the Oracle DBMS, the Oracle NoSQL offering, or Oracle’s Cloudera-based Hadoop appliance.
  • Also unlike independent products, Oracle Big Data SQL is claimed to be compatible with Oracle’s usual security model and SQL dialect.
  • At least when it talks to Hadoop, Oracle Big Data SQL exploits predicate pushdown to reduce network traffic.

And by the way – Oracle Big Data SQL is NOT “SQL-on-Hadoop” as that term is commonly construed, unless the complete Oracle DBMS is running on every node of a Hadoop cluster.
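
For flavor, here is a hedged sketch of what that federation looks like from the client side, assuming an external table over Hadoop data has already been defined by a DBA. The connection details and table names (customers, web_clicks_ext) are hypothetical; the point is simply that the application sees one Oracle connection and one SQL dialect.

    # Hypothetical client-side view of federation through External Tables.
    # web_clicks_ext is assumed to be an external table mapped to Hadoop data;
    # customers is an ordinary Oracle table. Names and credentials are made up.

    import cx_Oracle   # any Oracle DB-API driver would do

    conn = cx_Oracle.connect("scott", "tiger", "dw-host/orcl")
    cur = conn.cursor()

    # To the application this is an ordinary join. It neither knows nor cares
    # that web_clicks_ext lives in HDFS, or that filter predicates may be
    # pushed down to the Hadoop side.
    cur.execute("""
        SELECT c.customer_id, COUNT(*)
        FROM   customers c
        JOIN   web_clicks_ext w ON w.customer_id = c.customer_id
        WHERE  w.click_date >= DATE '2014-07-01'
        GROUP  BY c.customer_id
    """)
    for customer_id, clicks in cur:
        print(customer_id, clicks)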

Predicate pushdown is actually a simple concept:

  • If you issue a query in one place to run against a lot of data that’s in another place, you could spawn a lot of network traffic, which could be slow and costly. However …
  • … if you can “push down” parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic.

“Predicate pushdown” gets its name from the fact that portions of SQL statements, specifically ones that filter data, are properly referred to as predicates. They earn that name because predicates in mathematical logic and clauses in SQL are the same kind of thing — statements that, upon evaluation, can be TRUE or FALSE for different values of variables or data.
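
To make the traffic arithmetic concrete, here is a minimal Python sketch, with a toy StorageNode class standing in for wherever the data lives. Everything in it is a made-up illustration rather than any vendor's actual interface.

    # A minimal sketch of predicate pushdown, with plain Python standing in for
    # a query coordinator and a few storage nodes.

    class StorageNode:
        def __init__(self, rows):
            self.rows = rows                       # this node's slice of the data

        def read_all_rows(self):
            return list(self.rows)                 # ships every row to the caller

        def read_rows_where(self, predicate):
            return [r for r in self.rows if predicate(r)]   # filters locally first

    def query_without_pushdown(nodes, predicate):
        # All rows cross the "network"; the filter runs at the coordinator.
        fetched = [r for n in nodes for r in n.read_all_rows()]
        return [r for r in fetched if predicate(r)]

    def query_with_pushdown(nodes, predicate):
        # The predicate is shipped to each node; only matching rows come back.
        return [r for n in nodes for r in n.read_rows_where(predicate)]

    nodes = [StorageNode([{"region": "EMEA", "amount": i} for i in range(1000)]),
             StorageNode([{"region": "APAC", "amount": i} for i in range(1000)])]

    def big_sale(row):
        return row["amount"] > 990

    # Same answer either way; the pushed-down version moved 18 rows instead of 2,000.
    assert query_without_pushdown(nodes, big_sale) == query_with_pushdown(nodes, big_sale)

In a real system the predicate travels as a serialized expression (a SQL fragment or a filter specification) rather than a Python function, but the payoff is the same: filtering happens before the data crosses the network.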

The most famous example of predicate pushdown is Oracle Exadata, with the story there being:

  • Oracle’s shared-everything architecture created a huge I/O bottleneck when querying large amounts of data, making Oracle inappropriate for very large data warehouses.
  • Oracle Exadata added a second tier of servers each tied to a subset of the overall storage; certain predicates are pushed down to that tier.
  • The I/O between Exadata’s two sets of servers is tolerable, and so Oracle is now often competitive in the high-end data warehousing market.

Oracle evidently calls this “SmartScan”, and says Oracle Big Data SQL does something similar with predicate pushdown into Hadoop.

Oracle also hints at using predicate pushdown to do non-tabular operations on the non-relational systems, rather than shoehorning operations on multi-structured data into the Oracle DBMS, but my details on that are sparse.


21st Century DBMS success and failure

Mon, 2014-07-14 00:37

As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.

DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.

In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.

Buyer inertia is a greater concern.

  • A significant minority of enterprises are highly committed to their enterprise DBMS standards.
  • Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
  • FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.

A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.

  • First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
  • Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
  • Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
  • Meanwhile, Hadoop has emerged as a serious competitor for at least some analytic data management, especially but not only at internet companies.

Otherwise I’d say: 

  • At large enterprises, their internet operations perhaps excepted:
    • Short-request/general-purpose SQL alternatives to the behemoths — e.g. MySQL, PostgreSQL, NewSQL — have had tremendous difficulty getting established. The last big success was the rise of Microsoft SQL Server in the 1990s. That’s why I haven’t mentioned the term mid-range DBMS in years.
    • NoSQL/non-SQL has penetrated large enterprises mainly for a few specific use cases, for example the lists I posted for MongoDB or graph databases.
  • Internet-only companies have few inertia issues when it comes to database managers. They’ll consider anything they regard as being in their price ballpark (which is however often restricted to open source). I think part of the reason is that as quickly as they rewrite their applications, DBMS are vastly less “strategic” to them than they are to most larger enterprises.
  • The internet operations of large companies — especially large retailers — in many cases behave like internet-only companies, but in many other cases behave like the rest of the enterprise.

The major reasons for DBMS categories to get established in the first place are:

  • Performance and/or scalability (many examples).
  • Developer features (for example dynamic schema).
  • License/maintenance cost (for example several open source categories).
  • Ease of installation and administration (for example open source again, and also data warehouse appliances).

Those same characteristics are major bases for competition among members of a new category, although as noted above behemoth-loyalty can also come into play.

Cool-vs.-weird tradeoffs are somewhat secondary among SQL DBMS.

  • There’s not much of a “cool” factor, because new products aren’t that different in what they do vs. older ones.
  • There’s not a terrible “weird” factor either, but of course any smaller offering faces FUD, and also …
  • … appliances are anti-strategic for many buyers, especially ones who demand a smooth path to the cloud.

They’re huge, however, in the non-SQL world. Most non-SQL data managers have a major “weird” factor. Fortunately, NoSQL and Hadoop both have huge “cool” cred to offset it. XML/XQuery unfortunately did not.

Finally, in most DBMS categories there are massive issues with product completeness, more in the area of maturity than that of whole product. The biggest whole product issues are concentrated on the matter of interoperating with other software — business intelligence tools, packaged applications (if relevant to the category), etc. Most notably, the handful of DBMS that are certified to run SAP share a huge market that other DBMS can’t touch. But BI tools are less of a differentiator — I yawn when vendors tell me they are certified for/partnered with MicroStrategy, Tableau, Pentaho and Jaspersoft, and I’m surprised at any product that isn’t.

DBMS maturity has a lot of aspects, but the toughest challenges are concentrated in two main areas:

  • Reliability, especially but not only in short-request use cases.
  • Performance across a great variety of use cases. I observe frequently that performance in best-case scenarios, performance in the lab and performance in real-world environments are much further apart than vendors like to think.

In particular:

  • Maturity demands seem to be much higher for SQL DBMS than for NoSQL.
    • I think this is one of several reasons NoSQL has been much more successful than NewSQL.
    • It’s why I think MarkLogic’s “Enterprise NoSQL” positioning is a mistake.
  • As for MySQL:
    • MySQL wasn’t close to reliable enough for enterprises to trust it until InnoDB became the default storage engine.
    • MySQL 5 point releases have added major features, or decent performance for major features. I’ll confess to having lost track of what’s been fixed and what’s still missing.
    • In saying all that I’m holding MySQL to a much higher maturity standard than I’m holding NoSQL — because that’s what I think enterprise customers do.
  • PostgreSQL “should” be doing a lot better than it is. I have an extremely low opinion of its promoters, and not just for personal reasons. (That said, the personal reasons don’t just apply to EnterpriseDB anymore. I’ve also run out of patience waiting for Josh Berkus to retract untruths he posted about me years ago.)
  • SAP HANA checks boxes for performance (In-memory rah rah rah!!) and whole product (Runs SAP!!). That puts it well ahead of most other newish SQL DBMS, purely analytic ones perhaps excepted.
  • Any other new short-request SQL DBMS that sounds like it has traction is also memory-centric.
  • Analytic RDBMS are in most respects held to lower maturity standards than DBMS used for write-intensive workloads. Even so, products in the category are still frequently tripped up by considerations of concurrent performance and mixed workload management.

Related links

There have been 1,470 previous posts in the 9-year history of this blog, many of which could serve as background material for this one.

MongoDB is growing up

Thu, 2014-04-17 02:56

I caught up with my clients at MongoDB to discuss the recent MongoDB 2.6, along with some new statements of direction. The biggest takeaway is that the MongoDB product, along with the associated MMS (MongoDB Management Service), is growing up. Aspects include:

  • An actual automation and management user interface, as opposed to the current management style, which is almost entirely via scripts (except for the monitoring UI).
    • That’s scheduled for public beta in May, and general availability later this year.
    • It will include some kind of integrated provisioning with VMware, OpenStack, et al.
    • One goal is to let you apply database changes, software upgrades, etc. without taking the cluster down.
  • A reasonable backup strategy.
    • A snapshot copy is made of the database.
    • A copy of the log is streamed somewhere.
    • Periodically — the default seems to be 6 hours — the log is applied to create a new current snapshot.
    • For point-in-time recovery, you take the last snapshot prior to the point, and roll forward to the desired point. (A small sketch of this roll-forward appears after the list.)
  • A reasonable locking strategy!
    • Document-level locking is all-but-promised for MongoDB 2.8.
    • That means what it sounds like. (I mention this because sometimes an XML database winds up being one big document, which leads to confusing conversations about what’s going on.)
  • Security. My eyes glaze over at the details, but several major buzzwords have been checked off.
  • A general code rewrite to allow for (more) rapid addition of future features.
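
As a minimal sketch of the point-in-time recovery arithmetic from the backup bullets above, assuming toy data structures rather than MongoDB's actual snapshot and oplog formats:

    # Toy point-in-time recovery: start from the last snapshot at or before the
    # target time, then replay log entries up to (and not past) that time.
    # The data structures are hypothetical, purely for illustration.

    def restore_to_point_in_time(snapshots, log_entries, target_ts):
        # snapshots:   list of (timestamp, dict_of_documents), oldest first
        # log_entries: list of (timestamp, doc_id, new_value), oldest first
        base_ts, base_docs = max(
            (s for s in snapshots if s[0] <= target_ts), key=lambda s: s[0]
        )
        docs = dict(base_docs)
        for ts, doc_id, new_value in log_entries:
            if base_ts < ts <= target_ts:      # roll forward just the relevant slice
                docs[doc_id] = new_value
        return docs

    snapshots = [(0, {"a": 1}), (6, {"a": 4, "b": 10})]        # e.g. one every 6 hours
    log = [(2, "a", 2), (3, "a", 3), (5, "a", 4), (5, "b", 10), (7, "b", 11)]

    assert restore_to_point_in_time(snapshots, log, target_ts=3) == {"a": 3}
    assert restore_to_point_in_time(snapshots, log, target_ts=7) == {"a": 4, "b": 11}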

Of course, when a DBMS vendor rewrites its code, that’s a multi-year process. (I think of it at Oracle as spanning 6 years and 2 main-number releases.) With that caveat, the MongoDB rewrite story is something like:

  • Updating has been reworked. Most of the benefits are coming later.
  • Query optimization and execution have been reworked. Most of the benefits are coming later, except that …
  • … you can now directly filter on multiple indexes in one query; previously you could only simulate doing that by pre-building a compound index. (A brief pymongo illustration appears after this list.)
  • One of those future benefits is more index types, for example R-trees or inverted lists.
  • Concurrency improvements are down the road.
  • So are rewrites of the storage layer, including the introduction of compression.
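
Here is a small pymongo illustration of the indexing point in the list above (presumably 2.6's index intersection), with made-up collection and field names. Before 2.6 you would typically pre-build the compound index shown first; as of 2.6 the planner can combine the two single-field indexes to serve the same query.

    from pymongo import MongoClient, ASCENDING

    orders = MongoClient()["shop"]["orders"]   # hypothetical database and collection

    # The old workaround: one compound index anticipating the exact query shape.
    orders.create_index([("customer_id", ASCENDING), ("status", ASCENDING)])

    # The 2.6-era alternative: two independent single-field indexes, which the
    # planner may combine (index intersection) for the query below.
    orders.create_index([("customer_id", ASCENDING)])
    orders.create_index([("status", ASCENDING)])

    open_orders = orders.find({"customer_id": 12345, "status": "open"})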

Also, you can now straightforwardly transform data in a MongoDB database and write it into new datasets, something that evidently wasn’t easy to do before.
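
That presumably refers to the aggregation pipeline's $out stage, new in 2.6, which writes a pipeline's output into another collection. A hedged pymongo sketch, with made-up names:

    from pymongo import MongoClient

    db = MongoClient()["shop"]   # hypothetical database

    db.orders.aggregate([
        {"$match": {"status": "shipped"}},                  # filter
        {"$group": {"_id": "$customer_id",                  # transform
                    "total": {"$sum": "$amount"}}},
        {"$out": "shipped_totals"},                         # write into a new collection
    ])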

One thing that MongoDB is not doing is offering any ODBC/JDBC or other SQL interfaces. Rather, there’s some other API — I don’t know the details — whereby business intelligence tools or other systems can extract views, and a few BI vendors evidently are doing just that. In particular, MicroStrategy and QlikView were named, as well as a couple of open source usual suspects.

As of 2.6, MongoDB seems to have a basic integrated text search capability — which however does not rise to the search functionality level that was in Oracle 7.3.2. In particular:

  • 15 Western languages are supported with stopwords, tokenization, etc.
  • Search predicates can be mixed into MongoDB queries.
  • The search language isn’t very rich; for example, it lacks WHERE NEAR semantics.
  • You can’t tweak the lexicon yourself.
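
A quick pymongo illustration of those points, with hypothetical names: create a text index, then mix a $text predicate into an otherwise ordinary query.

    from pymongo import MongoClient, TEXT

    articles = MongoClient()["news"]["articles"]   # hypothetical database and collection
    articles.create_index([("body", TEXT)])        # one text index per collection

    # A $text search predicate combined with a normal equality filter.
    results = articles.find({
        "$text": {"$search": "hadoop analytics"},
        "published": True,
    })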

And finally, some business and pricing notes:

  • Two big aspects of the paid-versus-free version of MongoDB (the product line) are:
    • Security.
    • Management tools.
  • Well, actually, you can get the management tools for free, but only on a SaaS basis from MongoDB (the company).
    • If you want them on premises or in your part of the cloud, you need to pay.
    • If you want MongoDB (the company) to maintain your backups for you, you need to pay.
  • Customer counts include:
    • At least 1000 or so subscribers (counting by organization).
    • Over 500 (additional?) customers for remote backup.
    • 30 of the Fortune 100.

And finally, MongoDB did something many companies should, which is to aggregate user success stories whose full details they may not be allowed to publish. Tidbits include:

  • Over 100 organizations run clusters with more than 100 nodes. Some clusters exceed 1,000 nodes.
  • Many clusters deliver hundreds of thousands of operations per second (combined read and write).
  • MongoDB clusters routinely store hundreds of terabytes, and some store multiple petabytes of data. Over 150 clusters exceed 1 billion documents in size. Many manage more than 100 billion documents.