Skip navigation.

Curt Monash

Syndicate content
Choices in data management and analysis
Updated: 16 hours 32 min ago

An idealized log management and analysis system — from whom?

Sun, 2014-09-07 06:38

I’ve talked with many companies recently that believe they are:

  • Focused on building a great data management and analytic stack for log management …
  • … unlike all the other companies that might be saying the same thing :)
  • … and certainly unlike expensive, poorly-scalable Splunk …
  • … and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.

At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.

Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.

A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottoms-up approach might start with: 

  • Agents for any kind of machine that admits streams of data.
  • Parsers that:
    • Immediately identify explicit name-value pairs in popular formats such as JSON or XML.
    • Also immediately extract a significant fraction of all implicit fields in text strings — timestamps for sure, but also a lot else. (Splunk is the current gold standard for such capabilities.)
    • Allow you to easily write rules for more such extractions.
  • Immediate indexing in line with everything the parsers do.
  • Easy import of log files, relational tables, and other relevant data structures.
  • Queries that can exploit all the indexes, at least up to the functionality level of SQL 2003 analytics (including windowing) and StreamSQL, of course with …
  • … blazing scalable performance.
  • Strong workload management and concurrent performance support. (Teradata is the gold standard for such capabilities in the analytic sphere.)
  • Various other mature-DBMS features, e.g. in backup, manageability, and uptime.

Further, there would be numerous styles of business intelligence interface, at least including:

  • Generic BI like we generally see for tabular data.
  • Constantly-changing displays of streaming data.
  • BI with an event-series orientation.
  • Strong alerting.
  • Mobile versions of everything.

And there would be good support for quick-turnaround, easily-operationalized predictive analytics, of the sort that’s fairly central to the visions for Kiji and Spark.

The data management part of that is particularly hard, in that:

  • Different architectures seem naturally well-suited for different parts of the problem.
  • Maturing a new data management product is always difficult, costly and slow.

My thoughts on strengths and weaknesses of some obvious log data management contenders start:

  • Oracle, IBM, and Microsoft have a lot of heft in all things database. But while each of those vendors has great resources and occasionally impressive pieces of new database engineering, none shows much evidence of framing, let alone solving, the problem in the right way(s).
  • SAP owns Sybase, HANA, several old CEP companies, and Business Objects. Add them to the Oracle/IBM/Microsoft list.
  • Teradata has a lot going for them. Their core analytic data management strengths are obvious. They’ve owned Aster for a while, and Aster innovated nPath quite some time ago. They recently added Hadapt, a leader in schema-on-need, as well as Revelytix, which has some good ideas in dataset management. Like most other DBMS vendors, however, Teradata doesn’t yet have much of a story for streaming data, and anyhow the most optimistic case for Teradata involves the difficult task of stitching together disparate data management technologies.
  • HP Vertica has a decent position as well. Probably more proven in general concurrent, scalable performance than others in their peer group (Netezza, Greenplum, et al.), Vertica also was relatively early in innovations relevant to log analysis, including a range of time series/event series features and its own schema-on-need effort. Vertica was also founded by people who were also streaming pioneers (there were heavily overlapping groups of academics behind StreamBase, Vertica and VoltDB), but it’s not clear how that background is reflected in present Vertica product.
  • Splunk, of course, has a complete stack. At the data acquisition and parsing layers, it’s second to none, and it has a considerable set of log-appropriate BI capabilities as well. And for data management it in effect is stitching together two different inverted-list data stores, plus Hadoop.
  • Hadoop distribution vendors such as Cloudera, MapR or Hortonworks offer typically bundle a range of relevant capabilities. HDFS (Hadoop Distributed File System) is the default place to dump entire logs. In most distros, Spark offers a new approach to streaming. Impala, Drill and so on offer query. Flume gathers the log data in the first place. But a lot of the cooler capabilities are immature or unproven, and in some cases that’s putting it mildly.

In the interest of length, I’ll omit discussion of smaller vendors, except to say that Platfora’s integrated-stack event series analytics story deserves attention, and I’m disappointed that I never hear about Sumo Logic. And I don’t know a lot about companies positioned as SIEM (Security Information and Event Management), especially now that SenSage has left the scene.