Some technical background about Splunk
In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):
Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.
It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.
I also wrote:
The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.
I further wrote:
Splunk has a simple ILM (Information Lifecycle management) story based on time. I didn’t probe for details.
Splunk’s ILM story turns out to be simple indeed.
- As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
- Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
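The two bullets above can be sketched in toy form. Everything here is my own illustration of the hot/warm bucket idea, not Splunk internals; bucket capacity, class names, and the event format are all invented.

```python
# Toy sketch of time-bucketed ILM: events stream into a mutable "hot"
# bucket; a full bucket rolls to immutable "warm" status; a query unions
# results from every bucket whose time range overlaps the query's range.

BUCKET_CAPACITY = 3  # invented; real bucket sizing is by bytes and time

class Bucket:
    def __init__(self):
        self.events = []          # (timestamp, payload) pairs
        self.state = "hot"        # becomes "warm" (immutable) when full

    def add(self, ts, payload):
        assert self.state == "hot", "warm buckets are immutable"
        self.events.append((ts, payload))

    @property
    def full(self):
        return len(self.events) >= BUCKET_CAPACITY

    def time_range(self):
        stamps = [ts for ts, _ in self.events]
        return (min(stamps), max(stamps)) if stamps else (None, None)

class Store:
    def __init__(self):
        self.buckets = [Bucket()]

    def ingest(self, ts, payload):
        if self.buckets[-1].full:
            self.buckets[-1].state = "warm"   # roll: hot -> warm
            self.buckets.append(Bucket())     # open a new hot bucket
        self.buckets[-1].add(ts, payload)

    def query(self, start, end):
        # Interrogate only buckets whose range overlaps, then union results.
        hits = []
        for b in self.buckets:
            lo, hi = b.time_range()
            if lo is not None and lo <= end and hi >= start:
                hits.extend(p for ts, p in b.events if start <= ts <= end)
        return hits

store = Store()
for ts in range(1, 8):                 # seven events, timestamps 1..7
    store.ingest(ts, f"event-{ts}")

print(store.query(3, 5))               # spans the warm/hot boundary
```

Note how the query never cares which buckets are hot or warm; the time-range overlap test alone decides which buckets get interrogated.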
Finally, I wrote:
I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.
I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
The point of what I in October, 2013 called
a high(er)-performance data store into which you can selectively copy columns of data
and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.
Inverted list technology is confusing for several reasons, which start:
- It has two names that — rightly or wrongly — are used fairly interchangeably: inverted index and inverted list.
- Inverted indexes have played different roles at different times.
What’s more, inverted list technology can take several different forms.
- In the simplest case, for each of many keywords, the inverted index lists the documents that contain it. Splunk does a form of this, where the “keyword” is the field — i.e. name — in a (field, value) pair.
- Another option is to store, for each keyword or name, not just document_IDs, but additional information.
- In the case of (field, value) pairs, the value can be stored. Splunk sometimes does that too.
- In the case of text documents, the index can store the position(s) in the document that the word occurs. This is irrelevant to Splunk.
- When you list all the records that have a certain field in them, and the list mentions the values, you’re getting pretty close to having a column-group NoSQL DBMS (e.g. Cassandra or HBase). Indeed, you might even be on your way to a columnar RDBMS; after all, SAP HANA grew out of a text indexing system.
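The variants above are easy to mock up. This is my own illustration, with invented documents and field names; the point is only to show the difference between an index that tracks field presence and one that also stores values.

```python
# Variant 1 indexes only which documents contain a field; variant 2 stores
# (doc_id, value) postings -- which is what starts to resemble a column store.

docs = {
    1: {"source": "web01", "status": "200"},
    2: {"source": "web02", "status": "500"},
    3: {"source": "web01"},                    # no "status" field
}

# Variant 1: field -> set of document IDs (presence only).
field_index = {}
for doc_id, fields in docs.items():
    for field in fields:
        field_index.setdefault(field, set()).add(doc_id)

# Variant 2: field -> list of (doc_id, value) postings.
value_index = {}
for doc_id, fields in docs.items():
    for field, value in fields.items():
        value_index.setdefault(field, []).append((doc_id, value))

print(sorted(field_index["status"]))    # docs that merely have the field
print(value_index["status"])            # docs plus the stored values
```

With variant 1, answering "which requests had status 500?" still requires fetching and scanning the documents; with variant 2, the index alone answers it.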
Splunk, HPAS, and inverted indexes
With all that background, we can finally summarize Splunk’s “High Performance Analytic Store” story.
- Splunk’s classic data store is an inverted list system that:
- Tracks (field, value) pairs for a few fields that are always the same, such as Source_System.
- Otherwise tracks fields only.
- Splunk HPAS is an inverted list system that tracks (field, value) pairs for arbitrary fields. This gives much higher performance for queries that SELECT on or GROUP BY those fields.
- As of Splunk 6, Splunk Classic and Splunk HPAS are tightly and almost transparently integrated.
While I haven’t probed for full specifics, I did gather:
- Queries execute against both data stores at once, without any syntax change. At least, they do if you press some button; that’s the “almost” in the transparency.
- HPAS time-slices the data it stores by the same time intervals that Splunk Classic does. Hence for each time range, integrated Splunk can interrogate the HPAS first and, if it can’t answer, go to the slower traditional Splunk store.
- There are two basic ways to populate the HPAS:
- As the data streams in.
- Via the result sets of Splunk queries. Splunk talks as if this is the preferred way, which fits with Splunk’s long-time argument that it’s nice not to have to make any schema choices before you start streaming the data in.
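The HPAS-first routing described above can be sketched per time slice. This is my guess at the shape of the idea, not Splunk internals; the store layouts and field names are invented.

```python
# For each time slice, try the faster HPAS first; fall back to scanning the
# classic store when the HPAS hasn't been populated for that slice/field.

def query_slice(time_slice, hpas, classic, wanted_field):
    fast = hpas.get(time_slice)
    if fast is not None and wanted_field in fast:
        return fast[wanted_field], "hpas"
    # Slow path: scan the raw events in the classic store.
    rows = classic.get(time_slice, [])
    return [r[wanted_field] for r in rows if wanted_field in r], "classic"

classic = {
    "t0": [{"user": "a", "bytes": 10}, {"user": "b", "bytes": 20}],
    "t1": [{"user": "c", "bytes": 30}],
}
hpas = {  # only slice t0 has been materialized into the fast store
    "t0": {"user": ["a", "b"], "bytes": [10, 20]},
}

print(query_slice("t0", hpas, classic, "user"))   # served by the HPAS
print(query_slice("t1", hpas, classic, "user"))   # falls back to classic
```

Because both stores slice by the same time intervals, the fallback decision can be made independently per slice, and the union of per-slice answers is still correct.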
For quite some time, one of the most frequent marketing pitches I’ve heard is “Analytics made easy for everybody!”, where by “quite some time” I mean “over 30 years”. “Uniquely easy analytics” is a claim that I meet with the greatest of skepticism.* Further confusing matters, these claims are usually about what amounts to business intelligence tools, but vendors increasingly say “Our stuff is better than the BI that came before, so we don’t want you to call it ‘BI’ as well.”
*That’s even if your slide deck doesn’t contain a picture of a pyramid of user kinds; if there actually is such a drawing, then the chance that I believe you is effectively nil.
All those caveats notwithstanding, there are indeed at least three forms of widespread analytics:
- Fairly standalone, eas(ier) to use business intelligence tools, sometimes marketed as focusing on “data exploration” or “data discovery”.
- Charts and graphs integrated or at least well-embedded into production applications. This technology is on a long-term rise. But in some sense, integrated reporting has been around since the invention of accounting.
- Predictive analytics built into automated systems, for example ad selection. This is not what is usually meant by the “easy analytics” claim, and I’ll say no more about it in this post.
It would be nice to say that the first two bullet points represent a fairly clean operational/investigative BI split, but that would be wrong; human real-time dashboards can at once be standalone and operational.
Often, the message “Our BI is easy to use by everybody, unlike every other BI offering in the past 40 years” is unsupported by facts; vendors just offer me-too BI technology and falsely claim it’s something special. But sometimes there is actual substance, usually in one or more aspects of time-to-answer. For example:
- Sometimes the BI itself has a particularly good interface for navigation.
- I think it’s still possible to be differentiated in mobile BI delivery.
- It’s definitely still possible to be differentiated in real-time/streaming BI interfaces.
- Sometimes the visible BI is just part of a specialized stack, whose other elements make it much easier to set up working UI than in the traditional model.
- Some claims along these lines are bogus, drawing false comparisons to worst-case scenarios in which enterprises take a year or two setting up their first-ever data warehouse.
- Some of these claims, however, are more legitimate, at least to the extent that the stack includes leading-edge smart data integration, schema-on-need data management, and so on.
One item I’m leaving off the list is the capability to easily design charts, graphs or whole dashboards. When BI vendors add that functionality, they often present it as an industry innovation; but it’s been years since I saw something in that vein beyond the me-too.
A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.
“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:
- Data about data structure. This is the classical sense of the term. But please note:
- In a relational database, structural metadata is rather separate from the data itself.
- In a document database, each document might carry structure information with it.
- Other inputs to core data management functions. Two major examples are:
- Column statistics that inform RDBMS optimizers.
- Value ranges that inform partition pruning or, more generally, data skipping.
- Inputs to ancillary data management functions — for example, security privileges.
- Support for human decisions about data — for example, information about authorship or lineage.
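The "value ranges that inform data skipping" bullet is worth a toy example. Per-partition min/max statistics (often called zone maps) let a scan skip partitions that provably contain no matching rows; the data and function names here are mine.

```python
# Zone-map-style data skipping: consult per-partition min/max metadata
# before touching the partition's rows.

partitions = [
    {"rows": [3, 7, 9]},
    {"rows": [12, 15, 18]},
    {"rows": [21, 25, 30]},
]
# The metadata: min/max per partition, computed once at load time.
for p in partitions:
    p["min"], p["max"] = min(p["rows"]), max(p["rows"])

def scan_greater_than(parts, threshold):
    scanned, hits = 0, []
    for p in parts:
        if p["max"] <= threshold:     # metadata proves no row can match
            continue
        scanned += 1                  # only now do we touch actual data
        hits.extend(r for r in p["rows"] if r > threshold)
    return hits, scanned

hits, scanned = scan_greater_than(partitions, 19)
print(hits, scanned)   # only the last partition is touched
```

The metadata is tiny relative to the data, which is why this kind of skipping is such a cheap win for range-restricted queries.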
What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.
And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:
- Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
- Some document databases store structural metadata right with the document data itself.
- Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
- Actual text documents carry the structure imposed by grammar and syntax.
Memory-centric data management is confusing. And so I’m going to clarify a couple of things about MemSQL 3.0 even though I don’t yet have a lot of details.* They are:
- MemSQL has historically been an in-memory row store, which as of last year scales out.
- It turns out that the MemSQL row store actually has two table types. One is scaled out. The other — called “reference” — is replicated on every node.
- MemSQL has now added a third table type, which is columnar and which resides in flash memory.
- If you want to keep data in, for example, both the scale-out row store and the column store, you’d have to copy/replicate it within MemSQL. And if you wanted to access data from both versions at once (e.g. because different copies cover different time periods), you’d likely have to do a UNION or something like that.
*MemSQL’s first columnar offering sounds pretty basic; for example, there’s no columnar compression yet. (Edit: Oops, that’s not accurate. See comment below.) But at least they actually have one, which puts them ahead of many other row-based RDBMS vendors that come to mind.
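The copy-and-union pattern in the bullets above looks something like this in toy form. This is my construction, not MemSQL syntax: recent data in a row-oriented copy, older data in a columnar copy, and a time-spanning query unioning the two.

```python
# Recent data lives row-wise; older data lives column-wise; a query whose
# time range spans both copies must read each and union the results.

row_store = [  # recent rows, one dict per row
    {"day": 9, "amount": 50},
    {"day": 10, "amount": 70},
]
column_store = {  # older data, one list per column
    "day": [1, 2, 3],
    "amount": [10, 20, 30],
}

def query_amounts(since_day):
    # Read the columnar copy column-wise, the row copy row-wise, then union.
    old = [a for d, a in zip(column_store["day"], column_store["amount"])
           if d >= since_day]
    recent = [r["amount"] for r in row_store if r["day"] >= since_day]
    return old + recent

print(query_amounts(2))   # spans both copies
```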
And to hammer home the contrast:
- IBM, Oracle and Microsoft, which all sell row-based DBMS meant to run on disk or other persistent storage, have added or will add columnar options that run in RAM.
- MemSQL, which sells a row-based DBMS that runs in RAM, has added a columnar option that runs in persistent solid-state storage.
Ever more products try to integrate SQL with Hadoop, and discussions of them seem confused, in line with Monash’s First Law of Commercial Semantics. So let’s draw some distinctions, starting with (and these overlap):
- Are the SQL engine and Hadoop:
- Necessarily on the same cluster?
- Necessarily or at least most naturally on different clusters?
- How, if at all, is Hadoop invoked by the SQL engine? Specifically, what is the role of:
- HDFS (Hadoop Distributed File System)?
- Hadoop MapReduce?
- How, if at all, is the SQL engine invoked by Hadoop?
- If something is called a “connector”, then Hadoop and the SQL engine are most likely on separate clusters. Good features include (but these can partially contradict each other):
- A way of making data transfer maximally parallel.
- Query planning that is smart about when to process on the SQL engine and when to use Hadoop’s native SQL (Hive or otherwise).
- If something is called “SQL-on-Hadoop”, then Hadoop and the SQL engine are or should be on the same cluster, using the same nodes to store and process data. But while that’s a necessary condition, I’d prefer that it not be sufficient.
Let’s go to some examples.
Hive is the best-known example of SQL/Hadoop integration. Hive executes a somewhat low-grade dialect of SQL — HQL (Hive Query Language) — via very standard Hadoop: Hadoop MapReduce, all HDFS file formats, etc. HCatalog is an enhancement/replacement for the Hive metadata store. HQL is just another language that can be used to write (parts of) Hadoop jobs.
Impala is Cloudera’s replacement for Hive. Impala is and/or is planned to be much like Hive, but much better, for example in performance and in SQL functionality. Impala has its own custom execution engine, including a daemon on every Hadoop data node, and seems to run against a variety of but not all HDFS file formats.
Stinger is Hortonworks’ (and presumably also Apache’s) answer to Impala, but is more of a Hive upgrade than an outright replacement. In particular, Stinger’s answer to the new Impala engine is a port of Hive to the new engine Tez.
Teradata SQL-H is an RDBMS-Hadoop connector that uses HCatalog, and plans queries across the two clusters. Microsoft Polybase is like SQL-H, but it seems more willing than Teradata or Teradata Aster to (optionally) coexist on the same nodes as Hadoop.
Hadapt runs on the Hadoop cluster, putting PostgreSQL* and other software on each Hadoop data node. It has two query engines, one that invokes Hadoop MapReduce (the original one, still best for longer-running queries) and one that doesn’t (more analogous to Impala). When last I looked, Hadapt didn’t query or update against the HDFS API, but there was an interesting future in preloading data from HDFS into Hadapt PostgreSQL tables, and I think that Hadapt’s PostgreSQL tables are technically HDFS files. I don’t think Hadapt makes much use of HCatalog.
*Hacked to allow Hadapt to offer more than just SQL/Hadoop integration.
Splice Machine is a new entrant (public beta is imminent) that has put Apache Derby over an HBase back end. (Apache Derby is the former Cloudscape, an embeddable Java RDBMS that was acquired by Informix and hence later by IBM.) Splice Machine runs on your Hadoop nodes as an HBase coprocessor. Its relationship to non-HBase parts of Hadoop is arm’s-length. I wish this weren’t called “SQL-on-Hadoop”.
From time to time I like to do “what I’m working on” posts. From my recent blogging, you probably already know that includes:
- Hadoop (always, and please see below).
- Analytic RDBMS (ditto).
- NoSQL and NewSQL.
- Specifically, SQL-on-Hadoop.
- Spark and other memory-centric technology, including streaming.
- Public policy, mainly but not only in the area of surveillance/privacy.
- General strategic advice for all sizes of tech company.
Other stuff on my mind includes but is not limited to:
1. Certain categories of buying organizations are inherently leading-edge.
- Internet companies have adopted Hadoop, NoSQL, NewSQL and all that en masse. Often, they won’t even look at things that are conventional or expensive.
- US telecom companies have been buying 1 each of every DBMS on the market since pre-relational days.
- Financial services firms — specifically algorithmic traders and broker-dealers — have been in their own technical world for decades …
- … as have national-security agencies …
- … as have pharmaceutical research departments.
Fine. But what really intrigues me is when more ordinary enterprises also put leading-edge technologies into production. I pester everybody for examples of that.
2. In particular, I hope to figure out where Hadoop is or soon will be getting major adoption.
- Widespread Hadoop adoption at ordinary large enterprises is, I think, inevitable and imminent. But it hasn’t quite happened yet.
- I think that part of the “enterprise data hub” story is a great bet to come true — Hadoop is becoming a key destination for data to land and be transformed. MapReduce was invented for data transformation; Hadoop was invented to do MapReduce; data transformation workloads have already been moving from expensive analytic RDBMS to cheaper Hadoop.
- I also think Hadoop — enhanced with Spark or whatever — will win as a platform for sophisticated predictive modeling; Hadoop’s (and Spark’s) flexibility is at least as useful for the purpose as RDBMS’ SQL execution speed.
- I’m still skeptical about ordinary enterprises’ adoption of Hadoop as a business intelligence platform, but it’s definitely another area to track.
3. Analytic RDBMS and data warehouse appliance pricing is always a big deal. Hadoop’s great price advantage doesn’t have to be permanent, and in fact there are a number of fairly low-cost RDBMS offerings, such as petascale Vertica, the Teradata 1000 series, or Infobright.
Speaking of that, it turns out Teradata now publishes per-terabyte pricing. Please note that those are uncompressed prices; actual prices can be assumed to be lower, at least for databases that compress well.
Analytic RDBMS prices are still shaking out.
4. As I previously noted, ensemble models have become the norm for machine learning. I want to learn more about the implications of that.
One conjecture — everything we learned in school about statistics is wrong, or at least it’s less important than we thought. Predictive modeling is not mainly about least squares, regressions, curve-fitting, etc. Rather, it’s first and foremost about data segmentation and clustering, with all the curve-fitting stuff being secondary.
Besides fitting — as it were — what I hear, this hypothesis also matches common sense. How do businesses use predictive modeling? For each customer/prospect/site-visitor/whatever, they decide which of a limited number of possible actions to take. At its core, that’s an exercise in segmentation.
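The segmentation view reduces to something like this toy: assign each customer (here, one number for "annual spend") to the nearest of a few segment centers, then attach one action per segment. The numbers, segment names, and actions are all invented for illustration.

```python
# Segmentation-then-action: the model's real output is a segment label,
# and each segment maps to one of a limited number of possible actions.

centers = {"low": 100, "mid": 500, "high": 2000}
actions = {"low": "send coupon", "mid": "cross-sell", "high": "assign rep"}

def segment(spend):
    # Nearest-center assignment -- the core move in clustering-style models.
    return min(centers, key=lambda name: abs(centers[name] - spend))

customers = {"alice": 80, "bob": 650, "carol": 1800}
plan = {name: actions[segment(spend)] for name, spend in customers.items()}
print(plan)
```

Curve-fitting may still be used to place the segment centers, but the business-facing output is the discrete segment-to-action mapping, which is the point of the conjecture above.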
5. I think data integration is getting a lot smarter than it was. Hadoop-based transformation is the obvious example. But there’s also ClearStory’s data intelligence pitch. (And yes, I know I need to talk with Paxata. There’s been a lot of ball-dropping on that one, including by me.)
6. There’s a meta-theme in the above — stuff that’s not exactly a DBMS or DBMS-like data store. Streaming fits into that. So does smart data integration. So, arguably, does Spark. So do data grids, another of those topics I’d like to know more about but haven’t nailed down yet.
Data management is getting ever more complex.
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s view of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.
Except for certain low-latency operations,* anything you can do in Spark can also be done in straight Hadoop; Spark simply can have advantages in performance and programming ease. Spark RDDs (Resilient Distributed Datasets) are immutable at this time, so Spark is not suited for short-request update workloads.
*A new Spark task requires a thread, not a whole Java Virtual Machine.
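The immutable-RDD idea mentioned above can be caricatured in a few lines. This is my illustration of the concept, not the Spark API; the class and method names merely echo Spark's vocabulary.

```python
# A tiny stand-in for the RDD model: datasets are immutable, and
# transformations return new datasets rather than mutating in place.

class MiniRDD:
    def __init__(self, data):
        self._data = tuple(data)       # tuple: the dataset is immutable

    def map(self, fn):
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    def collect(self):
        return list(self._data)

base = MiniRDD([1, 2, 3, 4, 5])
derived = base.map(lambda x: x * 10).filter(lambda x: x > 20)

print(derived.collect())   # the transformed dataset
print(base.collect())      # the original is untouched
```

Immutability is what makes RDD lineage-based fault tolerance possible (a lost partition can be recomputed from its parents), and it is also exactly why short-request updates don't fit the model.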
Everybody agrees that machine learning is a top Spark use case. In particular:
- Cloudera sees machine learning as the major area of Spark adoption to date.
- Ion gave me the impression machine learning is one of the major areas of Spark adoption to date.
- Mike gave me the impression that machine learning was a core intended use case for Spark the first time we talked about it.
- There’s a machine learning library for Spark, and also a way to use Spark to do distributed R.
I believe data transformation is a major Spark use case as well.
- Ion gave me that impression, although Cloudera surprisingly did not. Edit: Actually, see Matt Brandwine’s comment below.
- I have one client (ClearStory) using Spark that way, and a second that’s likely to.
- It makes sense that the #1 Hadoop use case (to date), which is something Spark also is well-suited for, would be an important early Spark use case as well.
Spark Streaming is fairly new, but is already getting some adoption. Notes on that start:
- The actual technology is a form of micro-batching. I plan to learn more about that in the future.
- Cloudera sees streaming as one of the two big Spark use cases, and praises Spark Streaming for its fault tolerance and its great ease of coding.
- Mike Franklin knows a lot about streaming.
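The micro-batching mentioned in the first bullet above, in toy form: instead of handling each event individually, the stream is chopped into small fixed-size batches and each batch is processed as one unit. That is roughly how Spark Streaming's model is usually described; the details here are illustrative, not its actual implementation (which batches by time interval, not count).

```python
# Micro-batching: group a stream into small batches, then run ordinary
# batch processing (here, a sum) over each batch.

def micro_batches(stream, batch_size):
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                  # flush the final partial batch
        yield batch

events = [5, 1, 4, 2, 8, 3, 7]
per_batch_sums = [sum(b) for b in micro_batches(events, 3)]
print(per_batch_sums)
```

The appeal is that each batch can reuse the same execution and fault-tolerance machinery as any other batch job, at the cost of per-batch latency.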
Part of that story is a sudden decline in the reputation of Storm, whose troubles seem to include:
- Project founder and Twitter employee Nathan Marz no longer seems to be associated with Storm or employed at Twitter.
- I am told that in general the Storm community is not all that vibrant.
- Various aspects of Storm’s technology are disappointing people.
Other notes on Spark use cases include:
- Impala-loving Cloudera doesn’t plan to support Shark. Duh.
- Cloudera also won’t at first support any Spark predictive modeling add-on.
- Ion’s other company, Conviva, is doing some real-time decisioning in Spark.
Spark data management has been enhanced by a project called Tachyon.* The main point of Tachyon is that Spark RDDs (Resilient Distributed Datasets) now persist in memory beyond the life of a job; besides offering the RDDs to other Spark jobs, Tachyon also opens them to Hadoop via an HDFS emulator.
*If there’s ever a Spark/Tachyon management suite, I hope some aspect is named Cherenkov — i.e., the radiation that is measured to detect the passage of tachyons. :)
And finally, some metrics and so on:
- Databricks has between 10 and 20 employees.
- Spark has >100 individual contributors from >25 different companies.
- There was a Spark Summit with >450 attendees (from >180 organizations), and an earlier Spark-mainly conference with >200 attendees.
- The Spark meet-up group in San Francisco has >1500 members signed up.
- Various Spark users and subprojects are identified on the Apache Spark pages.
- Most of the current substance on Databricks’ website is in its blog.
1. Censorship worries me, a lot. A classic example is Vietnam, which basically has outlawed online political discussion.
And such laws can have teeth. It’s hard to conceal your internet usage from an inquisitive government.
2. Software and software-related patents are back in the news. Google, which said it was paying $5.5 billion or so for a bunch of Motorola patents, turns out to really have paid $7 billion or more. Twitter and IBM did a patent deal as well. Big numbers, and good for certain shareholders. But this all benefits the wider world — how?
The purpose of legal intellectual property protections, simply put, is to help make it a good decision to create something. …
Why does “securing … exclusive Right[s]” to the creators of things that are patented, copyrighted, or trademarked help make it a good decision for them to create stuff? Because it averts competition from copiers, thus making the creator a monopolist in what s/he has created, allowing her to at least somewhat value-price her creation.
I.e., the core point of intellectual property rights is to prevent copying-based competition. By way of contrast, any other kind of intellectual property “right” should be viewed with great suspicion.
That Constitutionally-based principle makes as much sense to me now as it did then. By way of contrast, “Let’s give more intellectual property rights to big corporations to protect middle-managers’ jobs” is — well, it’s an argument I view with great suspicion.
But I find it extremely hard to think of a technology industry example in which development was stimulated by the possibility of patent protection. Yes, the situation may be different in pharmaceuticals, or for gadgeteering home inventors, but I can think of no case in which technology has been better, or faster to come to market, because of the possibility of a patent-law monopoly. So if software and business-method patents were abolished entirely – even the ones that I think could be realistically adjudicated — I’d be pleased.
3. In November, 2008 I offered IT policy suggestions for the incoming Obama Administration, especially:
- Pick the right Chief Technology Officer.
- Fix the government technology contracting process in general.
- Fix the air traffic control system in particular.
- Generally take a businesslike approach to government IT. Obama’s focus on making government “transparent” and searchable would be just one byproduct of that effort.
- Continue to beef up internal search and knowledge management (remember the FBI agent who guessed the 9/11 plans, but couldn’t communicate his ideas to anybody who cared).
- Write privacy laws of the sort that will, for example, allow electronic health records to be adopted without great fear of misuse. (I have some strong opinions as to what form those laws should take.)
- Drastically beef up math education!! (Science too, but math is especially important.) This takes leadership to convince people it’s CRUCIAL to be numerate, perhaps even more than it takes specific policy initiatives. Little else is as important.
… we need an experienced technology implementation leader to:
- Recommend major changes in government IT contracting. Right now, information technology is bought at the wrong level of granularity, too coarse and too fine at once. Private sector CIOs make broad technology architecture decisions, then make incremental purchases as needed. Public sector IT managers, however, are generally compelled to make purchases on a “project” basis, which allows neither the sanity of broad-scale planning nor the economies and adaptability of just-in-time acquisition.
- Establish best practices in a broad range of IT areas. Obama’s “transparency” initiative involves pushing the state of the art in public-facing technology for search, query, and audio/video, at a minimum. Other areas of major technical challenge include internal search, knowledge management, and social networking; disaster robustness; planning in the face of political budgeting uncertainty; numbers-based management without the benefit of a profit/loss statement … and the list could easily be twice as long.
- Interact with the private sector. From electronic health records to the general supply chain, there are huge opportunities for public/private interoperability, quite apart from the obvious customer/vendor relationships the government has with the IT industry.
- Improve training, recruiting, and retention. Anywhere government needs employees whose skills are also in high demand in the private sector, government pay scales cause difficulties. IT is a top area for that problem. Outstanding leadership is needed to overcome it.
Little of that actually happened.
Kudos if you noticed the link — which I herewith repeat — to what I wrote about privacy in 2006.
In particular — and even after the HealthCare.gov fiasco — I think few voters or legislators understand how incredibly broken government IT contracting is. Almost all major projects go through a five-stage process:
Re-competes usually follow as well.
And so government IT is subject to extreme forms of two inevitable project killers:
- Waterfall methodology.
Procurement cycles take years, and in the worst cases decades. Project specifications are often fixed until the next procurement, which is often 7-10 years down the road. This, to put it mildly, is the opposite of agility, and widespread project failure ensues.
In response to the uproar created by the Edward Snowden revelations, the White House commissioned five dignitaries to produce a 300-page report, released last December 12. (Official name: Report and Recommendations of The President’s Review Group on Intelligence and Communications Technologies.) I read or skimmed a large minority of it, and I found enough substance to be worthy of a blog post.
Many of the report’s details fall in the buckets of bureaucratic administrivia,* internal information security, or general pabulum. But the commission started with four general principles that I think have great merit.
*One big item — restrict the NSA to foreign intelligence, and split off domestic cyber defense into a separate organization.
The United States Government must protect, at once, two different forms of security: national security and personal privacy.
… It might seem puzzling, or a coincidence of language, that the word “security” embodies such different values. But the etymology of the word solves the puzzle; there is no coincidence here. In Latin, the word “securus” offers the core meanings, which include “free from care, quiet, easy,” and also “tranquil; free from danger, safe.”
Key point: The report rejects any idea that national security concerns should run roughshod over individual liberty.
The central task is one of risk management; multiple risks are involved, and all of them must be considered. …
- Risks to privacy;
- Risks to freedom and civil liberties, on the Internet and elsewhere;
- Risks to our relationships with other nations; and
- Risks to trade and commerce, including international commerce.
… If people are fearful that their conversations are being monitored, expressions of doubt about or opposition to current policies and leaders may be chilled, and the democratic process itself may be compromised.
… These points make it abundantly clear that if officials can acquire information, it does not follow that they should do so.
I am always pleased when policy makers recognize that the key issue is chilling effects upon the exercise of ordinary freedoms; the report made that point multiple times, footnoting both Sonia Sotomayor and the 1970s Church Commission. (Search the document for chill to see where.)
The idea of “balancing” has an important element of truth, but it is also inadequate and misleading.
… The purposes of surveillance must be legitimate. If they are not, no amount of “balancing” can justify surveillance. For this reason, it is exceptionally important to create explicit prohibitions and safeguards, designed to reduce the risk that surveillance will ever be undertaken for illegitimate ends.
Exceptionally important indeed.
The government should base its decisions on a careful analysis of consequences, including both benefits and costs (to the extent feasible).
Government officials, even more than other large-organization employees, have the tendency to avoid job failure at all costs. This goes triple when they work on life-and-death issues. Even so, sometimes security can be pursued with too much vigor, and much of the United States’ post-9/11 history directly bears that out.
And here’s the part I like best of all (emphasis mine):
We recommend that, if the government legally intercepts a communication under section 702 … and if the communication either includes a United States person as a participant or reveals information about a United States person:
(1) any information about that United States person should be purged upon detection unless it either has foreign intelligence value or is necessary to prevent serious harm to others;
(2) any information about the United States person may not be used in evidence in any proceeding against that United States person;
I’ve felt for years that a deciding issue in the preservation of liberty will be what kinds of information are admissible in court, or otherwise may be used to hurt people. All safeguards on data collection and retention notwithstanding, huge datasets will be created and maintained. Continued liberty requires careful limitation of how they may be used against us.
- Why privacy laws should be based on data use more than on data possession (August, 2013)
- A brief history of privacy theory (July, 2013)
- The Obama Administration’s reasonable but inadequate consumer privacy proposals (February, 2012)
- The essential privacy questions our lawmakers must address (July, 2010)
Thanks to a court decision that overturned some existing regulations, network neutrality is back in the news. Most people think the key issue is whether
- Telecommunication companies (e.g. wireless and/or broadband services providers) should be allowed to charge …
- … other internet companies (website owners, game companies, streaming media providers, etc., collectively known as edge providers) for …
- … shipping data to internet service consumers in particularly attractive ways.
But I think some forms of charging can be OK — albeit not the ones currently being discussed — and so the question should instead be how the charges are designed.
When I wrote about network neutrality in 2006-7, the issue was mainly whether broadband providers would be allowed to ship different kinds of data at different speeds or reliability. Now the big controversy is whether mobile data providers should be allowed to accept “sponsorship” so as to have certain kinds of data not count against mobile data plan volume caps. Either way:
- The “anything goes” strategy has obvious free-market appeal.
- But proponents of network neutrality regulation — such as Fred Wilson and Nilay Patel — point out a major risk: By striking deals that smaller companies can’t imitate, large, established “edge provider” services may strangle upstart competitors in their cribs.
I think the anti-discrimination argument for network neutrality has much merit. But I also think there are some kinds of payment structure that could leave the playing field fairly level. Imagine, if you will, that:
- Consumers are charged for data, speed of connection, reliability of delivery, or anything else, but …
- … internet companies have the ability to absorb those charges on consumers’ behalf, but can only do so …
- … one interaction at a time, with no volume discounts, via an automated system that is open to everybody.
Such a system is surely technologically feasible — indeed, it is at least as feasible as the online advertising networks that already exist. Further, it would be possible for the system to have nice features such as:
- Telcos could implement forms of peak load pricing, for those times when their network capacity actually is under stress.
- “Edge provider” internet companies could pay subsidies only on behalf of certain consumers, where those consumers are selected in all the complex ways that advertisements are currently targeted.
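To make the "one interaction at a time, no volume discounts" rule concrete, here is a hypothetical sketch of how a per-interaction settlement might work. All names (`PostedTariff`, `settle`) and figures are invented for illustration; nothing like this system actually exists today.

```python
from dataclasses import dataclass

# Hypothetical sketch of per-interaction subsidy settlement.
# One public posted price, no volume discounts; a sponsor may absorb
# the consumer's charge, but only at that same posted price.

@dataclass(frozen=True)
class PostedTariff:
    """One public price per megabyte -- open to everybody."""
    cents_per_mb: int

def settle(tariff: PostedTariff, mb_delivered: int, sponsor_pays: bool):
    """Settle a single interaction; returns (consumer_owes, sponsor_owes)."""
    charge = tariff.cents_per_mb * mb_delivered
    consumer_owes = 0 if sponsor_pays else charge
    sponsor_owes = charge if sponsor_pays else 0
    return consumer_owes, sponsor_owes

tariff = PostedTariff(cents_per_mb=2)
print(settle(tariff, 50, sponsor_pays=True))   # sponsor absorbs the 100 cents
```

The design point is that the price is a single public constant, so a large edge provider cannot negotiate a rate a startup couldn't also get.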
In such a setup, which discrimination fears would or would not be realized?
- Startups that hope to get adoption first and monetize second might face the cash cost of actually paying their users to try their services. Sorry. But at least they could target their spending on whomever they view as the most important potential adopters.
- Large vendors could not negotiate preferential pricing, reciprocal deals, or anything like that. At least, they couldn’t do so directly.
- Discrimination by type of service – for example telcos trying to hamstring communications services that compete with their own offerings – could be staved off, via fairly lightweight regulatory oversight of the ways pricing plans are structured.
- Regulators could head off sneaky “sweetheart deals” between big “edge provider” companies and telcos in much the same way.
I have no great objections to extreme net neutrality; behemoth oligopolist telcos should be among the last companies to cry “Un-free markets, boo-hoo-sob!!” But as internet pipes are increasingly used for telephony, streaming media or even medical consultations, drawing quality-of-service distinctions could have a certain merit. And so, for reasons similar to those I outlined in 2007, I still lean toward the partial network neutrality described above.
IBM excels at game technology, most famously in Deep Blue (chess) and Watson (Jeopardy!). But except at the chip level — PowerPC — IBM hasn’t accomplished much at game/real world crossover. And so I suspect the Watson hype is far overblown.
I believe that for two main reasons. First, whenever IBM talks about big initiatives like Watson, it winds up bundling a bunch of dissimilar things together and claiming they’re a seamless whole. Second, some core Watson claims are eerily similar to artificial intelligence (AI) over-hype from three or more decades past. For example, the leukemia treatment advisor now being built on Watson sounds a lot like MYCIN from the early 1970s, and the idea of collecting a lot of tidbits of information sounds a lot like the Cyc project. And by the way:
- MYCIN led to E-MYCIN, which led to the company Teknowledge, which raised a lot of money* but now has almost faded from memory.
- Cyc is connected to the computer science community’s standard unit of bogosity.
*Much of it, I’m ashamed to say, with my help, back in my stock analyst days.
AI is something of an umbrella category, often just meaning “Computerized stuff that we don’t know how to do yet”, or “… only recently figured out how to do.” Automated decision-making is an aspect of AI, for example, but so also is natural language recognition. It used to be believed that most AI should be approached in the same way:
- Come up with a clever way to represent knowledge.
- Match the actual situation against the knowledge.
- Produce a smart result.
But that template unfortunately proved disappointing time after time. The problem was typically that not enough knowledge could in practice be represented, and thus well-informed automated decisions could not be made. In particular, there was a “first step fallacy,” in which a demo system would solve a “toy problem”, but robust real-life systems never emerged.
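The three-step template can be sketched as a toy forward-chaining rule system. The rules and facts below are invented for illustration (loosely MYCIN-flavored), and the whole point of the sketch is how little knowledge it actually encodes:

```python
# Toy forward-chaining sketch of the classic AI template:
# represent knowledge as rules, match facts against them, produce a result.
# Rules and facts are invented for illustration.

rules = [
    ({"fever", "gram_negative_rod"}, "suspect_bacteremia"),   # MYCIN-style rule
    ({"suspect_bacteremia"}, "recommend_culture"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose conditions are all satisfied."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)   # fire the rule
                changed = True
    return facts

derived = forward_chain({"fever", "gram_negative_rod"}, rules)
print("recommend_culture" in derived)   # True
```

A demo like this solves its toy problem instantly; the historical failure mode was that real domains needed orders of magnitude more rules than anyone could practically write down and maintain.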
Of course, there are exceptions to this general rule of disappointment; for example, Teknowledge and its fellow over-hyped expert system technology vendors of the 1980s (Intellicorp, Inference, et al.) did get a few solid production references. But the ones I remember best (e.g. American Express credit, United Airlines seat pricing, some equipment maintenance scheduling) were often for use cases that we’d now address in more straightforwardly mathematical ways.
Watson is generally promoted as helping with decision-making, but that message has to be scrutinized carefully. So far as I’ve been able to guess, the true core technology of IBM Watson is extracting knowledge from text — or primarily from text — and representing it in some way that is reasonably useful in answering natural language queries. The hope would then be to eventually achieve a rich enough knowledge base to support the Star Trek computer. But automated decision-making doesn’t just require knowledge; it also requires decision-making rules. And if Watson is significantly ahead of the 1980s decisioning state of the art (Rete, backward chaining, etc.), I’m not aware of how.
So if Watson is going to accomplish anything soon, it will probably be in areas where serious decision-making chops aren’t needed. Indeed, the application areas that I’ve seen mentioned for the past or near term are mainly:
- Playing Jeopardy! That’s pretty simple from a decision-making standpoint.
- Advising on treatments for a specific disease (not actually built yet). As noted above, that’s 1970s-level decisioning.
- Knowledge extraction from medical research articles. That has very little to do with decisioning, and incidentally sounds a lot like what SPSS (before it was acquired by IBM) and Temis were already doing years ago.
- Natural-language customer interaction. That may not involve any decisioning at all.
Returning to the point that Watson’s core technology is probably natural language, it seems fair to say that IBM these days is probably better at the text mining side than at speech understanding. Evidence I’m thinking of includes:
- That seems to be what IBM itself is saying on its speech recognition page.
- I also recall IBM’s natural language recognition projects being regarded as not going well in the late 1990s. (Project Penelope, I believe, although I can’t confirm that via googling.)
- IBM’s LanguageWare sounded more oriented to text mining in 2008.
- IBM bought SPSS, which had decent text mining technology.
And while this is too old to really count as evidence, IBM had a famously unsuccessful language recognition deal with Artificial Intelligence Corporation way back in 1983-4.*
*Yeah, I helped raise money for AICorp too, and also for Symbolics. As you might imagine, my investment banking trophies do not have pride of place on my desk.
One last observation — text mining has a very mixed track record. Watson will have to go far beyond predecessor text technologies to become nearly the big deal IBM is suggesting it will be.
I first wrote about in-memory data management a decade ago. But I long declined to use that term — because there’s almost always a persistence story outside of RAM — and coined “memory-centric” as an alternative. Then I relented 1 1/2 years ago, and defined in-memory DBMS as
DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory)
By way of contrast:
Hybrid memory-centric DBMS is our term for a DBMS that has two modes:
- Processing a database that fits entirely in RAM.
- Querying and updating (or loading into) persistent storage.
These definitions, while a bit rough, seem to fit most cases. One awkward exception is Aerospike, which assumes semiconductor memory, but is happy to persist onto flash (just not spinning disk). Another is Kognitio, which is definitely lying when it claims its product was in-memory all along, but may or may not have redesigned its technology over the decades to have become more purely in-memory. (But if they have, what happened to all the previous disk-based users??)
Two other sources of confusion are:
- The broad variety of memory-centric data management approaches.
- The over-enthusiastic marketing of SAP HANA.
With all that said, here’s a little update on in-memory data management and related subjects.
- I maintain my opinion that traditional databases will eventually wind up in RAM.
- At conventional large enterprises — as opposed to for example pure internet companies — production deployments of HANA are probably comparable in number and investment to production deployments of Hadoop. (I’m sorry, but much of my supporting information for that is confidential.)
- Cloudera is emphatically backing Spark. And a key aspect of Spark is that, unlike most of Hadoop, it’s memory-centric.
- It has become common for disk-based DBMS to persist data through a “log-structured” architecture. That’s a whole lot like what you do for persistence in a fundamentally in-memory system.
- I’m also sensing increasing comfort with the strategy of committing writes as soon as they’ve been acknowledged by two or more nodes in RAM.
- I’ve never heard a story about an in-memory DBMS actually losing data. It’s surely happened, but evidently not often.
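As a minimal sketch of the log-structured persistence pattern mentioned above (assuming nothing about any particular product): writes go to an in-memory structure and to an append-only log, and recovery simply replays the log. The class and file names are invented for illustration.

```python
import json
import os
import tempfile

# Minimal sketch of log-structured persistence for an in-memory store.
# Writes hit RAM plus an append-only log; recovery replays the log.

class TinyMemStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):          # recovery: replay the log
            with open(log_path) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.data[k] = v

    def put(self, key, value):
        with open(self.log_path, "a") as f:   # sequential append, no seeks
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())              # durable before acknowledging
        self.data[key] = value                # then update RAM

    def get(self, key):
        return self.data.get(key)

path = os.path.join(tempfile.mkdtemp(), "store.log")
store = TinyMemStore(path)
store.put("k", 1)
print(TinyMemStore(path).get("k"))   # survives a "restart": prints 1
```

In the quorum-commit variant, the `fsync` would be replaced (or supplemented) by waiting for an acknowledgment from a second node's RAM before confirming the write.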
Cassandra’s reputation in many quarters is:
- World-leading in the geo-distribution feature.
- Impressively scalable.
- Hard to use.
This has led competitors to use, and get away with, sales claims along the lines of “Well, if you really need geo-distribution and can’t wait for us to catch up — which we soon will! — you should use Cassandra. But otherwise, there are better choices.”
My friends at DataStax, naturally, don’t think that’s quite fair. And so I invited them — specifically Billy Bosworth and Patrick McFadin — to educate me. Here are some highlights of that exercise.
DataStax and Cassandra have some very impressive accounts, which don’t necessarily revolve around geo-distribution. Netflix, probably the flagship Cassandra user — since Cassandra inventor Facebook adopted HBase instead — actually hasn’t been using the geo-distribution feature. Confidential accounts include:
- A petabyte or so of data at a very prominent company, geo-distributed, with 800+ nodes, in a kind of block storage use case.
- A messaging application at a very prominent company, anticipated to grow to multiple data centers and a petabyte or so of data, across 1000s of nodes.
- A 300 terabyte single-data-center telecom account (which I can’t find on DataStax’s extensive customer list).
- A huge health records deal.
- A Fortune 10 company.
DataStax and Cassandra won’t necessarily win customer-brag wars versus MongoDB, Couchbase, or even HBase, but at least they’re strongly in the competition.
DataStax claims that simplicity is now a strength. There are two main parts to that surprising assertion.
- DataStax claims that operation is simple, that operators are “bored”, that large users appreciate the ease of operation, and so on. These claims become a lot more plausible if you recall:
- Cassandra isn’t used for databases that resemble relational schemas with 1000s of tables, lots of foreign keys, and so on.
- Performance and capacity problems in Cassandra don’t necessarily require sophisticated operational solutions; you can throw hardware at them instead.
- DataStax claims that CQL (Cassandra Query Language) makes Cassandra programming and data modeling much easier than they were before. More on that below.
DataStax claims that Cassandra excels at time series use cases, where “time series” seem to equate to collections of short records with timestamps. This seems borne out by, for example, the first three use cases on my bulleted list above. Actually, it’s not just timestamps, but rather any data that is naturally ordered by a sequential field, such as packet IDs from a packet-switching network.
Finally, DataStax claims that Cassandra is good for high-velocity applications in general. A generic example that DataStax supported with some Very Big Names — whether those were of customers or prospects wasn’t entirely clear — was in retailing, to actually serve accurate information as to whether inventory is in stock, something Walmart failed at as recently as last year.
Now let’s talk a bit about Cassandra technology. I’ll start with an example. Imagine a “phone-home” use case in which many devices emit many records each in the form of (DeviceID, TimeStamp, MeterReading) triples.
- A relational database would store that as a bunch of rows, 3 columns wide.
- A Cassandra database, however, would have a single row for each DeviceID; each row would contain two columns for each (TimeStamp, MeterReading) pair.
- The column names are composite, in a way that shows the different column pairs are each recording the same kind of thing.
- Cassandra Query Language (CQL) lets you query (or insert) as if the data were in the relational-table logical format. But of course you can also reference Cassandra in a way that takes its actual (row, column) structure at face value.
So in essence, you have schemas that at once are dynamic and tabular. The big downside vs. a relational DBMS is that — duh! — you can’t have the benefits of normalization.
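The phone-home example's two layouts can be sketched in a few lines of Python (illustrative only; the dict stands in for Cassandra's wide-row storage, and the sample readings are invented):

```python
# Sketch of the phone-home example's two layouts.

# Relational layout: one 3-column row per reading.
relational = [
    ("device-1", "2014-01-01T00:00", 7.2),
    ("device-1", "2014-01-01T00:05", 7.4),
    ("device-2", "2014-01-01T00:00", 3.1),
]

# Cassandra-style layout: one wide row per DeviceID, holding a
# (TimeStamp, MeterReading) column pair per reading.
wide_rows = {}
for device_id, ts, reading in relational:
    wide_rows.setdefault(device_id, {})[ts] = reading

# "Last N events" reads are cheap because columns sort on the timestamp.
last_two = sorted(wide_rows["device-1"].items())[-2:]
print(last_two)
```

CQL then lets you query the wide row as if it were still the 3-column relational table, which is the "dynamic yet tabular" point above.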
For clarity, I should note that much of Cassandra’s logical architecture is shared by fellow BigTable-architecture data store HBase; it’s not a coincidence that Facebook invented Cassandra to support messaging, nor that when Facebook changed its mind about that, it adopted HBase as the alternative. Accumulo has similar characteristics as well.
Physically, what’s going on in Cassandra is something like this:
- Each Cassandra row is maintained in memory, and in most cases sorted on timestamp (or some other comparator), in either order. This is the basis for the claims of great Cassandra performance and general suitability specifically in time series use cases. (E.g., “Last 10 events” kinds of reads are very easy.)
- Once rows are flushed to disk, they are immutable … except that of course they eventually are compacted, typically via a merge sort. (When you do need to do a database update, last write wins.)
- Rows are organized into files on disk. There’s a “key cache” that in many cases will tell you exactly which file contains the row you’re looking for. If you have a cache miss …
- … each file has a Bloom filter predicting which keys it contains, and you interrogate those. Those Bloom filters are also maintained in memory (and copied to disk just for the sake of persistence).
Cassandra has few indexes, and no physical concept of datatype.
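To illustrate the per-file membership check in that read path, here is a toy Bloom filter. The parameters are arbitrary and the implementation is deliberately naive; real ones tune the bit-array size and hash count to the expected key population.

```python
import hashlib

# Toy Bloom filter, illustrating the per-file read-path check.
# A "no" answer is definitive; a "yes" answer is merely probable.

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0   # integer used as a bit array

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False -> definitely absent (skip this file);
        # True  -> probably present (go read the file).
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))    # True -- never a false negative
print(bf.might_contain("row-999"))   # almost certainly False
```

The payoff on a cache miss is that most data files answer "definitely not here" from memory, so only a file or two actually gets read.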
The benefits I see to this physical architecture are mainly:
- Plays nicely with Cassandra’s logical architecture.
- Plays nicely with scale-out.
- Seems to have been designed RAM-first, which matches how databases are actually used.
- Is fast for range queries on the comparator (e.g. timestamp).
- Doesn’t have a lot of knobs to twiddle, which makes it plausible that a relatively immature product can be easy to administer.
For some use cases, that’s not a bad list of advantages. Not bad at all.
It took me a bit of time, and an extra call with Vertica’s long-time R&D chief Shilpa Lawande, but I think I have a decent handle now on Vertica 7, code-named Crane. The two aspects of Vertica 7 I find most interesting are:
- Flex Zone, a schema-on-need technology very much like Hadapt’s (but of course with access to Vertica performance).
- What sounds like an alternate query execution capability for short-request queries, the big point of which is that it saves them from being broadcast across the whole cluster, hence improving scalability. (Adding nodes of course doesn’t buy you much for the portion of a workload that’s broadcast.)
Other Vertica 7 enhancements include:
- A lot of Bottleneck Whack-A-Mole.
- “Significant” improvements to the Vertica management console.
- Security enhancements (Kerberos), Hadoop integration enhancements (HCatalog), and enhanced integration with Hadoop security (Kerberos again).
- Some availability hardening. (“Fault groups”, which for example let you ensure that data is replicated not just to 2+ nodes, but also that the nodes aren’t all on the same rack.)
- Java as an option to do in-database analytics. (Who knew that feature was still missing?)
- Some analytic functionality. (Approximate COUNT DISTINCT, but not yet Approximate MEDIAN.)
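For flavor, here is the crudest form of approximate COUNT DISTINCT, a single Flajolet-Martin-style estimator. To be clear, this is not Vertica's actual algorithm (production implementations are more likely HyperLogLog-class, which averages many such estimators to tame the variance); it just shows why the approach is cheap.

```python
import hashlib

# Illustrative Flajolet-Martin-style distinct-count estimate.
# Memory cost is one small integer, regardless of input size.

def trailing_zeros(n):
    if n == 0:
        return 64
    tz = 0
    while n & 1 == 0:
        n >>= 1
        tz += 1
    return tz

def approx_count_distinct(values):
    max_tz = 0
    for v in values:
        h = int(hashlib.sha256(str(v).encode()).hexdigest(), 16)
        max_tz = max(max_tz, trailing_zeros(h))
    # Roughly 1 in 2^r hashes ends in r zero bits, so the deepest
    # run of zeros seen hints at the number of distinct inputs.
    return 2 ** max_tz   # single-estimator version; high variance

print(approx_count_distinct(range(1000)))
```

Duplicates hash identically, so they cannot move the estimate, which is the property that makes this a distinct count rather than a row count.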
Overall, two recurring themes in our discussion were:
- Load and ETL (Extract/Transform/Load) performance, and/or obviating ETL.
- Short-request performance, in the form of more scalable short-request concurrency.
Also, be warned that there are two entirely different key-value things going on in Vertica 7. I was pretty confused until I realized that.
Vertica Flex Zone basics include:
- Flex Zone is targeted at data that originates in, for example, a log file or a NoSQL DBMS.
- Flex Zone data can be stored in a Vertica construct called Flex Tables. It can also be accessed externally, but then of course performance is hampered by what amounts to a load operation for each query.
- Flex Zone data is always in a map datatype (i.e., lots of key-value pairs).
- Vertica automagically creates virtual columns on Flex Zone data. Virtual columns can be accessed (read-only) by SQL in the usual way, DML and DDL alike (Data Manipulation/Description Language). So in particular, business intelligence tools treat virtual columns the same way they’d treat real ones.
- Flex Zone virtual columns can drill into nested data structures.
- If you retrieve a virtual column, you retrieve the rest of the record (log entry, JSON document, whatever) with it. However …
- … virtual column data can be copied into ordinary Vertica physical columns with no change in SQL access; Vertica will redirect queries for you appropriately. At that point you get customary Vertica performance.
- Vertica (the HP division) points out that loading data into Flex Zone can be much faster than loading it into Vertica Classic.
- Vertica (the product), which is priced by data volume, costs much less for Flex Zone than for standard columnar data.
Basically, Flex Zone is meant to be (among other things) a big bit bucket, perhaps in some cases obviating the need for Hadoop to play the same role.
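The virtual-column idea can be sketched simply (the record contents and the `virtual_column` helper are invented for illustration; Vertica's real implementation is of course columnar and far more sophisticated):

```python
# Sketch of the Flex-Zone-style idea: records stored as key-value maps,
# with "virtual columns" projected out at query time.

flex_records = [
    {"user": "alice", "event": "click", "meta.page": "/home"},
    {"user": "bob",   "event": "buy",   "meta.page": "/cart", "amount": 20},
]

def virtual_column(records, name):
    # A virtual column is just a projection over every record's map;
    # records lacking the key yield NULL (None), as loose data would.
    return [r.get(name) for r in records]

print(virtual_column(flex_records, "user"))     # ['alice', 'bob']
print(virtual_column(flex_records, "amount"))   # [None, 20]
```

The dotted `meta.page` key gestures at the nested-structure drill-down; copying a hot virtual column into a real physical column is then just materializing one of these projections.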
I have less detail on the new short-request query executor, but I gather that:
- It assumes a query can be resolved on a single Vertica node. (Paradigmatic example: single-row lookup.)
- It involves client code that predicts which Vertica node can resolve the query.
- It has a key-value style interface, even though …
- … what is sent to the Vertica cluster is SQL.
- A SQL interface is planned.
I assume this will eventually evolve to the point that you can join a small, broadcasted dimension table to a single node’s portion of a fact table, but Vertica hasn’t actually told me that that kind of functionality is in the works.
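Client-side node prediction of the sort described above can be sketched as a simple hash-based router. This is purely illustrative; Vertica hasn't disclosed its actual routing scheme to me, and the node names and SQL are invented.

```python
import hashlib

# Sketch of client-side node prediction for single-row lookups:
# hash the key to pick the node expected to hold that row, so the
# query need not be broadcast across the whole cluster.

NODES = ["node-a", "node-b", "node-c"]

def predict_node(key, nodes=NODES):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

def lookup(key):
    node = predict_node(key)
    sql = f"SELECT * FROM t WHERE pk = '{key}';"   # what's sent is still SQL
    return node, sql

node, sql = lookup("order-123")
print(node, "<-", sql)
```

The scalability win is exactly the one the post describes: each short request touches one node, so adding nodes adds short-request throughput instead of just adding broadcast fan-out.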
Finally, and as is appropriate for a whole-number release, Vertica 7 has a lot of different performance enhancements, in loads, joins, and more. In particular, workload management has been extended from covering just RAM (which is usually Vertica’s scarcest commodity anyhow) to, in a limited sense, CPU as well. Specifically, queries can be “pinned” to specific cores, which for example lets short-request workloads be isolated from their longer-running brethren.
(In-database analytics were first added in Vertica 5.)
I think that most sufficiently large enterprise SaaS vendors should offer an appliance option, as an alternative to the core multi-tenant service. In particular:
- SaaS appliances address customer fears about security, privacy, compliance, performance isolation, and lock-in.
- Some of these benefits occur even if the appliance runs in the same data centers that host the vendor’s standard multi-tenant SaaS. Most of the rest occur if the customer can choose a co-location facility in which to place the appliance.
- Whether many customers should or will use the SaaS appliance option is somewhat secondary; it’s a check-mark item. I.e., many customers and prospects will be pleased that the option at least exists.
How I reached them
Core reasons for selling or using SaaS (Software as a Service) as opposed to licensed software start:
- The SaaS vendor handles all software upgrades, and makes them promptly. In principle, this benefit could also be achieved on a dedicated system on customer premises (or at the customer’s choice of co-location facility).
- In addition, the SaaS vendor handles all the platform and operational stuff — hardware, operating system, computer room, etc. This benefit is antithetical to direct customer control.
- The SaaS vendor only has to develop for and operate on a tightly restricted platform stack that it knows very well. This benefit is also enjoyed in the case of customer-premises appliances.
Conceptually, then, customer-premises SaaS is not impossible, even though one of the standard Big Three SaaS benefits is lost. Indeed:
- Microsoft Windows and many other client software packages already offer to let their updates be automagically handled by the vendor.
- In that vein, consumer devices such as game consoles already are a kind of SaaS appliance.
- Complex devices of any kind, including computers, will see ever more in the way of “phone-home” features or optional services, often including routine maintenance and upgrades.
But from an enterprise standpoint, that’s all (relatively) simple stuff. So we’re left with a more challenging question — does customer-premises SaaS make sense in the case of enterprise applications or other server software?
Why would a customer actually want on-premises SaaS, as opposed to the standard remote version? The first ideas that come to mind are:
- Security and/or privacy considerations, real or imagined. This is in fact the motivation behind the single case of on-premises enterprise SaaS I have confirmed, namely one that Cloudant told me about.* (I don’t have similar levels of detail about Glassbeam’s one on-premises subscription customer.)
- Similarly, a less specific desire for isolation …
- … and/or control.
- Avoiding the expense of data movement to/from a remote location. For example, an enterprise might use SaaS OLTP (OnLine Transaction Processing) apps whose results it wants to stream to an on-premises data warehouse. Or the enterprise might have lower-volume but also lower-latency — and hence more costly — data integration needs, perhaps between different OLTP application suites, or with some MDM (Master Data Management) in the mix.
And, um — that’s about all I’ve got.
*Yes, I know Cloudant is DBaaS — but to me that’s a kind of SaaS, in which the S just happens to center around a DBMS.
Confusing matters further, there’s a middle option as well. salesforce.com and HP just announced that salesforce.com apps will, for the first time, run on dedicated customer-specific racks. But this will only be within the same data centers and operation groups that handle the rest of salesforce.com’s system. Notes on what’s being called a “pod” strategy start:
- This suggests a perceived demand for isolation.
- If these pods are offered to accounts large enough to saturate a bunch of servers each, they need not be much more expensive than the multi-tenant version of salesforce’s offering.
- Other than (fully-loaded) cost, it’s hard to see a downside to this vs. multi-tenant SaaS.
Notwithstanding ever-increasing levels of comfort with SaaS and cloud computing, I’d guess that a number of enterprises will find the extra cost of single-tenant SaaS easier to swallow than the queasiness they feel about multi-tenant alternatives. And so I think there’s a place for single-customer enterprise SaaS stacks somewhere; the main remaining question is where they will be located.
Ducking that question a bit longer, let me note that:
- In any scenario, we’re most likely talking about something like SaaS appliances. Customer-premises server SaaS that isn’t in some kind of appliance form is madness (unless the SaaS vendor is paid for on-site support as well), because no SaaS vendor wants to support hardware it can’t specify or control.
- Some enterprises in some countries will surely insist on keeping data within national borders, for reasons of geo-compliance. Hence there will be a need to deploy SaaS appliances either literally to their premises, or else to an in-country co-location facility, perhaps managed by a big telecom firm. Of course, that need can only arise if vendors first overcome issues of software nationalization — language, regulations, other business customs, whatever.
- The final point in my recent SaaS discussion post was about lock-in. If you use something that only runs in the supplier’s data centers, your lock-in is even worse than it is with most enterprise IT technology.
- I can’t currently think of many examples in which SaaS appliances need to be located directly in the customers’ main data centers. When you get to data flows and volumes big enough for that to matter, you’re likely talking about the kinds of internet applications that probably shouldn’t be on-premises in the first place.
And that finally brings us to the opinions I copied up top.
I think that most sufficiently large enterprise SaaS vendors should offer an appliance option, as an alternative to the core multi-tenant service. In particular:
- SaaS appliances address customer fears about security, privacy, compliance, performance isolation, and lock-in.
- Some of these benefits occur even if the appliance runs in the same data centers that host the vendor’s standard multi-tenant SaaS. Most of the rest occur if the customer can choose a co-location facility in which to place the appliance.
- Whether many customers should or will use the SaaS appliance option is somewhat secondary; it’s a check-mark item. I.e., many customers and prospects will be pleased that the option at least exists.
- Naomi Bloom is a SaaS purist, who would presumably deplore the whole concept of “customer-premises SaaS”.
Generalizing about SaaS (Software as a Service) is hard. To prune some of the confusion, let’s start by noting:
- SaaS has been around for over half a century, and at times has been the dominant mode of application delivery.
- The term multi-tenancy is being used in several different ways.
- Multi-tenancy, in the purest sense, is inessential to SaaS. It’s simply an implementation choice that has certain benefits for the SaaS provider. And by the way, …
- … salesforce.com, the chief proponent of the theory that true multi-tenancy is the hallmark of true SaaS, abandoned that position this week.
- Internet-based services are commonly, if you squint a little, SaaS. Examples include but are hardly limited to Google, Twitter, Dropbox, Intuit, Amazon Web Services, and the company that hosts this blog (KnownHost).
- Some of the core arguments for SaaS’ rise, namely the various efficiencies of data center outsourcing and scale, apply equally to the public cloud, to SaaS, and to AEaaS (Anything Else as a Service).
- These benefits are particularly strong for inherently networked use cases. For example, you really don’t want to be hosting your website yourself. And salesforce.com got its start supporting salespeople who worked out of remote offices.
- In theory and occasionally in practice, certain SaaS benefits, namely the outsourcing of software maintenance and updates, could be enjoyed on-premises as well. Whether I think that could be a bigger deal going forward will be explored in future posts.
For smaller enterprises, the core outsourcing argument is compelling. How small? Well:
- What’s the minimum level of IT operations headcount needed for mission-critical systems? Let’s just say “several”.
- What does that cost? Fully burdened, somewhere in the six figures.
- What fraction of the IT budget should such headcount be? As low a double digit percentage as possible.
- What fraction of revenues should be spent on IT? Some single-digit percentage.
So except for special cases, an enterprise with less than $100 million or so in revenue may have trouble affording on-site data processing, at least at a mission-critical level of robustness. It may well be better to use NetSuite or something like that, assuming needed features are available in SaaS form.*
*Truth be told, I’m not up to speed on mid-range SaaS application suite alternatives.
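The back-of-envelope arithmetic above can be made explicit. All figures are the post's own rough assumptions, not data:

```python
# Worked version of the back-of-envelope arithmetic above.

ops_headcount = 3            # "several" operations staffers, minimum
cost_per_head = 150_000      # fully burdened, "six figures"
ops_cost = ops_headcount * cost_per_head          # $450K/year

ops_share_of_it = 0.10       # "as low a double-digit percentage as possible"
it_budget = ops_cost / ops_share_of_it            # $4.5M/year

it_share_of_revenue = 0.05   # "some single-digit percentage"
min_revenue = it_budget / it_share_of_revenue     # $90M

print(f"${min_revenue / 1e6:.0f}M")   # on the order of $100 million
```

Tweak any assumption by a factor of two and the threshold moves accordingly, but it stays in the high-eight/low-nine-figure revenue range either way.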
Continuing that thought — if you’re a mid-range application software provider, you have to develop a SaaS version of your product line. That’s a very different business model than the apps + OEMed platform you’re probably providing now, but it’s the best way to serve your customers going forward. And by the way — while mid-range application software is commonly sold on a regional basis, SaaS can be sold more globally; after all, the need for onsite service is eliminated, and price points should in most cases fit with telephone sales. Yes, national language and regional data privacy rules are both concerns, but they still leave the available markets looking much bigger than regional resellers have traditionally enjoyed. So expect shake-outs in a whole lot of vertical markets, as vendors horn in on each other’s territories, and a few elephantine winners perhaps emerge.
The argument above assumes that extreme reliability is needed. So there’s nothing necessarily wrong with a small team of business analysts sticking an RDBMS appliance* in a corner and managing it themselves. If it sputters from time to time, who cares; using it still may be easier than getting that data in and out of the cloud. But eventually, if all the data is remote anyway — SaaS, website, etc. — then it may make sense to do analytics remotely as well.
*Previously, that appliance might have been from Netezza; now, my first thought is the cheaper — albeit more limited — Infobright.
The arguments that direct smaller companies toward SaaS apply to large enterprises too, but they aren’t as dispositive. Larger enterprises can actually afford to do their own IT operations if they want to. What’s more, moving away from in-house operations is harder for big firms, due to the larger and more customized portfolio of legacy systems they’re likely to have. That said:
- Almost all enterprises should have their internet-facing systems offsite, even if just via co-location. The core reasons are that ingesting high-volume inbound network traffic is inherently difficult, and security issues make it much tougher yet. In addressing these challenges, specialists enjoy significant economies of scale.
- Most enterprises will have plenty of SaaS silos. If nothing else:
- Complex machinery will increasingly “phone home” for help staying in good working order. That’s a form of SaaS.
- Information providers and aggregators tend to deliver via SaaS.
- Various kinds of collaboration and communication apps, from Google Mail to Dropbox, live in the cloud. Personal productivity applications, from word processing to Photoshop, may be following.
- “Rodney Dangerfield” departments — i.e., ones unhappy with the respect and attention they get from central IT — often turn to SaaS or similar outsourcing. Human resources is an obvious example, from Automatic Data Processing to Employease to, these days, Workday.
That leaves us with the questions as to when and how large enterprises should or will move their core applications to SaaS and/or the cloud. Given the length of this post, I won’t try to answer them now. But for starters:
- Enterprises don’t like to rip and replace their apps, except in consolidation projects, as long as they can avoid doing so.
- Cloud/remote computing economies are less convincing if you already have your computer rooms staffed and set up.
- A key benefit of SaaS is that vendors control and drive the upgrade cycles. One cost of that is restrictions on customization, although you can also build apps and app extensions on PaaS/DBaaS/WaaS (Platform/DataBase/Whatever as a Service) offerings such as force.com.
- Lock-in is a serious concern, for application and platform offerings alike. Not only are you betting on one vendor’s software black box, you’re also betting on its remote computing operation. If you grow dissatisfied with either, or with their pricing, you may not have much opportunity to escape.
I talked tonight with Lee Edlefsen, Chief Scientist of Revolution Analytics, and now think I understand Revolution’s parallel R much better than I did before.
There are four primary ways that people try to parallelize predictive modeling:
- They can run the same algorithm on different parts of a dataset on different nodes, then return all the results, and claim they’ve parallelized. This is trivial and not really a solution. It is also the last-ditch fallback position for those who parallelize more seriously.
- They can generate intermediate results from different parts of a dataset on different nodes, then generate and return a single final result. This is what Revolution does.
- They can parallelize the linear algebra that underlies so many algorithms. Netezza and Greenplum tried this, but I don’t think it worked out very well in either case. Lee cited a saying in statistical computing: “If you’re using matrices, you’re doing it wrong”; he thinks shortcuts and workarounds are almost always the better way to go.
- They can jack up the speed of inter-node communication, perhaps via MPI (Message Passing Interface), so that full parallelization isn’t needed. That’s SAS’ main approach.
One confusing aspect of this discussion is that it could reference several heavily-overlapping but not identical categories of algorithms, including:
- External memory algorithms, which operate on datasets too big to fit in main memory, by — for starters — reading in and working on a part of the data at a time. Lee observes that these are almost always parallelizable.
- What Revolution markets as External Memory Algorithms, which are those external memory algorithms it has gotten around to implementing so far. These are all parallelized. They are also all in the category of …
- … algorithms that can be parallelized by:
- Operating on data in parts.
- Getting intermediate results.
- Combining them in some way for a final result.
- Algorithms of the previous category where the combining step specifically takes the form of summation, such as those discussed in the famous paper Map-Reduce for Machine Learning on Multicore. Not all of Revolution’s current parallel algorithms fall into this group.
To be clear, all Revolution’s parallel algorithms are in Category #2 by definition and Category #3 in practice. However, they aren’t all in Category #4.
The canonical example of how to parallelize an algorithm via intermediate results is taking the mean of a large set of numbers. Specifically:
- For each subset of data, you both count the entries and sum the values.
- Then to combine those intermediate results:
- You sum the sums.
- You also sum the counts.
- You divide the former result by the latter.
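The steps above can be sketched in a few lines of code. (I’m using Python for illustration; Revolution’s actual implementation is of course in R and C, and all names here are my own.)

```python
# Minimal sketch of the intermediate-results pattern: compute the mean
# of a large dataset by reducing each chunk to a (sum, count) pair,
# then combining the pairs. The combine step is commutative and
# associative, so it doesn't matter how the data was split up or in
# what order the pieces come back.

def chunk_stats(chunk):
    """Per-chunk step: reduce one part of the data to (sum, count)."""
    return (sum(chunk), len(chunk))

def combine(a, b):
    """Combine two intermediate results; commutative and associative."""
    return (a[0] + b[0], a[1] + b[1])

def parallel_mean(chunks):
    total, count = 0, 0
    for stats in map(chunk_stats, chunks):  # each call could run on a different node
        total, count = combine((total, count), stats)
    return total / count

data = list(range(1, 101))  # 1 through 100; true mean is 50.5
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
print(parallel_mean(chunks))  # → 50.5
```

The end result is identical however the data is partitioned, which is exactly the property the next paragraphs are about.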
Unfortunately, it’s hard to articulate a precise characterization of these parallelizable algorithms. That said:
- What you want is for the end result to be identical irrespective of how the data is split up. (Duh!)
- Lee suggested that it is sufficient but not necessary that the way of combining the intermediate results be both commutative and associative.
- To date, all of Revolution’s algorithms are — you guessed it! — commutative and associative.
I asked Lee about algorithms that were inherently difficult to parallelize in this style, and he expressed optimism that some other approach would usually work in practice. In particular, we had a lively discussion about finding the exact median, or more generally finding n-tiles and the whole “empirical distribution”. Lee said that, for example, it is extremely fast to bin billions of values into 10,000 buckets. Further, he suggested it is very fast in general to do the operation for integer values, and hence also for any values with a reasonably limited number of significant digits.
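To make the binning idea concrete, here’s a rough sketch (again in Python rather than R; the function names and bucket count are my own, and real implementations would refine the answer by re-binning within the winning bucket):

```python
# Approximate median via binning: histogram the values into equal-width
# buckets, then walk the cumulative counts to find the bucket containing
# the middle value. Per-node bucket counts can simply be summed, since
# addition is commutative and associative -- so this fits the
# intermediate-results pattern discussed above.

def bin_counts(values, lo, hi, n_bins):
    """Per-chunk step: histogram values into n_bins equal-width buckets."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def approx_median(counts, lo, hi):
    """Combine/finish step: midpoint of the bucket holding the middle value."""
    width = (hi - lo) / len(counts)
    target = (sum(counts) + 1) / 2
    cumulative = 0
    for i, c in enumerate(counts):
        cumulative += c
        if cumulative >= target:
            return lo + (i + 0.5) * width
    return hi

values = list(range(1000))  # true median is 499.5
counts = bin_counts(values, 0, 1000, 100)
print(approx_median(counts, 0, 1000))  # → 505.0, within one bucket width
```

With enough buckets (Lee’s 10,000-bucket example), the error bound shrinks accordingly, and for integer values a second pass over the winning bucket can make the answer exact.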
As should be clear from this discussion, Revolution’s parallel algorithms are indeed parallel for any reasonable kind of distribution of work. Although they were shipped first for multi-core single-server and MPI environments, the recent ports to Teradata and generic Hadoop MapReduce seem to have been fairly straightforward. Revolution seems to have good modularity between the algorithms themselves, the intermediate data passing, and the original algorithm launch, and hence makes strong claims of R code portability — but the list of exceptions in “portable except for ____” did seem to lengthen a bit each time we returned to the subject.
Finally, notes on Revolution’s Teradata implementation include:
- There’s a master process (external stored procedure) which then generates SQL and invokes table operators.
- The whole thing runs in protected mode (i.e. out-of-process). Lee thinks that there’s only a small performance penalty vs. in-process.
- (For some reason I found this amusing) When you send an R job to Teradata, the R code itself is shipped via ODBC.
while notes on Revolution’s initial Hadoop implementation start:
- One way it talks to data in HDFS (Hadoop Distributed File System) is through LibHDFS. The other, when available, is ODBC.
- It uses generic MapReduce. Faster alternatives may be implemented down the road.
- Teradata is seeing interest in in-database R. (September, 2013)