A huge fraction of analytics is about monitoring. People rarely want to frame things in those terms; evidently they think “monitoring” sounds boring or uncool. One cost of that silence is that it’s hard to get good discussions going about how monitoring should be done. But I’m going to try anyway, yet again.
Business intelligence is largely about monitoring, and the same was true of predecessor technologies such as green paper reports or even pre-computer techniques. Two of the top uses of reporting technology can be squarely described as monitoring, namely:
- Watching whether trends are continuing or not.
- Seeing if there are any events — actual or impending as the case may be — that call for response, in areas such as:
- Machine breakages (computer or general metal alike).
- Resource shortfalls (e.g. various senses of “inventory”).
Yes, monitoring-oriented BI needs investigative drilldown, or else it can be rather lame. Yes, purely investigative BI is very important too. But monitoring is still the heart of most BI desktop installations.
Predictive modeling is often about monitoring too. It is common to use statistics or machine learning to help you detect and diagnose problems, and many such applications have a strong monitoring element.
I.e., you’re predicting trouble before it happens, when there’s still time to head it off.
As for incident response, in areas such as security — any incident you respond to has to be noticed first Often, it’s noticed through analytic monitoring.
Hopefully, that’s enough of a reminder to establish the great importance of analytics-based monitoring. So how can the practice be improved? At least three ways come to mind, and only one of those three is getting enough current attention.
The one that’s trendy, of course, is the bringing of analytics into “real-time”. There are many use cases that genuinely need low-latency dashboards, in areas such as remote/phone-home IoT (Internet of Things), monitoring of an enterprise’s own networks, online marketing, financial trading and so on. “One minute” is a common figure for latency, but sometimes a couple of seconds are all that can be tolerated.
I’ve posted a lot about all this, for example in posts titled:
One particular feature that could help with high-speed monitoring is to meet latency constraints via approximate query results. This can be done entirely via your BI tool (e.g. Zoomdata’s “query sharpening”) or more by your DBMS/platform software (the Snappy Data folks pitched me on that approach this week).
Perennially neglected, on the other hand, are opportunities for flexible, personalized analytics. (Note: There’s a lot of discussion in that link.) The best-acknowledged example may be better filters for alerting. False negatives are obviously bad, but false positives are dangerous too. At best, false positives are annoyances; but too often, alert fatigue causes you employees to disregard crucial warning signals altogether. The Gulf of Mexico oil spill disaster has been blamed on that problem. So was a fire in my own house. But acknowledgment != action; improvement in alerting is way too slow. And some other opportunities described in the link above aren’t even well-acknowledged, especially in the area of metrics customization.
Finally, there’s what could be called data anomaly monitoring. The idea is to check data for surprises as soon as it streams in, using your favorite techniques in anomaly management. Perhaps an anomaly will herald a problem in the data pipeline. Perhaps it will highlight genuinely new business information. Either way, you probably want to know about it.
David Gruzman of Nestlogic suggests numerous categories of anomaly to monitor for. (Not coincidentally, he believes that Nestlogic’s technology is a great choice for finding each of them.) Some of his examples — and I’m summarizing here — are:
- Changes in data format, schema, or availability. For example:
- Data can completely stop coming in from a particular source, and the receiving system might not immediately realize that. (My favorite example is the ad tech firm that accidentally stopped doing business in the whole country of Australia.)
- A data format change might make data so unreadable it might as well not arrive.
- A decrease in the number of approval fields might highlight a questionable change in workflow.
- Data quality NULLs or malformed values might increase suddenly, in particular fields and data segments.
- Data value distribution This category covers a lot of cases. A few of them are:
- A particular value is repeated implausibly often. A bug is the likely explanation.
- E-commerce results suddenly decrease, but only from certain client technology configuration. Probably there is a bug affecting only those particular clients.
- Clicks suddenly increase from certain client technologies. A botnet might be at work.
- Sales suddenly increase from a particular city. Again this might be fraud — or more benignly, perhaps some local influencers have praised your offering.
- A particular medical diagnosis becomes much more common in a particular city. Reasons can range from fraud, to a new facility for certain kinds of tests, to a genuine outbreak of disease.
David offered yet more examples of significant anomalies, including ones that could probably only be detected via Nestlogic’s tools. But the ones I cited above can probably be found via any number of techniques — and should be, more promptly and accurately than they currently are.
0. Matt Brandwein of Cloudera briefed me on the new Cloudera Data Science Workbench. The problem it purports to solve is:
- One way to do data science is to repeatedly jump through the hoops of working with a properly-secured Hadoop cluster. This is difficult.
- Another way is to extract data from a Hadoop cluster onto your personal machine. This is insecure (once the data arrives) and not very parallelized.
- A third way is needed.
Cloudera’s idea for a third way is:
- You don’t run anything on your desktop/laptop machine except a browser.
- The browser connects you to a Docker container that holds (and isolates) a kind of virtual desktop for you.
- The Docker container runs on your Cloudera cluster, so connectivity-to-Hadoop and security are handled rather automagically.
In theory, that’s pure goodness … assuming that the automagic works sufficiently well. I gather that Cloudera Data Science Workbench has been beta tested by 5 large organizations and many 10s of users. We’ll see what is or isn’t missing as more customers take it for a spin.
1. Recall that Cloudera installations have 4 kinds of nodes. 3 are obvious:
- Hadoop worker nodes.
- Hadoop master nodes.
- Nodes that run Cloudera Manager.
The fourth kind are edge/gateway nodes. Those handle connections to the outside world, and can also run selected third-party software. They also are where Cloudera Data Science Workbench lives.
2. One point of this architecture is to let each data scientist run the languages and tools of her choice. Docker isolation is supposed to make that practical and safe.
And so we have a case of the workbench metaphor actually being accurate! While a “workbench” is commonly just an integrated set of tools, in this case it’s also a place for you to use other tools your personally like and bring in.
Surely there are some restrictions as to which tools you can use, but I didn’t ask for those to be spelled out.
3. Matt kept talking about security, to an extent I recall in almost no other analytics-oriented briefing. This had several aspects.
- As noted above, a lot of the hassle of Hadoop-based data science relates to security.
- As also noted above, evading the hassle by extracting data is a huge security risk. (If you lose customer data, you’re going to have a very, very bad day.)
- According to Matt, standard uses of notebook tools such as Jupyter or Zeppelin wind up having data stored wherever code is. Cloudera’s otherwise similar notebook-style interface evidently avoids that flaw. (Presumably, it you want to see the output, you rerun the script against the data store yourself.)
4. To a first approximation, the target users of Cloudera Data Science Workbench can be characterized the same way BI-oriented business analysts are. They’re people with:
- Sufficiently good quantitative skills to do the analysis.
- Sufficiently good computer skills to do SQL queries and so on, but not a lot more than that.
Of course, “sufficiently good quantitative skills” can mean something quite different in data science than it does for the glorified arithmetic of ordinary business intelligence.
5. Cloudera Data Science Workbench doesn’t have any special magic in parallelization. It just helps you access the parallelism that’s already out there. Some algorithms are easy to parallelize. Some libraries have parallelized a few algorithms beyond that. Otherwise, you’re on your own.
6. When I asked whether Cloudera Data Science Workbench was open source (like most of what Cloudera provides) or closed source (like Cloudera Manager), I didn’t get the clearest of answers. On the one hand, it’s a Cloudera-specific product, as the name suggests; on the other, it’s positioned as having been stitched together almost entirely from a collection of open source projects.
For starters, let me say:
- SequoiaDB, the company, is my client.
- SequoiaDB, the product, is the main product of SequoiaDB, the company.
- SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
- SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
- … and is usually sold for RDBMS-like use cases …
- … except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
- SequoiaDB’s products are open source.
- SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
- Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.
- SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
- Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations
- SequoiaDB has >100 employees, a large majority of which are split fairly evenly between “engineering” and “implementation and technical support”.
- SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
- SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)
Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.
While I usually think that the advantages of open source are overstated, in SequoiaDB’s case open source will have an additional benefit when SequoiaDB does go international — it addresses any concerns somebody might have about using Chinese technology.
SequoiaDB’s technology story starts:
- SequoiaDB is a layered DBMS.
- It manages JSON via update-in-place. MVCC (Multi-Version Concurrency Control) is on the roadmap.
- Indexes are B-tree.
- Transparent sharding and elasticity happen in what by now is the industry-standard/best-practices way:
- There are many (typically 4096) logical partitions, many of which are assigned to each physical partition.
- If the number of physical partitions changes, logical partitions are reassigned accordingly.
- Relational OLTP (OnLine Transaction Processing) functionality is achieved by using a kind of PostgreSQL front end.
- Relational batch processing is done via SparkSQL.
- There also is a block/LOB (Large OBject) storage engine meant for content management applications.
- SequoiaCM boils down technically to:
- SequoiaDB, which is used to store JSON metadata about the LOBs …
- … and whose generic-DBMS coordination capabilities are also used over the block/LOB engine.
- A Java library focused on content management.
SequoiaDB’s relationship with PostgreSQL is complicated, but as best I understand SequoiaDB’s relational operations:
- SQL parsing, optimization, and so on rely mainly on PostgreSQL code. (Of course, there are some hacks, such as to the optimizer’s cost functions.)
- Actual data storage is done via SequoiaDB’s JSON store, using PostgreSQL Foreign Data Wrappers. Each record goes in a separate JSON document. Locks, commits and so on — i.e. “write prevention” — are handled by the JSON store.
- PostgreSQL’s own storage engine is actually part of the stack, but only to manage temp space and the like.
PostgreSQL stored procedures are already in the SequoiaDB product. Triggers and referential integrity are not. Neither, so far as I can tell, are PostgreSQL’s datatype extensibility capabilities.
I neglected to ask how much of that remains true when SparkSQL is invoked.
SequoiaDB’s use cases to date seem to fall mainly into three groups:
- Content management via SequoiaCM.
- “Operational data lakes”.
- Pretty generic replacement of legacy RDBMS.
Internet back-ends, however — and this is somewhat counter-intuitive for an open-source JSON store — are rare, at least among paying subscription customers. But SequoiaDB did tell me of one classic IoT (Internet of Things) application, with lots of devices “phoning home” and the results immediately feeding a JSON-based dashboard.
To understand SequoiaDB’s “operational data lake” story, it helps to understand the typical state of data warehousing at SequoiaDB’s customers and prospects, which isn’t great:
- 2-3 years of data, and not all the data even from that time period.
- Only enough processing power to support structured business intelligence …
- … and hence little opportunity for ad-hoc query.
SequoiaDB operational data lakes offer multiple improvements over that scenario:
- They hold as much relational data as customers choose to dump there.
- That data can be simply copied from operational stores, with no transformation.
- Or if data arrives via JSON — from external organizations or micro-services as the case may be — the JSON can be stored unmodified as well.
- Queries can be run straight against this data soup.
- Of course, views can also be set up in advance to help with querying.
Views are particularly useful with what might be called slowly changing schemas. (I didn’t check whether what SequoiaDB is talking about matches precisely with the more common term “slowly changing dimensions”.) Each time the schema changes, a new table is created in SequoiaDB to receive copies of the data. If one wants to query against the parts of the database structure that didn’t change — well, a view can be establish to allow for that.
Finally, it seems that SequoiaCM uses are concentrated in what might be called “security and checking-up” areas, such:
- Photographs as part of an authentication process.
- Video of in-person banking transactions, both for fraud prevention and for general service quality assurance.
- Storage of security videos (for example from automated teller machines).
SequoiaCM deals seem to be bigger than other SequoiaDB ones, surely in part because the amounts of data managed are larger.
That said, while Steve Bannon is firmly established as Trump’s puppet master, they don’t agree on quite everything, and one of the documented disagreements had been in their view of skilled, entrepreneurial founder-type immigrants: Bannon opposes them, but Trump has disagreed with his view. And as per the speech, Trump seems to be maintaining his disagreement.
At least, that seems implied by his call for “a merit-based immigration system.”
And by the way — Trump managed to give a whole speech without saying anything overtly racist. Indeed, he specifically decried the murder of an Indian-immigrant engineer. By Trump standards, that counts as a kind of progress.
I’d like to argue that a single frame can be used to view a lot of the issues that we think about. Specifically, I’m referring to coordination, which I think is a clearer way of characterizing much of what we commonly call communication or collaboration.
It’s easy to argue that computing, to an overwhelming extent, is really about communication. Most obviously:
- Data is constantly moving around — across wide area networks, across local networks, within individual boxes, or even within particular chips.
- Many major developments are almost purely about communication. The most important computing device today may be a telephone. The World Wide Web is essentially a publishing platform. Social media are huge. Etc.
Indeed, it’s reasonable to claim:
- When technology creates new information, it’s either analytics or just raw measurement.
- Everything else is just moving information around, and that’s communication.
A little less obvious is the much of this communication could be alternatively described as coordination. Some communication has pure consumer value, such as when we talk/email/Facebook/Snapchat/FaceTime with loved ones. But much of the rest is for the purpose of coordinating business or technical processes.
Among the technical categories that boil down to coordination are:
- Operating systems.
- Anything to do with distributed computing.
- Anything to do with system or cluster management.
- Anything that’s called “collaboration”.
That’s a lot of the value in “platform” IT right there.
Meanwhile, in pre-internet apps:
- Some of the early IT wins were in pure accounting and information management. But a lot of the rest were in various forms of coordination, such as logistics and inventory management.
- The glory days of enterprise apps really started with SAP’s emphasis on “business process'”. (“Business process reengineering” was also a major buzzword back in the day.)
This also all fits with the “route” part of my claim that “historically, application software has existed mainly to record and route information.”
And in the internet era:
- “Sharing economy” companies, led by Uber and Airbnb, have created a lot more shareholder value than the most successful pure IT startups of the era.
- Amazon, in e-commerce and cloud computing alike, has run some of the biggest coordination projects of all.
This all ties into one of the key underlying subjects to modern politics and economics, namely the future of work.
- Globalization is enabled by IT’s ability to coordinate far-flung enterprises.
- Large enterprises need fewer full-time employees when individual or smaller-enterprise contractors are easier to coordinate. (It’s been 30 years since I drew a paycheck from a company I didn’t own.)
- And of course, many white collar jobs are being entirely automated away, especially those that can be stereotyped as “paper shuffling”.
By now, I hope it’s clear that “coordination” covers a whole lot of IT. So why do I think using a term with such broad application adds any clarity? I’ve already given some examples above, in that:
- “Coordination” seems clearer than “communication” when characterizing the essence of distributed computing.
- “Coordination” seems clearer than “communication” if we’re discussing the functioning of large enterprises or of large-enterprise-substitutes.
Further — even when we focus on the analytic realm, the emphasis on “coordination” has value. A big part of analytic value comes in determining when to do something. Specifically that arises when:
- Analytics identifies a problem that just occurred, or is about to happen, allowing a timely fix.
- Business intelligence is using for monitoring, of impending problems or otherwise, as a guide to when action is needed.
- Logistics of any kind get optimized.
I’d also say that most recommendation/personalization fits into the “coordination” area, but that’s a bit more of a stretch; you’re welcome to disagree.
I do not claim that analytics’ value can be wholly captured by the “coordination” theme. Decisions about whether to do something major — or about what to do — are typically made by small numbers of people; they turn into major coordination exercises only after a project gets its green light. But such cases, while important, are pretty rare. For the most part, analytic results serve as inputs to business processes. And business processes, on the whole, typically have a lot to do with coordination.
Bottom line: Most of what’s valuable in IT relates to communication or coordination. Apparent counterexamples should be viewed with caution.
- Of course, some claims about coordination wind up being quite bogus. That’s the basis for my long-ago rants about dashboards and “balanced scorecards”.
- I’ve posted multiple times about analytic use cases, emphasizing areas such as “early warning” or optimization.
- I also sometimes specifically break down BI use cases.
- We should all be thinking about the future of work.
The United States and consequently much of the world are in political uproar. Much of that is about very general and vital issues such as war, peace or the treatment of women. But quite a lot of it is to some extent tech-industry-specific. The purpose of this post is outline how and why that is.
- There’s a worldwide backlash against “elites” — and tech industry folks are perceived as members of those elites.
- That perception contains a lot of truth, and not just in terms of culture/education/geography. Indeed, it may even be a bit understated, because trends commonly blamed on “trade” or “globalization” often have their roots in technological advances.
- There’s a worldwide trend towards authoritarianism. Surveillance/ privacy and censorship issues are strongly relevant to that trend.
- Social media companies are up to their neck in political considerations.
Because they involve grave threats to liberty, I see surveillance/privacy as the biggest technology-specific policy issues in the United States. (In other countries, technology-driven censorship might loom larger yet.) My views on privacy and surveillance have long been:
- Fixing the legal frameworks around information use is a difficult and necessary job. The tech community should be helping more than it is.
- Until those legal frameworks are indeed cleaned up, the only responsible alternative is to foot-drag on data collection, on data retention, and on the provision of data to governmental agencies.
Given the recent election of a US president with strong authoritarian tendencies, that foot-dragging is much more important than it was before.
Other important areas of technology/policy overlap include:
- The new head of the Federal Communications Commission is hostile to network neutrality. (Perhaps my compromise proposal for partial, market-based network neutrality should get another look some day.)
- There’s a small silver lining in Trump’s attacks on free trade; the now-abandoned (at least by the US) Trans-Pacific Partnership had gone too far on “intellectual property” rights.
- I’m a skeptic about software patents.
- Government technology procurement processes have long been broken.
- “Sharing economy” companies such as Uber and Airbnb face a ton of challenges in politics and regulation, often on a very local basis.
And just over the past few days, the technology industry has united in opposing the Trump/Bannon restrictions on valuable foreign visitors.
Tech in the wider world
Technology generally has a huge impact on the world. One political/economic way of viewing that is:
- For a couple of centuries, technological advancement has:
- Destroyed certain jobs.
- Replaced them directly with a smaller number of better jobs.
- Increased overall wealth, which hopefully leads to more, better jobs in total.
- Over a similar period, improvements in transportation technology have moved work opportunities from richer countries to poorer areas (countries or colonies as the case may be). This started in farming and extraction, later expanded to manufacturing, and now includes “knowledge workers” as well.
- Both of these trends are very strong in the current computer/internet era.
- Many working- and middle-class people in richer countries now feel that these trends are leaving them worse off.
- To some extent, they’re confusing correlation and causality. (The post-WW2 economic boom would have slowed no matter what.)
- To some extent, they’re ignoring the benefits of technology in their day to day lives. (I groan when people get on the internet to proclaim that technology is something bad.)
- To some extent, however, they are correct.
Further, technology is affecting how people relate to each other, in multiple ways.
- This is obviously the case with respect to cell phones and social media.
- Also, changes to the nature of work naturally lead to changes in the communities where the workers live.
For those of us with hermit-like tendencies or niche interests, that may all be a net positive. But others view these changes less favorably.
Summing up: Technology induces societal changes of such magnitudes as to naturally cause (negative) political reactions.
And in case you thought I was exaggerating the political threat to the tech industry …
… please consider the following quotes from Trump’s most powerful advisor, Steve Bannon:
The “progressive plutocrats in Silicon Valley,” Bannon said, want unlimited ability to go around the world and bring people back to the United States. “Engineering schools,” Bannon said, “are all full of people from South Asia, and East Asia. . . . They’ve come in here to take these jobs.” …
“Don’t we have a problem with legal immigration?” asked Bannon repeatedly.
“Twenty percent of this country is immigrants. Is that not the beating heart of this problem?”
I plan to keep updating the list of links at the bottom of my post Politics and policy in the age of Trump.
The United States presidency was recently assumed by an Orwellian lunatic.* Sadly, this is not an exaggeration. The dangers — both of authoritarianism and of general mis-governance — are massive. Everybody needs in some way to respond.
*”Orwellian lunatic” is by no means an oxymoron. Indeed, many of the most successful tyrants in modern history have been delusional; notable examples include Hitler, Stalin, Mao and, more recently, Erdogan. (By way of contrast, I view most other Soviet/Russian leaders and most jumped-up-colonel coup leaders as having been basically sane.)
There are many candidates for what to focus on, including:
- Technology-specific issues — e.g. privacy/surveillance, network neutrality, etc.
- Issues in which technology plays a large role — e.g. economic changes that affect many people’s employment possibilities.
- Subjects that may not be tech-specific, but are certainly of great importance. The list of candidates here is almost endless, such as health care, denigration of women, maltreatment of immigrants, or the possible breakdown of the whole international order.
But please don’t just go on with your life and leave the politics to others. Those “others” you’d like to rely on haven’t been doing a very good job.
What I’ve chosen to do personally includes:
- Get and stay current in my own knowledge. That’s of course a prerequisite for everything else.
- Raise consciousness among my traditional audience. This post is an example.
- Educate my traditional audience. Some of you are American, well-versed in history and traditional civics. Some of you are American, but not so well-versed. Some of you are from a broad variety of other countries. The sweet spot of my target is the smart, rational, not-so-well-versed Americans. But I hope others are interested as well.
- Prepare for such time as nuanced policy analysis is again appropriate. In the past, I’ve tried to make thoughtful, balanced, compromise suggestions for handling thorny issues such as privacy/surveillance or network neutrality. In this time of crisis, people don’t care, and I don’t blame them at all. But hopefully this ill wind will pass, and serious policy-making will restart. When it does, we should be ready for it.
- Support my family in whatever they choose to do. It’s a small family, but it includes some stars, more articulate and/or politically experienced than I am.
Your choices will surely differ (and later on I will offer suggestions as to what those choices might be). But if you take only one thing from this post and its hopefully many sequels, please take this: Ignoring politics is no longer a rational choice.
This is my first politics/policy-related post since the start of the Trump (or Trump/Bannon) Administration. I’ll keep a running guide to others here, and in the comments below.
- The technology industry in particular is now up to its neck in politics. I gave quite a few examples to show why for tech folks there’s no escaping politics now.
- Some former congressional staffers put out a great guide to influencing your legislators. It’s focused on social justice and anti-discrimination kinds of issues, but can probably be applied more broadly, e.g. to Senator Feinstein’s (D-Cal) involvement in overseeing the intelligence community.
Crate.io and CrateDB basics include:
- Crate.io makes CrateDB.
- CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
- CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
- Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
- Crate.io says it has 22 employees and 5 paying customers.
- Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.
In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.
CrateDB’s not-just-relational story starts:
- A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
- … where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
- … except when they’re just BLOBs (Binary Large OBjects).
- There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
- There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.
Crate gave an example of data from >800 kinds of sensors being stored together in a single table. This leads to significant complexity in the FROM clauses. But querying the same data in a relational schema would be at least as complicated, and probably worse.
One key to understanding Crate’s architectural choices is to note that they’re willing to have different latency/consistency standards for:
- Writes and single-row look-ups.
- Aggregates and joins.
And so it makes sense that:
- Data is banged into CrateDB in a NoSQL-ish kind of way as it arrives, with RYW consistency.
- The indexes needed for SQL functionality are updated in microbatches as soon as possible thereafter. (Think 100 milliseconds as a base case.) Crate.io characterizes the consistency for this part as “eventual”.
CrateDB will never have real multi-statement transactions, but it has simpler levels of isolation that may be called “transactions” in some marketing contexts.
CrateDB technical highlights include:
- CrateDB records are stored as JSON documents. (Actually, I didn’t ask whether this was true JSON or rather something “JSON-like”.)
- In the purely relational case, the documents may be regarded as glorified text strings.
- I got the impression that BLOB storage was somewhat separate from the rest.
- CrateDB’s sharding story starts with consistent hashing.
- Shards are physical-only. CrateDB lacks the elasticity-friendly feature of there being many logical shards for each physical shard.
- However, you can change your shard count, and any future inserts will go into the new set of shards.
- In line with its two consistency models, CrateDB also has two indexing strategies.
- Single-row/primary-key lookups have a “forward lookup” index, whatever that is.
- Tables also have a columnar index.
- More complex queries and aggregations are commonly done straight against the columnar index, rather than the underlying data.
- CrateDB’s principal columnar indexing strategy sounds a lot like inverted-list, which in turn is a lot like standard text indexing.
- Specific datatypes — e.g. geospatial — can be indexed in different ways.
- The columnar index is shard-specific, and located at the same node as the shard.
- At least the hotter parts of the columnar index will commonly reside in memory. (I didn’t ask whether this was via straightforward caching or some more careful strategy.)
- While I didn’t ask about CrateDB’s replication model in detail, I gathered that:
- Data is written synchronously to all nodes. (That’s sort of implicit in RYW consistency anyway.)
- Common replication factors are either 1 or 3, depending on considerations such as the value of the data. But as is usual, some tables can be replicated across all nodes.
- Data can be read from all replicas, for obvious reasons of performance.
- Where relevant — e.g. the wire protocol or various SQL syntax specifics — CrateDB tends to emulate Postgres.
- The CrateDB stack includes Elasticsearch and Lucene, both of which make sense in connection with Crate’s text/document orientation.
Crate.io is proud of its distributed/parallel story.
- Any CrateDB node can plan a query. Necessary metadata for that is replicated across the cluster.
- Execution starts on a shard-by-shard basis. Data is sorted at each shard before being sent onward.
- Crate.io encourages you to run Spark and CrateDB on the same nodes.
- This is supported by parallel Spark-CrateDB integration of the obvious kind.
- Crate.io notes a happy synergy to this plan, in that Spark stresses CPU while CrateDB is commonly I/O-bound.
The CrateDB-Spark integration was the only support I could find for various marketing claims about combining analytics with data management.
Given how small and young Crate.io is, there are of course many missing features in CrateDB. In particular:
- A query can only reshuffle data once. Hence, CrateDB isn’t currently well-designed for queries that join more than 2 tables together.
- The only join strategy currently implemented is nested loop. Others are in the future.
- CrateDB has most of ANSI SQL 92, but little or nothing specific to SQL 99. In particular, SQL windowing is under development.
- Geo-distribution is still under development (even though most CrateDB data isn’t actually about people).
- I imagine CrateDB administrative tools are still rather primitive.
In any case, creating a robust DBMS is an expensive and time-consuming process. Crate has a long road ahead of it.
After a July visit to DataStax, I wrote
The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.
- Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
- Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.
That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.
- First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
- Second — and more narrowly — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.
My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views:
- Are logical rather than materialized. (At least at this time.)
- Have their permissions and so on set by the DBA.
- Are the sole thing the programmer writes against.
Besides the obvious benefits in development ease and security, MongoDB says that performance can be better as well.* This is of course a new feature, without a lot of adoption at this time. Even so, it seems likely that NoSQL doesn’t obsolete any part of the traditional DBA role.
*I didn’t actually ask what a naive programmer can do to trash performance that views can forestall, but … well, I was once a naive programmer myself.
Two trends that I think could make DBA’s lives even more interesting and challenging in the future are:
- The integration of quick data management into complex analytic processes. Here by “quick data management” I mean, for example, what you do in connection with a complex Hadoop or Spark (set of) job(s). Leaving the data management to a combination of magic and Python scripts doesn’t seem to respect how central data operations are to analytic tasks.
- The integration of data management and streaming. I should probably write about this point separately, but in any case — it seems that streaming stacks will increasingly look like over-caffeinated DBMS.
Bottom line: Database administration skills will be needed for a long time to come.
“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:
- A query layer with multiple ways to query and analyze data.
- A separate data storage layer in which you have a choice of data storage engines …
- … each of which has the same logical (JSON-based) data structure.
When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.
To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:
- In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
- When writing, the DML paradigm is unmixed — it’s just JSON.
Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.
The main ways to query data in MongoDB, to my knowledge, are:
- Native/JSON. Duh.
- MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
- More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
- I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
- MongoDB has been adding text search features for a few releases.
- MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes as a kind of text-oriented GroupBy.
- Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child). MongoDB declares that the “graph” box is thereby checked.
Three years ago, in an overview of layered and multi-DML architectures, I suggested:
- Layered DBMS and multimodel functionality fit well together.
- Both carried performance costs.
- In most cases, the costs could be affordable.
MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.
“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.
1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:
- General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
- Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
- Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
- Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
- Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.
2. In early 2011, I coined the phrase investigative analytics, about which I said three main things:
- It is meant to contrast with “operational analytics”.
- It is meant to conflate “several disciplines, namely”:
- Statistics, data mining, machine learning, and/or predictive analytics.
- The more research-oriented aspects of business intelligence tools.
- Analogous technologies as applied to non-tabular data types such as text or graph.
- A simple definition would be “Seeking (previously unknown) patterns in data.”
Generally, that has held up pretty well, although “exploratory” is the more widely used term. But the investigative/operational dichotomy obscures one key fact, which is the central point of this post: There’s a widespread need for very rapid data investigation.
- Network operations. This is my paradigmatic example.
- Data is zooming all over the place, in many formats and structures, among many kinds of devices. That’s log data, header data and payload data alike. Many kinds of problems can arise …
- … which operators want to diagnose and correct, in as few minutes as possible.
- Interfaces commonly include real-time business intelligence, some drilldown, and a lot of command-line options.
- I’ve written about various specifics, especially in connection with the vendors Splunk and Rocana.
- Security and anti-fraud. Infosec and cyberfraud, to a considerable extent, are just common problems in network operations. Much of the response is necessarily automated — but the bad guys are always trying to outwit your automation. If you think they may have succeeded, you want to figure that out very, very fast.
- Consumer promotion and engagement. Consumer marketers feel a great need for speed. Some of it is even genuine.
- If an online promotion is going badly (or particularly well), they can in theory react almost instantly. So they’d like to know almost instantly, perhaps via BI tools with great drilldown.
- The same is even truer in the case of social media eruptions and the like. Obviously, the tools here are heavily text-oriented.
- Call centers and even physical stores have some of the same aspects as internet consumer operations.
- Consumer internet backends, for e-commerce, publishing, gaming or whatever. These cases combine and in some cases integrate the previous three points. For example, if you get a really absurd-looking business result, that could be your first indication of network malfunctions or automated fraud.
- Industrial technology, such as factory operations, power/gas/water networks, vehicle fleets or oil rigs. Much as in IT networks, these contain a diversity of equipment — each now spewing its own logs — and have multiple possible modes of failure. More often than is the case in IT networks, you can recognize danger signs, then head off failure altogether via preventive maintenance. But when you can’t, it is crucial to identify the causes of failure fast.
- General IoT (Internet of Things) operation. This covers several of the examples above, as well as cases in which you sell a lot of devices, have them “phone home”, and labor to keep that whole multi-owner network working.
- National security. If I told you what I meant by this one, I’d have to … [redacted].
4. And then there’s the investment industry, which obviously needs very rapid analysis. When I was a stock analyst, I could be awakened by a phone call and told news that I would need to explain to 1000s of conference call listeners 20 minutes later. This was >30 years ago. The business moves yet faster today.
The investment industry has invested greatly in high-speed supporting technology for decades. That’s how Mike Bloomberg got so rich founding a vertical market tech business. But investment-oriented technology indeed remains a very vertical sector; little of it get more broadly applied.
I think the reason may be that investing is about guesswork, while other use cases call for more definitive answers. In particular:
- If you’re wrong 49.9% of the time in investing, you might still be a big winner.
- In high-frequency trading, speed is paramount; you have to be faster than your competitors. In speed/accuracy trade-offs, speed wins.
5. Of course, it’s possible to overstate these requirements. As in all real-time discussions, one needs to think hard about:
- How much speed is important in meeting users’ needs.
- How much additional speed, if any, is important in satisfying users’ desires.
But overall, I have little doubt that rapid analytics is a legitimate area for technology advancement and growth.
Then felt I like some watcher of the skies
When a new planet swims into his ken
— John Keats, “On First Looking Into Chapman’s Homer”
1. In June I wrote about why anomaly management is hard. Well, not only is it hard to do; it’s hard to talk about as well. One reason, I think, is that it’s hard to define what an anomaly is. And that’s a structural problem, not just a semantic one — if something is well enough understood to be easily described, then how much of an anomaly is it after all?
Artificial intelligence is famously hard to define for similar reasons.
“Anomaly management” and similar terms are not yet in the software marketing mainstream, and may never be. But naming aside, the actual subject matter is important.
2. Anomaly analysis is clearly at the heart of several sectors, including:
- IT operations
- Factory and other physical-plant operations
Each of those areas features one or both of the frameworks:
- Surprises are likely to be bad.
- Coincidences are likely to be suspicious.
So if you want to identify, understand, avert and/or remediate bad stuff, data anomalies are the first place to look.
3. The “insights” promised by many analytics vendors — especially those who sell to marketing departments — are also often heralded by anomalies. Already in the 1970s, Walmart observed that red clothing sold particularly well in Omaha, while orange flew off the shelves in Syracuse. And so, in large college towns, they stocked their stores to the gills with clothing in the colors of the local football team. They also noticed that fancy dresses for little girls sold especially well in Hispanic communities … specifically for girls at the age of First Communion.
4. The examples in the previous point may be characterized as noteworthy correlations that surely are reflecting actual causality. (The beer/diapers story would be another example, if only it were true.) Formally, the same is probably true of most actionable anomalies. So “anomalies” are fairly similar to — or at least overlap heavily with — “statistically surprising observations”.
And I do mean “statistically”. As per my Keats quote above, we have a classical model of sudden-shock discovery — an astronomer finding a new planet, a radar operator seeing a blip on a screen, etc. But Keats’ poem is 200 years old this month. In this century, there’s a lot more number-crunching involved.
Please note: It is certainly not the case that anomalies are necessarily found via statistical techniques. But however they’re actually found, they would at least in theory score as positives via various statistical tests.
5. There are quite a few steps to the anomaly-surfacing process, including but not limited to:
- Collecting the raw data in a timely manner.
- Identifying candidate signals (and differentiating them from noise).
- Communicating surprising signals to the most eager consumers (and letting them do their own analysis).
- Giving more tightly-curated information to a broader audience.
Hence many different kinds of vendor can have roles to play.
6. One vendor that has influenced my thinking about data anomalies is Nestlogic, an early-stage start-up with which I’m heavily involved. Here “heavily involved” includes:
- I own more stock in Nestlogic than I have in any other company of which I wasn’t the principal founder.
- I’m in close contact with founder/CEO David Gruzman.
- I’ve personally written much of Nestlogic’s website content.
Nestlogic’s claims include:
- For machine-generated data, anomalies are likely to be found in data segments, not individual records. (Here a “segment” might be all the data coming from a particular set of sources in a particular period of time.)
- The more general your approach to anomaly detection, the better, for at least three reasons:
- In adversarial use cases, the hacker/fraudster/terrorist/whatever might deliberately deviate from previous patterns, so as to evade detection by previously-established filters.
- When there are multiple things to discover, one anomaly can mask another, until it is detected and adjusted for.
- (This point isn’t specific to anomaly management) More general tools can mean that an enterprise has fewer different new tools to adopt.
- Anomalies boil down to surprising data profiles, so anomaly detection bears a slight resemblance to the data profiling approaches used in data quality, data integration and query optimization.
- Different anomaly management users need very different kinds of UI. Less technical ones may want clear, simple alerts, with a minimum of false positives. Others may use anomaly management as a jumping-off point for investigative analytics and/or human real-time operational control.
I find these claims persuasive enough to help Nestlogic with its marketing and fund-raising, and to cite them in my post here. Still, please understand that they are Nestlogic’s and David’s assertions, not my own.
1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:
- The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
- Software as a service, aka SaaS.
- Co-location in off-premises data centers, aka colo.
- On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.
Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.
This is a good example of Monash’s Laws of Commercial Semantics.
2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.
This fact now seems to be widely understood.
3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.
This is frequently a good reason for new – i.e. “greenfield” – apps to run in the cloud.
4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.
In connection to that point, it might be interesting to note:
- In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
- Other big users were hospitals and stockbrokers.
- The US intelligence agencies are building out their own shared, dedicated cloud.
5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.
In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.
6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.
7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:
- Does your application need to run close to your customers’ largest databases?
- Do your customers still avoid the public cloud?
If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.
8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:
- Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
- Departments don’t necessarily care about issues that give central IT pause.
- Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.
I discussed some of this in my recent post on vendor lock-in.
9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.
Those concerns seem to have been largely alleviated.
10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.
11. If you think about system requirements:
- There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
- Server-side computing, however, is more homogenous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.
I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.
So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.
Otherwise, clusters will probably be organized by database, in the most expansive sense of term. For example, there could be separate clusters for:
- Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
- Relational data warehousing, whether in an analytic RDBMS or otherwise.
- Log files, perhaps managed in Hadoop or Splunk.
- Your website and other internet back-ends, perhaps running over NoSQL data stores.
- Text documents managed by some kind of search engine.
- Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)
Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.
I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:
- Yes, interactive computer response is crucial.
- Into the 1980s, many apps were batch-only. Demand for such apps dried up.
- Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
- Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
- What matters in most cases is human users’ perceptions of speed.
- Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.
A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:
- It respects the obvious point that different use cases require different levels of data freshness.
- It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
- It covers cases of both human and automated decision-making.
Straightforward applications of this principle include:
- In “buying race” situations such as high-frequency trading, data needs to be as fresh as the other guy’s, and preferably even fresher.
- Supply-chain systems generally need data that’s fresh to within a few hours; in some cases, sub-hour freshness is needed.
- That’s a good standard for many desktop business intelligence scenarios as well.
- Equipment-monitoring systems’ need for data freshness depends on how quickly catastrophic or cascading failures can occur or be averted.
- Different specific cases call for wildly different levels of data freshness.
- When equipment is well-instrumented with sensors, freshness requirements can be easy to meet.
E-commerce and other internet interaction scenarios can be more complicated, but it seems safe to say:
- Recommenders/personalizers should take into account information from the current session.
- Try very hard to give customers correct information about merchandise availability or pricing.
In meeting freshness requirements, multiple technical challenges can come into play.
- Traditional batch aggregation is too slow for some analytic needs. That’s a core reason for having an analytic RDBMS.
- Traditional data integration/movement pipelines can also be too slow. That’s a basis for short-request-capable data stores to also capture some analytic workloads. E.g., this is central to MemSQL’s pitch, and to some NoSQL applications as well.
- Scoring models at interactive speeds is often easy. Retraining them quickly is much harder, and at this point only rarely done.
- OLTP (OnLine Transaction Processing) guarantees adequate data freshness …
- … except in scenarios where the transactions themselves are too slow. Questionably-consistent systems — commonly NoSQL — can usually meet performance requirements, but might have issues with the freshness of accurate
- Older generations of streaming technology disappointed. The current generation is still maturing.
Based on all that, what technology investments should you be making, in order to meet “real-time” needs? My answers start:
- Customer communications, online or telephonic as the case may be, should be based on accurate data. In particular:
- If your OLTP data is somehow siloed away from your phone support data, fix that immediately, if not sooner. (Fixing it 5-15 years ago would be ideal.)
- If your eventual consistency is so eventual that customers notice, fix it ASAP.
- If you invest in predictive analytics/machine learning to support your recommenders/personalizers, then your models should at least be scored on fresh data.
- If your models don’t support that, reformulate them.
- If your data pipeline doesn’t support that, rebuild it.
- Actual high-speed retraining of models isn’t an immediate need. But if you’re going to have to transition to that anyway, consider doing do early and getting it over with.
- Your BI should have great drilldown and exploration. Find the most active users of such functionality in your enterprise, even if — especially if! — they built some kind of departmental analytic system outside the enterprise mainstream. Ask them what, if anything, they need that they don’t have. Respond accordingly.
- Whatever expensive and complex equipment you have, slather it with sensors. Spend a bit of research effort on seeing whether the resulting sensor logs can be made useful.
- Please note that this applies both to vehicles and to fixed objects (e.g. buildings, pipelines) as well as traditional industrial machinery.
- It also applies to any products you make which draw electric power.
So yes — I think “real-time” has finally become pretty real.
I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:
- Many of the independent vendors were swooped up by acquisition.
- None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
- Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
- Impala and other Hadoop-based alternatives are technology options.
- Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.
Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.
My answer, in a nutshell, is:
Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud – are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.
To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”
- If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
- If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
- If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.
Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.
My technical observations around these trends include:
- Advanced analytics commonly require flexible, iterative processing.
- Spark is much better at such processing than earlier Hadoop …
- … which in turn is better than anything that’s been built into an analytic RDBMS.
- Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
- Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.
And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.
- They don’t force you into using SQL or everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
- Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.
But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:
- They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
- One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
- Other integrations, as noted above, can also be weak.
Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.
Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.
Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.
data Artisans and Flink basics start:
- Flink is an Apache project sponsored by the Berlin-based company data Artisans.
- Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
- Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.
Like many open source projects, Flink seems to have been partly inspired by a Google paper.
To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:
- The first line of Flink code dates back to 2010.
- data Artisans and the Flink open source project both started in 2014.
- When I met him in late June, Kostas told me that Data Artisans had raised $7 million and had 15 employees.
- Flink’s current SQL support is very minor.
Per Kostas, about half of Flink committers are at Data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).
The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:
- The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
- In this case, the Flink folks feel that modularity supports efficiency.
- Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
- Windowing and so on operate separately from basic “transport” as well.
- The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
- Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- Alot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.
The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.
The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:
- “Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
- Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)
We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.
Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.
As for Flink use cases, they’re about what you’d expect:
- Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
- Plenty of stream processing.
But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.
Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:
- What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
- Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
- Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.
Ali also walked me through customer use cases and adoption in wonderful detail. In general:
- A large majority of Databricks customers have machine learning use cases.
- Predicting and preventing user/customer churn is a huge issue across multiple market sectors.
The story on those sectors, per Ali, is:
- First, Databricks penetrated ad-tech, for use cases such as ad selection.
- Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
- Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
- Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
- Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.
And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first that “breaks”/changes the APIs; hence the release number. It is now the case that:
- There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/DataSets. In this API …
- … everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
- Based on this, Ali thinks Spark 2.0 is now really a streaming engine.
Other tidbits included:
- Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
- They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
- Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
- In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.
I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.
On the customer side:
- DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
- This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.
Customers in large numbers want cloud capabilities, as a potential future if not a current need.
One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.
Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are:
- Graph capabilities.
- Cassandra 3.0, which includes a complete storage engine rewrite.
- Tiered storage/ILM (Information Lifecycle Management).
- Policy-based replication.
Some of that terminology is mine, but perhaps my clients at DataStax will adopt it too.
We didn’t go into as much technical detail as I ordinarily might, but a few notes on that tiered storage/ILM bit are:
- It’s a way to have some storage that’s more expensive (e.g. flash) and some that’s cheaper (e.g. spinning disk). Duh.
- Since Cassandra has a strong time-series orientation, it’s easy to imagine how those policies might be specified.
- Technologically, this is tightly integrated with Cassandra’s compaction strategy.
DataStax Enterprise 5 also introduced policy-based replication features, not all of which are in open source Cassandra. Data sovereignty/geo-compliance is improved, which is of particular importance in financial services. There’s also hub/spoke replication now, which seems to be of particular value in intermittently-connected use cases. DataStax said the motivating use case in that area was oilfield operations, where presumably there are Cassandra-capable servers at all ends of the wide-area network.
During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:
- Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
- A lot of this is via a focus on simplified APIs, based on
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
- There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
- SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-S queries.
The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:
- Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
- Databricks has been working on security, and even on the associated certifications.
Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.
- Some of the reason is that SparkSQL is too immature to be great.
- Some is annoyance that Databricks isn’t putting everything it has into open source.
- Some is that everything has its architectural trade-offs.
To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.
I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:
- Spark is indeed the replacement for Hadoop MapReduce.
- Spark is becoming the default platform for machine learning.
- SparkSQL (nee’ Shark) is puttering along predictably.
- Databricks reports good success in its core business of cloud-based machine learning support.
- Spark Streaming has strong adoption, but its position is at risk.
- Databricks, the original authority on Spark, is not keeping a tight grip on that role.
I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.
Spark is the replacement for Hadoop MapReduce.
This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.
The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy:
- Data-transformation-only use cases are important, but they don’t dominate.
- Most other use cases typically have a data transformation element as well …
- … which has to be started before any other work can be done.
And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.
Spark is becoming the default platform for machine learning.
Largely, this is a corollary of:
- The previous point.
- The fact that Spark was originally designed with machine learning as its principal use case.
To do machine learning you need two things in your software:
- A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
- Support for machine learning workflows. That’s where Spark evidently stands alone.
And thus I have conversations like:
- “Are you doing anything with Spark?”
- “We’ve gotten more serious about machine learning, so yes.”
SparkSQL (nee’ Shark) is puttering along.
SparkSQL is pretty much following the Hive trajectory.
- Useful from Day One as an adjunct to other kinds of processing.
- A tease and occasionally useful as a SQL engine for its own sake, but really not very good, pending years to mature.
Databricks reports good success in its core business of cloud-based machine learning support.
Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:
- As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
- Databricks customers typically already have their data in the Amazon cloud.
- Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in the segments:
- Financial services (especially but not only insurance)
- Health care/pharma
- The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.
Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.
Spark Streaming’s long-term success is not assured.
To a first approximation, things look good for Spark Streaming.
- Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
- The “traditional” alternatives of Storm and Samza are pretty much done.
- Newer alternatives from Twitter, Confluent and Flink aren’t yet established.
- Cloudera is a big fan of Spark Streaming.
- Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
- Cool new Spark Streaming technology is coming out.
But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?
Databricks is not keeping a tight grip on Spark leadership.
- Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
- Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed it would be a big revenue opportunity for Databricks.
At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:
- If you want the story on where Spark is going, you do what I did — you ask Databricks.
- Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
- Databricks employs ~1/3 of Spark committers.
- Databricks organizes the Spark Summit.
But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.
Starting with my introduction to Spark, previous overview posts include those in: