
DBMS2

Choices in data management and analysis

Where the innovation is

Mon, 2015-01-19 02:27

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. :) But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

  • Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
  • Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
  • Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

  • Product maturity is a huge issue for all the above, and will remain one for years.
  • Hadoop and Spark show that application execution engines:
    • Have a lot of innovation ahead of them.
    • Are tightly entwined with data management, and with data movement as well.
  • Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
  • There are many issues in storage that can affect data technologies as well, including but not limited to:
    • Solid-state (flash or post-flash) vs. spinning disk.
    • Networked vs. direct-attached.
    • Virtualized vs. identifiable-physical.
    • Object/file/block.
  • Graph analytics and data management are still confused.

3. As I suggested last year, data transformation is an important area for innovation. 

  • MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
  • The smart data preparation crowd is deservedly getting attention.
  • The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.
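To make the transformation point concrete, here is a minimal single-process sketch of the map/shuffle/reduce pattern in Python. It is not Hadoop's API; the function names and the sample log lines are invented purely for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical raw web-log lines: date, method, path, status.
raw_log = [
    "2015-01-19 GET /home 200",
    "2015-01-19 GET /missing 404",
    "2015-01-20 GET /home 200",
]


def status_mapper(line):
    """Transform one raw line into a (status, 1) pair."""
    _date, _method, _path, status = line.split()
    yield (status, 1)


counts = reduce_phase(
    shuffle(map_phase(raw_log, status_mapper)),
    lambda _key, values: sum(values),
)
```

The mapper is where the actual data transformation happens; swapping in a different mapper or reducer changes the output shape without touching the plumbing.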

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling.

Beyond that:

  • Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
  • I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
  • While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

  • “Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
  • Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
  • … the cost of such process isolation may need to be borne.
  • Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.
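As a rough illustration of the fencing idea, and of why it costs something, here is a small Python sketch that runs a piece of work in a separate operating-system process with a time limit. The helper name run_fenced is made up for this example, and this is only process isolation in the loosest sense; it is not a real security sandbox.

```python
import subprocess
import sys
from typing import Tuple


def run_fenced(code: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run a Python snippet in a child process so that a crash or hang
    in the snippet cannot take the calling process down with it."""
    try:
        proc = subprocess.run(
            # -I starts the child in isolated mode (ignores environment
            # variables and the user site directory).
            [sys.executable, "-I", "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed when the timeout fires.
        return False, "timed out"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


# A well-behaved task, and one that fails without affecting the caller.
ok, output = run_fenced("print(6 * 7)")
failed_ok, failed_msg = run_fenced("raise SystemExit(3)")
```

Every call pays for starting a new interpreter, which is a small-scale version of the cost described above: the isolation is real, but so is the overhead.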

More specifically,

  • It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
  • It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
  • On servers, we may in many cases be talking about lightweight virtual machines.

And to be clear:

  • What I’m talking about would do little to help the authentication/authorization aspects of security, but …
  • … those will never be perfect in any case (because they depend upon fallible humans) …
  • … which is exactly why other forms of security will always be needed.

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

  • Application development technology, languages, frameworks, etc.
  • The integration of analytics into old-style operational apps.
  • The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.

Related links

  • In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.
  • Edit: I followed up on the last point with a post about soft robots.

Categories: Other

Migration

Sat, 2015-01-10 00:45

There is much confusion about migration, by which I mean applications or investment being moved from one “platform” technology — hardware, operating system, DBMS, Hadoop, appliance, cluster, cloud, etc. — to another. Let’s sort some of that out. For starters:

  • There are several fundamentally different kinds of “migration”.
    • You can re-host an existing application.
    • You can replace an existing application with another one that does similar (and hopefully also new) things. This new application may be on a different platform than the old one.
    • You can build or buy a wholly new application.
    • There’s also the in-between case in which you extend an old application with significant new capabilities, which may not be well suited to the existing platform.
  • Motives for migration generally fall into a few buckets. The main ones are:
    • You want to use a new app, and it only runs on certain platforms.
    • The new platform may be cheaper to buy, rent or lease.
    • The new platform may have lower operating costs in other ways, such as administration.
    • Your employees may like the new platform’s “cool” aspect. (If the employee is sufficiently high-ranking, substitute “strategic” for “cool”.)
  • Different apps may be much easier or harder to re-host. At two extremes:
    • It can be forbiddingly difficult to re-host an OLTP (OnLine Transaction Processing) app that is heavily tuned, tightly integrated with your other apps, and built using your DBMS vendor’s proprietary stored-procedure language.
    • It might be trivial to migrate a few long-running SQL queries to a new engine, and pretty easy to handle the data connectivity part of the move as well.
  • Certain organizations, usually packaged software companies, design portability into their products from the get-go, with at least partial success.

I mixed together true migration and new-app platforms in a post last year about DBMS architecture choices, when I wrote:

  • Sometimes something isn’t broken, and doesn’t need fixing.
  • Sometimes something is broken, and still doesn’t need fixing. Legacy decisions that you now regret may not be worth the trouble to change.
  • Sometimes — especially but not only at smaller enterprises — choices are made for you. If you operate on SaaS, plus perhaps some generic web hosting technology, the whole DBMS discussion may be moot.

In particular, migration away from legacy DBMS raises many issues:

  • Feature incompatibility (especially in stored-procedure languages and/or other vendor-specific SQL).
  • Your staff’s programming and administrative skill-sets.
  • Your investment in DBMS-related tools.
  • Your supply of hockey tickets from the vendor’s salesman.

Except for the first, those concerns can apply to new applications as well. So if you’re going to use something other than your enterprise-standard RDBMS, you need a good reason.

I then argued that such reasons are likely to exist for NoSQL DBMS, but less commonly for NewSQL. My views on that haven’t changed in the interim.

More generally, my pro-con thoughts on migration start:

  • Pure application re-hosting is rarely worthwhile. Migration risks and costs outweigh the benefits, except in a few cases, one of which is the migration of ELT (Extract/Load/Transform) from expensive analytic RDBMS to Hadoop.
  • Moving from in-house to co-located data centers can offer straightforward cost savings, because it’s not accompanied by much in the way of programming costs, risks, or delays. Hence Rackspace’s refocus on colo at the expense of cloud. (But it can be hard on your data center employees.)
  • Moving to an in-house cluster can be straightforward, and is common. VMware is the most famous such example. Exadata consolidation is another.
  • Much of new application/new functionality development is in areas where application lifespans are short — e.g. analytics, or customer-facing internet. Platform changes are then more practical as well.
  • New apps or app functionality often should and do go where the data already is. This is especially true in the case of cloud/colo/on-premises decisions. Whether it’s important in a single location may depend upon the challenges of data integration.
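For readers less familiar with the ELT pattern mentioned above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever engine does the work. Raw data is loaded untouched into a staging table, and the transformation then runs inside the engine. The table names, column names and sample rows are all hypothetical.

```python
import sqlite3


def run_elt(raw_rows):
    """Load raw rows as-is, then transform them with SQL inside the engine."""
    conn = sqlite3.connect(":memory:")

    # Extract/Load: land the data without cleaning it first.
    conn.execute("CREATE TABLE staging_events (raw_ts TEXT, raw_amount TEXT)")
    conn.executemany("INSERT INTO staging_events VALUES (?, ?)", raw_rows)

    # Transform: trim, cast and aggregate in the engine itself.
    conn.execute(
        """
        CREATE TABLE daily_totals AS
        SELECT substr(raw_ts, 1, 10) AS day,
               SUM(CAST(TRIM(raw_amount) AS REAL)) AS total
        FROM staging_events
        GROUP BY day
        """
    )
    result = conn.execute(
        "SELECT day, total FROM daily_totals ORDER BY day"
    ).fetchall()
    conn.close()
    return result


daily = run_elt([
    ("2015-01-10 00:45", " 12.50 "),
    ("2015-01-10 01:10", "7"),
    ("2015-01-11 09:00", "3.25"),
])
```

The transform step is the part that burns engine capacity, which is why moving it off an expensive analytic RDBMS can be one of the few re-hosting cases where the benefits outweigh the risks.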

I’m also often asked for predictions about migration. In light of the above, I’d say:

  • Successful DBMS aren’t going away.
    • OLTP workloads can usually be lost only as fast as applications are replaced, and that tends to be a slow process. Claims to the contrary are rarely persuasive.
    • Analytic DBMS can lose workloads more easily — but their remaining workloads often grow quickly, creating an offset.
  • A large fraction of new apps are up for grabs. Analytic applications go well on new data platforms. So do internet apps of many kinds. The underlying data for these apps often starts out in the cloud. SaaS (Software as a Service) is coming on strong. Etc.
  • I stand by my previous view that most computing will wind up on appliances, clusters or clouds.
  • New relational DBMS will be slow to capture old workloads, even if they are slathered with in-memory fairy dust.

And for a final prediction — discussion of migration isn’t going to go away either. :)

Categories: Other

Notes on machine-generated data, year-end 2014

Wed, 2014-12-31 21:49

Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.

1. There are many kinds of machine-generated data. Important categories include:

  • Web, network and other IT logs.
  • Game and mobile app event data.
  • CDRs (telecom Call Detail Records).
  • “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
  • Sensor network output (for example from a pipeline or other utility network).
  • Vehicle telemetry.
  • Health care data, in hospitals.
  • Digital health data from consumer devices.
  • Images from public-safety camera networks.
  • Stock tickers (if you regard them as being machine-generated, which I do).

That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.

2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly:

  • Government snooping on the contents of communications.
  • Communication traffic analysis.
  • Photos and videos (airport scanners, public cameras, etc.)
  • Commercial ad targeting.
  • Traditional medical records.

Other areas, however, continue to be overlooked, with the two biggies in my opinion being:

  • The potential to apply marketing-like psychographic analysis in other areas, such as hiring decisions or criminal justice.
  • The ability to track people’s movements in great detail, which will increase greatly yet again as the market for consumer digital health matures (and some think this will happen soon).

My core arguments about privacy and surveillance seem as valid as ever.

3. The natural database structures for machine-generated data vary wildly. Weblog data structure is often remarkably complex. Log data from complex organizations (e.g. IT shops or hospitals) might comprise many streams, each with a different (even if individually simple) organization. But in the majority of my example categories, record structure is very simple and repeatable. Thus, there are many kinds of machine-generated data that can, at least in principle, be handled well by a relational DBMS …

4. … at least to some extent. In a further complication, much machine-generated data arrives as a kind of time series. Many (but not all) time series call for a strong commitment to event-series styles of analytics. Event series analytics are a challenge for relational DBMS, but Vertica and others have tried to step up with various kinds of temporal predicates or datatypes. Event series are also a challenge for business intelligence vendors, and a potentially significant driver for competitive rebalancing in the BI market.
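One small example of what event-series analysis looks like in practice is sessionization: splitting each user's events into sessions wherever the gap between consecutive events is too long. Ordering-dependent logic like this is what plain relational queries tend to express awkwardly. The sketch below is ordinary Python, not any vendor's feature, and the 30-minute threshold is just an assumed default.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, max_gap_s=1800):
    """Group (user_id, timestamp_seconds) events into per-user sessions.

    A new session starts whenever the gap since the user's previous
    event exceeds max_gap_s.
    """
    sessions = {}
    # Sort by user, then by time, so each user's events are contiguous.
    ordered = sorted(events, key=itemgetter(0, 1))
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        user_sessions = []
        for _user, ts in user_events:
            if user_sessions and ts - user_sessions[-1][-1] <= max_gap_s:
                user_sessions[-1].append(ts)  # continue the current session
            else:
                user_sessions.append([ts])  # gap too long: start a new one
        sessions[user] = user_sessions
    return sessions


# Hypothetical click events, deliberately out of order.
clicks = [("a", 600), ("b", 100), ("a", 0), ("a", 4000)]
by_user = sessionize(clicks)
```

The key point is that each row's meaning depends on the row before it, which is what temporal predicates and similar extensions try to make expressible inside the DBMS.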

5. Event series even aside, I wish I understood more about business intelligence for non-tabular data. I plan to fix that.

6. Streaming and memory-centric processing are closely related subjects. What I wrote recently about them for Hadoop still applies: Spark, Kafka, etc. is still the base streaming case going forward; Storm is still around as an alternative; Tachyon or something like it will change the game somewhat. But not all streaming machine-generated data needs to land in Hadoop at all. As noted above, relational data stores (especially memory-centric ones) can suffice. So can NoSQL. So can Splunk.

Not all these considerations are important in all use cases. For one thing, latency requirements vary greatly. For example:

  • High-frequency trading is an extreme race; microseconds matter.
  • Internet interaction applications increasingly require data freshness to the last click or other user action. Computational latency requirements can go down to the single-digit milliseconds. Real-time ad auctions have a race aspect that may drive latency lower yet.
  • Minute-plus response can be fine for individual remote systems. Sometimes they ping home more rarely than that.

There’s also still plenty of true batch mode, but — and I say this as part of a conversation that’s been underway for over 40 years — interactive computing is preferable whenever feasible.

7. My views about predictive analytics are still somewhat confused. For starters:

  • The math and technology of predictive modeling both still seem pretty simple …
  • … but sometimes achieve mind-blowing results even so.
  • There’s a lot of recent innovation in predictive modeling, but adoption of the innovative stuff is still fairly tepid.
  • Adoption of the simple stuff is strong in certain market sectors, especially ones connected to customer understanding, such as marketing or anti-fraud.

So I’ll mainly just link to some of my past posts on the subject, and otherwise leave discussion of predictive analytics to another day.

Finally, back in 2011 I tried to broadly categorize analytics use cases. Based on that and also on some points I just raised above, I’d say that a ripe area for breakthroughs is problem and anomaly detection and diagnosis, specifically for machines and physical installations, rather than in the marketing/fraud/credit score areas that are already going strong. That’s an old discipline; the concept of statistical process control dates back before World War II. Perhaps such breakthroughs are already underway; the Conviva retraining example listed above is certainly imaginative. But I’d like to see a lot more in the area.
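For anybody who hasn't run into statistical process control, the classic version is very simple: estimate a center line and spread from a baseline period, then flag readings that fall outside the center plus or minus a few standard deviations. Here is a minimal Python sketch; the function names and sensor readings are invented, and the 3-sigma default is just the conventional choice.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits at k sample standard deviations
    around the baseline mean."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def out_of_control(readings, limits):
    """Return (index, value) for every reading outside the limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(readings) if x < lower or x > upper]


# Hypothetical sensor readings from a period known to be healthy.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9]
limits = control_limits(baseline)

# New readings to check against those limits.
flagged = out_of_control([10.0, 10.3, 11.2, 9.95, 9.1], limits)
```

That this fits in a dozen lines is rather the point: the baseline technique is old and simple, so the room for breakthroughs lies in diagnosis and in handling many interacting signals.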

Even more important, of course, could be some kind of revolution in predictive modeling for medicine.

Categories: Other