
“Innovation in Managing the Chaos of Everyday Project Management” is now on YouTube

If you missed Fishbowl’s recent webinar on our new Enterprise Information Portal for Project Management, you can now view a recording of it on YouTube.


“Innovation in Managing the Chaos of Everyday Project Management” discusses our strategy for leveraging the content management and collaboration features of Oracle WebCenter to enable project-centric organizations to build and deploy a project management portal. The solution was designed especially for groups like E&C firms and oil and gas companies, which need multiple applications combined into one portal for simple access.

If you’d like to learn more about the Enterprise Information Portal for Project Management, visit our website or email our sales team at sales@fishbowlsolutions.com.


Categories: Fusion Middleware, Other

WibiData’s approach to predictive modeling and experimentation

DBMS2 - Tue, 2014-12-16 06:29

A conversation I have too often with vendors goes something like:

  • “That confidential thing you told me is interesting, and wouldn’t harm you if revealed; probably quite the contrary.”
  • “Well, I guess we could let you mention a small subset of it.”
  • “I’m sorry, that’s not enough to make for an interesting post.”

That was the genesis of some tidbits I recently dropped about WibiData and predictive modeling, especially but not only in the area of experimentation. However, Wibi just reversed course and said it would be OK for me to tell more or less the full story, as long as I note that we’re talking about something that’s still in beta test, with all the limitations (to the product and my information alike) that beta implies.

As you may recall:

With that as background, WibiData’s approach to predictive modeling as of its next release will go something like this:

  • There is still a strong element of classical modeling by data scientists/statisticians, with the models re-scored in batch, perhaps nightly.
  • But of course at least some scoring should be done in as close to real time as possible, to accommodate fresh data (see the sketch after this list) such as:
    • User interactions earlier in today’s session.
    • Technology for today’s session (device, connection speed, etc.)
    • Today’s weather.
  • WibiData Express is/incorporates a Scala-based language for modeling and query.
  • WibiData believes Express plus a small algorithm library gives better results than more mature modeling libraries.
    • There is some confirming evidence of this …
    • … but WibiData’s customers have by no means switched over yet to doing the bulk of their modeling in Wibi.
  • WibiData will allow line-of-business folks to experiment with augmentations to the base models.
  • Supporting technology for predictive experimentation in WibiData will include:
    • Automated multi-armed bandit testing (in previous versions even A/B testing was manual).
    • A facility for allowing fairly arbitrary code to be included in otherwise conventional model-scoring algorithms, where the conventional scoring models can come:
      • Straight from WibiData Express.
      • Via PMML (Predictive Modeling Markup Language) generated by other modeling tools.
    • An appropriate user interface for the line-of-business folks to do certain kinds of injecting.

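To make the fresh-data scoring idea above concrete, here is a minimal sketch in Scala. The feature names, weights, and the simple linear combination are my own illustration under stated assumptions, not WibiData's actual scoring code.

```scala
// Illustrative only: combine a nightly batch model score with fresh,
// in-session features. Names and weights are invented for this sketch.
case class SessionContext(
  clicksThisSession: Int,   // user interactions earlier in today's session
  connectionMbps: Double,   // technology for today's session
  isRaining: Boolean        // today's weather
)

object RealTimeScoring {
  // Weights a nightly batch job might have produced for the fresh features.
  val weights: Map[String, Double] = Map(
    "batchScore"        -> 1.0,
    "clicksThisSession" -> 0.05,
    "connectionMbps"    -> 0.01,
    "isRaining"         -> -0.2
  )

  // The expensive batch score is looked up once; the cheap fresh features
  // are folded in at request time, as close to real time as possible.
  def score(batchScore: Double, ctx: SessionContext): Double = {
    val features = Map(
      "batchScore"        -> batchScore,
      "clicksThisSession" -> ctx.clicksThisSession.toDouble,
      "connectionMbps"    -> ctx.connectionMbps,
      "isRaining"         -> (if (ctx.isRaining) 1.0 else 0.0)
    )
    features.map { case (name, value) => weights(name) * value }.sum
  }
}

// Example: a returning user on a slow connection during a rainstorm.
// RealTimeScoring.score(0.62, SessionContext(3, 1.5, isRaining = true))
```
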
Let’s talk more about predictive experimentation. WibiData’s paradigm for that is:

  • Models are worked out in the usual way.
  • Businesspeople have reasons for tweaking the choices the models would otherwise dictate.
  • They enter those tweaks as rules.
  • The resulting combination of models plus rules is executed and hence tested.

If those reasons for tweaking are in the form of hypotheses, then the experiment is a test of those hypotheses. However, WibiData has no provision at this time to automagically incorporate successful tweaks back into the base model.
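
Here is a minimal sketch, in Scala, of what a models-plus-rules experiment driven by a multi-armed bandit might look like. The epsilon-greedy strategy, the "hurry" rule, and every name below are my illustration of the general technique, not WibiData's implementation.

```scala
import scala.util.Random

// Illustrative epsilon-greedy bandit: allocate traffic between the base
// model and a rule-tweaked variant, learning from observed rewards
// (e.g. conversions). Thompson sampling would work similarly.
class EpsilonGreedyBandit(arms: Seq[String], epsilon: Double = 0.1) {
  private val rng     = new Random()
  private val pulls   = scala.collection.mutable.Map(arms.map(_ -> 0): _*)
  private val rewards = scala.collection.mutable.Map(arms.map(_ -> 0.0): _*)

  // Mostly serve the best-performing arm, occasionally explore a random one.
  def choose(): String =
    if (rng.nextDouble() < epsilon) arms(rng.nextInt(arms.size))
    else arms.maxBy(a => if (pulls(a) == 0) Double.MaxValue
                         else rewards(a) / pulls(a))

  // Record the outcome (1.0 = converted, 0.0 = did not).
  def update(arm: String, reward: Double): Unit = {
    pulls(arm) += 1
    rewards(arm) += reward
  }
}

// The "rule" is a business person's tweak layered on the base model's choice,
// e.g. "shoppers are in a hurry in the morning commute window".
def applyHurryRule(recommendation: String, hourOfDay: Int): String =
  if (hourOfDay >= 7 && hourOfDay <= 9) "streamlined-" + recommendation
  else recommendation

// Per request: let the bandit decide whether to serve the base model's output
// or the rule-tweaked variant, then feed the observed outcome back in.
// val bandit = new EpsilonGreedyBandit(Seq("base", "ruleTweaked"))
// val arm = bandit.choose()
// ... serve the page, observe whether the shopper converted ...
// bandit.update(arm, reward = 1.0)
```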

What might those hypotheses be like? It’s a little tough to say, because I don’t know in fine detail what is already captured in the usual modeling process. WibiData gave me only one real-life example, in which somebody hypothesized that shoppers would be in more of a hurry at some times of day than others, and hence would want more streamlined experiences when they could spare less time. Tests confirmed that was correct.

That said, I did grow up around retailing, and so I’ll add:

  • Way back in the 1970s, Wal-Mart figured out that in large college towns, clothing in the football team’s colors was wildly popular. I’d hypothesize such a rule at any vendor selling clothing suitable for being worn in stadiums.
  • A news event, blockbuster movie or whatever might trigger a sudden change in/addition to fashion. An alert merchant might guess that before the models pick it up. Even better, she might guess which psychographic groups among her customers were most likely to be paying attention.
  • Similarly, if a news event caused a sudden shift in buyers’ optimism/pessimism/fear of disaster, I’d test a response to that immediately.

Finally, data scientists seem to still be a few years away from neatly solving the problem of multiple shopping personas — are you shopping in your business capacity, or for yourself, or for a gift for somebody else (and what can we infer about that person)? Experimentation could help fill the gap.

Categories: Other

Notes and links, December 12, 2014

DBMS2 - Fri, 2014-12-12 05:05

1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.

For starters:

  • The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
  • If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.

I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:

  • Suppose we have lots of logs about lots of things.* Machine learning can help:
    • Notice what’s an anomaly.
    • Group* together things that seem to be experiencing similar anomalies.
  • That can inform a BI-plus interface for a human to figure out what is happening.

Makes sense to me.

* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
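
A minimal sketch of that flow, in Scala, under my own simplifying assumptions (per-metric z-scores, grouping hosts by which metrics look anomalous); it illustrates the idea, not ScalingData's actual pipeline.

```scala
// One series of recent values for one metric on one host (e.g. from logs).
case class MetricSeries(host: String, metric: String, values: Seq[Double])

// Flag a series whose latest value is far from its own recent history.
def isAnomalous(s: MetricSeries, zThreshold: Double = 3.0): Boolean =
  if (s.values.size < 2) false
  else {
    val history = s.values.init
    val latest  = s.values.last
    val mean    = history.sum / history.size
    val stdDev  = math.sqrt(history.map(v => math.pow(v - mean, 2)).sum / history.size)
    stdDev > 0 && math.abs(latest - mean) / stdDev > zThreshold
  }

// Group hosts by their "anomaly signature" (the set of metrics that look off),
// so a person can inspect similar-looking problems together in a BI-plus view.
def groupBySignature(series: Seq[MetricSeries]): Map[Set[String], Seq[String]] =
  series.filter(s => isAnomalous(s))
        .groupBy(_.host)
        .map { case (host, anomalous) => host -> anomalous.map(_.metric).toSet }
        .toSeq
        .groupBy(_._2)
        .map { case (signature, hosts) => signature -> hosts.map(_._1) }
```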

Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.

2. Discussion of graph DBMS can get confusing. For example:

  • Use cases run the gamut from short-request to highly analytic; no graph DBMS is well-suited for all graph use cases.
  • Graph DBMS have huge problems scaling, because graphs are very hard to partition usefully; hence some of the more analytic use cases may not benefit from a graph DBMS at all.
  • The term “graph” has meanings in computer science that have little to do with the problems graph DBMS try to solve, notably directed acyclic graphs for program execution, which famously are at the heart of both Spark and Tez.
  • My clients at Neo Technology/Neo4j call one of their major use cases MDM (Master Data Management), without getting much acknowledgement of that from the mainstream MDM community.

I mention this in part because that “MDM” use case actually has some merit. The idea is that hierarchies such as organization charts, product hierarchies and so on often aren’t actually strict hierarchies. And even when they are, they’re usually strict only at specific points in time; if you care about their past state as well as their present one, a hierarchical model might have trouble describing them. Thus, LDAP (Lightweight Directory Access Protocol) engines may not be an ideal way to manage and reference such “hierarchies”; a graph DBMS might do better.
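
As a small illustration of the "not quite a hierarchy" point (my own sketch, not Neo4j's data model): if reporting edges carry validity dates and a node may have more than one parent at once, the structure is naturally a graph rather than a tree.

```scala
import java.time.LocalDate

// An org-chart edge with a validity window; an open-ended edge has until = None.
case class ReportsTo(employee: String, manager: String,
                     from: LocalDate, until: Option[LocalDate])

val edges = Seq(
  ReportsTo("Ana", "VP of Finance",  LocalDate.of(2013, 1, 1), None),
  ReportsTo("Ana", "Project Lead X", LocalDate.of(2014, 6, 1),
            Some(LocalDate.of(2014, 12, 31)))   // a time-bounded dotted line
)

// "Whom did Ana report to on a given date?" can legitimately return two names,
// which a strict hierarchy (or a point-in-time LDAP tree) struggles to express.
def managersOf(employee: String, on: LocalDate): Seq[String] =
  edges.collect {
    case e if e.employee == employee &&
              !on.isBefore(e.from) &&
              e.until.forall(u => !on.isAfter(u)) => e.manager
  }
```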

3. There is a surprising degree of controversy among predictive modelers as to whether more data yields better results. Besides, the most common predictive modeling stacks have difficulty scaling. And so it is common to model against samples of a data set rather than the whole thing.*

*Strictly speaking, almost the whole thing — you’ll often want to hold at least a sample of the data back for model testing.
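
For concreteness, a minimal Spark-based sketch of that conventional sample-then-hold-out workflow; the dataset path, fractions, and seed are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: model on a sample because the modeling stack can't handle
// the full volume, and hold part of the data back for model testing.
object SampleAndHoldOut {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sample-vs-full").setMaster("local[*]"))
    val events = sc.textFile("hdfs:///retail/clickstream")   // hypothetical path

    // A 10% sample of the full data set ...
    val sample = events.sample(withReplacement = false, fraction = 0.1, seed = 42L)

    // ... of which 80% goes to training and 20% is held back for testing.
    val Array(train, holdout) = sample.randomSplit(Array(0.8, 0.2), seed = 42L)

    println(s"train: ${train.count()}, holdout: ${holdout.count()}")
    sc.stop()
  }
}
```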

Well, WibiData’s couple of Very Famous Department Store customers have tested WibiData’s ability to model against an entire database vs. their alternative predictive modeling stacks’ need to sample data. WibiData says that both report significantly better results from training over the whole data set than from using just samples.

4. ScalingData is on the bandwagon for Spark Streaming and Kafka.

5. Derrick Harris and Pivotal turn out to have been earlier than me in posting about Tachyon bullishness.

6. With the Hortonworks deal now officially priced, Derrick was also free to post more about/from Hortonworks’ pitch. Of course, Hortonworks is saying Hadoop will be Big Big Big, and suggesting we should thus not be dismayed by Hortonworks’ financial performance so far. However, Derrick did not cite Hortonworks actually giving any reasons why its competitive position among Hadoop distribution vendors should improve.

Beyond that, Hortonworks says YARN is a big deal, but doesn’t seem to like Spark Streaming.

Categories: Other

A few numbers from MapR

DBMS2 - Wed, 2014-12-10 00:55

MapR put out a press release aggregating some customer information; unfortunately, the release is a monument to vagueness. Let me start by saying:

  • I don’t know for sure, but I’m guessing Derrick Harris was incorrect in suspecting that this release was a reaction to my recent post about Hortonworks’ numbers. For one thing, press releases usually don’t happen that quickly.
  • And as should be obvious from the previous point — notwithstanding that MapR is a client, I had no direct involvement in this release.
  • In general, I advise clients and other vendors to put out the kind of aggregate of customer success stories found in this release. However, I would like to see more substance than MapR offered.

Anyhow, the key statement in the MapR release is:

… the number of companies that have a paid subscription for MapR now exceeds 700.

Unfortunately, that includes OEM customers as well as direct ones; I imagine MapR’s direct customer count is much lower.

In one gesture to numerical conservatism, MapR did indicate by email that it counts by overall customer organization, not by department/cluster/contract (i.e., not the way Hortonworks does).

The MapR press release also said:

As of November 2014, MapR has one or more customers in eight vertical markets that have purchased more than one million dollars of MapR software and services.  These vertical markets are advertising/media, financial services, healthcare, internet, information technology, retail, security, and telecom.

Since the word “each” isn’t in that quote, we don’t even know whether MapR is referring to individual big customers or just general sector penetration. We also don’t know whether the revenue is predominantly subscription or some other kind of relationship.

MapR also indicated that the average customer more than doubled its annualized subscription rate vs. a year ago; the comparable figure — albeit with heavy disclaimers — from Hortonworks was 25%.

Categories: Other

Hadoop’s next refactoring?

DBMS2 - Sun, 2014-12-07 08:59

I believe in all of the following trends:

  • Hadoop is a Big Deal, and here to stay.
  • Spark, for most practical purposes, is becoming a big part of Hadoop.
  • Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.

Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:

  • People would like this to be true, although in most cases only as one of several cluster computing platforms.
  • Hadoop, when viewed as an operating system, is extremely primitive.
  • Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.

There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.

Recall now that Hadoop — like much else in IT — has always been about two things: data storage and program execution. The evolution of Hadoop program execution to date has been approximately:

  • Originally, MapReduce and JobTracker were the way to execute programs in Hadoop, period, at least if we leave HBase out of the discussion.
  • In a major refactoring, YARN replaced a lot of what JobTracker did, with the result that different program execution frameworks became easier to support.
  • Most of the relevant program execution frameworks — such as MapReduce, Spark or Tez — have data movement and temporary storage near their core.

Meanwhile, Hadoop data storage is mainly about HDFS (Hadoop Distributed File System). Its evolution, besides general enhancement, has included the addition of file types suitable for specific kinds of processing (e.g. Parquet and ORC to accelerate analytic database queries). Also, there have long been hacks that more or less bypassed central Hadoop data management, and let data be moved in parallel on a node-by-node basis. But several signs suggest that Hadoop data storage should and will be refactored too. Three efforts in particular point in that direction:

The part of all this I find most overlooked is inter-program data exchange. If two programs both running on Hadoop want to exchange data, what do they do, other than reading and writing to HDFS, or invoking some kind of a custom connector? What’s missing is a nice, flexible distributed memory layer, which:

  • Works well with Hadoop execution engines (Spark, Tez, Impala …).
  • Works well with other software people might want to put on their Hadoop nodes.
  • Interfaces nicely to HDFS, Isilon, object storage, et al.
  • Is fully parallel any time it needs to talk with persistent or external storage.
  • Can be fully parallel any time it needs to talk with any other software on the Hadoop cluster.

Tachyon could, I imagine, become that. HDFS caching probably could not.
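
To make the inter-program exchange point concrete, here is a minimal Spark sketch that publishes data through a Tachyon-style shared path; it assumes the Tachyon client is configured on the cluster, and the master address, port, and paths are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Program A: an illustrative Spark job that publishes its output to a shared,
// memory-speed file layer instead of (or in addition to) plain HDFS.
object Producer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("producer").setMaster("local[*]"))
    sc.textFile("hdfs:///raw/events")                        // hypothetical input
      .filter(_.contains("purchase"))
      .saveAsTextFile("tachyon://tachyon-master:19998/shared/purchases") // invented URI
    sc.stop()
  }
}

// Program B -- possibly a different engine entirely -- later reads
// "tachyon://tachyon-master:19998/shared/purchases" through the same
// Hadoop-compatible filesystem interface, with no custom connector between
// the two programs and, ideally, no trip to spinning disk.
```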

In the past, I’ve been skeptical of in-memory data grids. But now I think that such a grid could take Hadoop to the next level of generality and adoption.

Related links

Categories: Other

Notes on the Hortonworks IPO S-1 filing

DBMS2 - Sun, 2014-12-07 07:53

Given my stock research experience, perhaps I should post about Hortonworks’ initial public offering S-1 filing. :) For starters, let me say:

  • Hortonworks’ subscription revenues for the 9 months ended last September 30 appear to be:
    • $11.7 million from everybody but Microsoft, …
    • … plus $7.5 million from Microsoft, …
    • … for a total of $19.2 million.
  • Hortonworks states subscription customer counts (as per Page 55 this includes multiple “customers” within the same organization) of:
    • 2 on April 30, 2012.
    • 9 on December 31, 2012.
    • 25 on April 30, 2013.
    • 54 on September 30, 2013.
    • 95 on December 31, 2013.
    • 233 on September 30, 2014.
  • Per Page 70, Hortonworks’ total September 30, 2014 customer count was 292, including professional services customers.
  • Non-Microsoft subscription revenue in the quarter ended September 30, 2014 seems to have been $5.6 million, or $22.5 million annualized. This suggests Hortonworks’ average subscription revenue per non-Microsoft customer is a little over $100K/year.
  • This IPO looks to be a sharply “down round” vs. Hortonworks’ Series D financing earlier this year.
    • In March and June, 2014, Hortonworks sold stock that subsequently was converted into 1/2 a Hortonworks share each at $12.1871 per share.
    • The tentative top of the offering’s price range is $14/share.
    • That’s also slightly down from the Series C price in mid-2013.

And, perhaps of interest only to me — there are approximately 50 references to YARN in the Hortonworks S-1, but only 1 mention of Tez.

Overall, the Hortonworks S-1 is about 180 pages long, and — as is typical — most of it is boilerplate, minutiae or drivel. As is also typical, two of the most informative sections of the Hortonworks S-1 are:

The clearest financial statements in the Hortonworks S-1 are probably the quarterly figures on Page 62, along with the tables on Pages F3, F4, and F7.

Special difficulties in interpreting Hortonworks’ numbers include:

  • A large fraction of revenue has come from a few large customers, most notably Microsoft. Details about those revenues are further confused by:
    • Difficulty in some cases getting a fix on the subscription/professional services split. (It does seem clear that Microsoft revenues are 100% subscription.)
    • Some revenue deductions associated with stock deals, called “contra-revenue”.
  • Hortonworks changed the end of its fiscal year from April to December, leading to comparisons of a couple of eight-month periods.
  • There was a $6 million lawsuit settlement (some kind of employee poaching/trade secrets case), discussed on Page F-21.
  • There is some counter-intuitive treatment of Windows-related development (cost of revenue rather than R&D).

One weirdness is that cost of professional services revenue far exceeds 100% of such revenue in every period Hortonworks reports. Hortonworks suggests that this is because:

  • Professional services revenue is commonly bundled with support contracts.
  • Such revenue is recognized ratably over the life of the contract, as opposed to a more natural policy of recognizing professional services revenue when the services are actually performed.

I’m struggling to come up with a benign explanation for this.

In the interest of space, I won’t quote Hortonworks’ S-1 verbatim; instead, I’ll just note where some of the more specifically informative parts may be found.

  • Page 53 describes Hortonworks’ typical sales cycles (they’re long).
  • Page 54 says the average customer has increased subscription payments 25% year over year, but emphasizes that the sample size is too small to be reliable.
  • Pages 55-63 have a lot of revenue and expense breakdowns.
  • Deferred revenue numbers (which are a proxy for billings and thus signed contracts) are on Page 65.
  • Pages II 2-3 list all (I think) Hortonworks financings in a concise manner.

And finally, Hortonworks’ dealings with its largest customers and strategic partners are cited in a number of places. In particular:

  • Pages 52-3 cover dealings with Yahoo, Teradata, Microsoft, and AT&T.
  • Pages 82-3 discuss OEM revenue from Hewlett-Packard, Red Hat, and Teradata, none of which amounts to very much.
  • Page 109 covers the Teradata agreement. It seems that there’s less going on than originally envisioned, in that Teradata made a nonrefundable prepayment far greater than turns out to have been necessary for subsequent work actually done. That could produce a sudden revenue spike or else positive revenue restatement as of February, 2015.
  • Page F-10 has a table showing revenue from Hortonworks’ biggest customers (Company A is Microsoft and Company B is Yahoo).
  • Pages F37-38 further cover Hortonworks’ relationships with Yahoo, Teradata and AT&T.

Correction notice: Some of the page numbers in this post were originally wrong, surely because Hortonworks posted an original and amended version of this filing, and I got the two documents mixed up.  A huge Thank You goes to Merv Adrian for calling my attention to this, and I think I’ve now fixed them. I apologize for the errors!

Related links

Categories: Other

Reminder: Fishbowl Solutions Webinar Tomorrow at 1 PM CST

There’s still time to register for the webinar that Fishbowl Solutions and Oracle will be holding tomorrow from 1 PM-2 PM CST! Innovation in Managing the Chaos of Everyday Project Management will feature Fishbowl’s AEC Practice Director Cole Orndorff. Orndorff, who has a great deal of experience with enterprise information portals, said the following about the webinar:

“According to Psychology Today, the average employee can lose up to 40% of their productivity switching from task to task. The number of tasks executed across a disparate set of systems over the lifecycle of a complex project is overwhelming, and in most cases, 20% of each solution is utilized 80% of the time.

I am thrilled to have the opportunity to present on how improving workforce effectiveness can enhance your margins. This can be accomplished by providing a consistent, intuitive user experience across the diverse systems project teams use and by reusing the intellectual assets that already exist in your organization.”

To register for the webinar, visit Oracle’s website. To learn more about Fishbowl’s new Enterprise Information Portal for Project Management, visit our website.


Categories: Fusion Middleware, Other