The elusive XPath nodeset serialization
I have been involved in various capacities with five different specifications that define a GET (or GET-like) operation that takes as input an XPath expression used to pinpoint the subset of the XML document that should be retrieved (here is a quick history as of a couple of years ago, more has happened since). And I must shamefully admit that all but one are simply impossible to implement in an interoperable way.
That’s because they instruct implementers to return an XPath nodeset in the response SOAP message but say nothing about how to serialize the nodeset. While an XPath nodeset contains the kind of things that make up an XML document, it is not an XML document by itself. There is an infinite number of possible ways to serialize an XPath nodeset into XML. To have any hope of interoperability on this, a serialization algorithm has to be clearly described by the specification. Which hasn’t happened.
Let’s start with WS-ResourceProperties (WS-RP). It has a QueryResourceProperties operation that takes an XPath expression as input. The specification says that “the response MUST contain an XML serialization of the results of evaluating the QueryExpression against the resource properties document“. Great, thanks. The example provided happens to return a nodeset with only one node (a boolean), which is implicitly serialized into the text representation of that boolean. What if there is more than one node in the nodeset? What about other types of nodes?
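To make the ambiguity concrete, here is a minimal Python sketch. The document, the element names and both serialization functions are made up for illustration; none of them comes from WS-RP. It shows that a query result is a list of in-memory objects, and that at least two serializations of a multi-node result are equally defensible.

```python
import xml.etree.ElementTree as ET

# Hypothetical resource properties document; names are invented for this sketch.
DOC = ET.fromstring(
    "<props><disk id='a'>10</disk><disk id='b'>20</disk></props>"
)

# ElementTree only supports a subset of XPath, which is enough here.
nodes = DOC.findall("./disk")  # a Python list of Element objects, not XML


def serialize_concatenated(result):
    """Option 1: write the nodes one after another, with no wrapper."""
    return "".join(ET.tostring(n, encoding="unicode") for n in result)


def serialize_wrapped(result, wrapper="result"):
    """Option 2: put the nodes inside a wrapper element (the name is arbitrary)."""
    root = ET.Element(wrapper)
    root.extend(result)
    return ET.tostring(root, encoding="unicode")


print(serialize_concatenated(nodes))
print(serialize_wrapped(nodes))
```

The first output has two top-level elements, so it is not a well-formed XML document on its own. The second is well-formed, but only because of a wrapper element whose name a specification would have to fix for two implementations to agree.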
Moving on to WS-Management, which defines a SOAP header that uses XPath to qualify a WS-Transfer GET request such that it only retrieves a subset of the target XML document. While it does a better job than WS-RP at describing the input (e.g. it specifies the context node and what namespace declarations are in scope for the XPath evaluation) it is even more cavalier than WS-RP in describing the output: “the output (lines 53-55) is like that supplied by a typical XPath processor and might or might not contain XML namespace information or attributes“. By “a typical XPath processor” we should understand MSXML I suppose. But as far as I know a “typical XPath processor” doesn’t return XML, it returns language-specific data structures (e.g. a C# or Java object, like a nu.xom.Nodes instance). And here too, the examples only use single-node nodesets.
WS-ResourceTransfer (WS-RT) was supposed to be the convergence of these two efforts, so presumably it would have learned from their mistakes. While it is better written in general than its predecessors, it fails just as badly with regard to specifying the nodeset serialization. And once again, the example provided uses a nodeset with just one node.
And then came the CMDBf query operation which, for some unclear reason, was deemed in need of a built-in XPath transformation of records. As I pointed out in my review of CMDBf 1.0 at the time, this feature was added without taking the pain to define the XML serialization of the resulting nodeset. And there isn’t even an example of the XPath serialization.
It is sad in a way, but the only specification that acknowledges the problem and addresses it came before any of the four above even got started. It is the WSMF (Web Services Management Framework) work that we did at HP, and more specifically the “note on dynamic attributes and meta information” (not available at HP anymore but available from archive.org). This specification was the first one to define a GET operation that is qualified by an XPath expression. Unlike its successors it also explicitly narrowed down the types of nodes that could be selected (“The manager MUST NOT send as input an XPath statement that returns a nodeset containing nodes other than element, attribute and namespace nodes“). And for those valid types it described how to serialize them in XML (“When a node in the result nodeset is an attribute node, for the sake of the response it is serialized as an element node which has the same name as the name of the original attribute (see example 4 for an illustration). The element is in the same namespace as the namespace the attribute it represents is in. This applies to namespace nodes as well, they are serialized like an attributes in the xmlns namespace“). Turning an attribute into an element of the same QName might not be the smartest thing in retrospect (after all there may be an element by that QName already) but at least we recognized and addressed the problem.
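The attribute rule quoted above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as quoted, not code from the WSMF note: the function name, the sample document and the urn:example:mgmt namespace are hypothetical, and namespace nodes are left out.

```python
import xml.etree.ElementTree as ET


def attribute_as_element(qname, value):
    """Serialize an attribute node as an element with the same qualified name.

    `qname` uses ElementTree's "{namespace}local" notation, or a bare local
    name for an attribute that is in no namespace.
    """
    elem = ET.Element(qname)
    elem.text = value
    return ET.tostring(elem, encoding="unicode")


# Hypothetical managed-resource document with one namespaced attribute.
doc = ET.fromstring(
    '<server xmlns:m="urn:example:mgmt" m:state="up" name="web01"/>'
)
for qname, value in doc.attrib.items():
    print(attribute_as_element(qname, value))
```

The namespaced attribute comes back as an element in the same namespace (ElementTree picks its own prefix), which also makes the weakness visible: nothing distinguishes this element from a real child element with the same QName.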
But all is good now, I am told, because XPath 2.0 is here, along with a clean data model and a well-described serialization.
Not so. Anyone wanting to use XPath for a SOAP-based query language still would have to specify a serialization.
The first problem with the W3C serialization is that the XML output method doesn’t work for all nodesets. Try to use it on a nodeset that contains a top-level attribute node and you get error err:SENR0001. And even for the nodesets it accepts, it sometimes returns less-than-useful results. For example, if your XPath is of the form /employee/name/text() and you have four employees, the result will look something like this:
“Joe SmithKathy O’ConnorHelen MartinBrian Jones”
Concatenated text values without separators. I guess W3C is like a department store, they don’t offer complimentary wrapping anymore…
That’s why the nux.xom.xquery.ResultSequenceSerializer class had to define its own wrapping mechanism to produce a useful XML serialization. The API gives you the choice between the W3C_ALGORITHM and the WRAP_ALGORITHM.
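The difference between the two approaches can be sketched in Python. This is not the Nux API and not its actual wrapper format: the staff document and the results/item element names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document with the four employees from the example above.
DOC = ET.fromstring(
    "<staff>"
    "<employee><name>Joe Smith</name></employee>"
    "<employee><name>Kathy O'Connor</name></employee>"
    "<employee><name>Helen Martin</name></employee>"
    "<employee><name>Brian Jones</name></employee>"
    "</staff>"
)

# Stand-in for the text() step: ElementTree's XPath subset cannot select text
# nodes directly, so the text of each <name> element is collected by hand.
texts = [name.text for name in DOC.findall("./employee/name")]


def serialize_plain(values):
    """Adjacent text nodes simply run together, with no separators."""
    return "".join(values)


def serialize_wrapped(values, wrapper="item"):
    """Give each value its own element so that item boundaries survive."""
    root = ET.Element("results")
    for value in values:
        ET.SubElement(root, wrapper).text = value
    return ET.tostring(root, encoding="unicode")


print(serialize_plain(texts))
print(serialize_wrapped(texts))
```

A receiver of the plain form cannot tell where one name ends and the next begins. The wrapped form keeps the boundaries, but only if both sides agree on the wrapper names in advance.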
Bottom line, and however much some would like to think of it that way, XPath (1 or 2) is not an XML subsetting/transformation mechanism. It could be used to create one (as XSLT does), but you have to do your own plumbing.
In addition to the technical aspects of this discussion, what else can be learned from this sad state of things? The fact that all these specifications define an XPath-driven query mechanism that is simply broken (beyond the simplest use cases) without anyone even noticing tells me that there isn’t a real need for full XPath query over SOAP (and I am talking about XPath 1.0, the introduction of XPath 2.0 in CMDBf is even more out there). A way to retrieve individual elements (and maybe text values) is all that is needed for 99% of the use cases addressed by these specifications. Users would be better served (especially in a version 1.0) by specifications that cover the simple case correctly than by overly generic, complex and poorly documented features. There is always time to add features later if the initial specification is successful enough that users encounter its limitations.
Vertica update
Another TDWI conference approaches. Not coincidentally, I had another Vertica briefing. Primary subjects included some embargoed stuff, plus (at my instigation) outsourced data marts. But I also had the opportunity to follow up on a couple of points from February’s briefing, namely:
Vertica has about 35 paying customers. That doesn’t sound like a lot more than they had a quarter ago, but first quarters can be slow.
Vertica’s list price is $150K/terabyte of user data. That sounds very high versus the competition. On the other hand, if you do the math versus what they told me a few months ago — average initial selling price $250K or less, multi-terabyte sites — it’s obvious that discounting is rampant, so I wouldn’t actually assume that Vertica is a high-priced alternative.
Vertica does stress several reasons for thinking their TCO is competitive. First, with all that compression and performance, they think their hardware costs are very modest. Second, with the self-tuning, they think their DBA costs are modest too. Finally, they charge only for deployed data; the software that stores copies of data for development and test is free.
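The discounting arithmetic behind the list-price point can be spelled out. The $150K/terabyte list price and the $250K average initial selling price come from the text above; the site sizes are assumptions, since all I was told is "multi-terabyte".

```python
LIST_PRICE_PER_TB = 150_000  # list price per terabyte of user data, from the post
AVG_INITIAL_SALE = 250_000   # quoted as "$250K or less", so an upper bound


def implied_discount(terabytes):
    """Minimum discount off list implied by a $250K sale at the given size."""
    list_total = LIST_PRICE_PER_TB * terabytes
    return 1 - AVG_INITIAL_SALE / list_total


# Hypothetical site sizes; the post only says "multi-terabyte".
for tb in (2, 5, 10):
    list_total = LIST_PRICE_PER_TB * tb
    print(f"{tb} TB: list ${list_total:,}, discount >= {implied_discount(tb):.0%}")
```

Even at the smallest multi-terabyte size of 2 TB, list price would be $300K, so a $250K sale already implies a discount of about 17%. At 5 TB the implied discount is about two thirds.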
Database blades are not what they used to be
In which we bring you another instantiation of Monash’s First Law of Commercial Semantics: Bad jargon drives out good.
When EnterpriseDB announced a partnership with Truviso for a “blade,” I naturally assumed they were using the term in a more-or-less standard way, and hence believed that it was more than a “Barney” press release.* Silly me. Rather than referring to something closely akin to “datablade,” EnterpriseDB’s “blade” program turns out just to be a catchall set of partnerships.
*A “Barney” announcement is one whose entire content boils down to “I love you; you love me.”
According to EnterpriseDB CTO Bob Zurek, the main features of the “blade” program include:
- Accreditation
- Joint distribution, including distribution by the blade partner of Postgres Plus
- Interface between the blade partner and EnterpriseDB’s field organization
Of the 16 blade partnerships announced in the initial press release, only one much resembles the datablade concept. That would be HyperBac, which is offering compression and encryption, as part of high-performance backup. (Bob says HyperBac’s compression reduces exported file size by around 90%, and it’s also extremely fast.) From where I sit, that’s a modified data access method, and hence worthy of the term “blade.”
Bob said that the next closest thing EnterpriseDB has to a true datablade at this time, and getting closer, actually is none of the other 15 partnerships. It’s Oracle compatibility. That makes sense; Oracle compatibility starts in the parser, and might have data access method and hence optimization implications as well. However, in saying this Bob presumably was not counting support for datatypes such as text and geospatial. Unless I’m very wrong about how they’re implemented, those are about as genuine as datablades ever get.
Outsourced data marts
Call me slow on the uptake if you like, but it’s finally dawned on me that outsourced data marts are a nontrivial segment of the analytics business. For example:
- I was just briefed by Vertica, and got the impression that data mart outsourcers may be Vertica’s #3 vertical market, after financial services and telecom. Certainly it seems like they are Vertica’s #3 market if you bundle together data mart outsourcers and more conventional OEMs.
- When Netezza started out, a bunch of its early customers were credit data-based analytics outsourcers like Acxiom.
- After nagging DATAllegro for a production reference, I finally got a good one: TEOCO. TEOCO specializes in figuring out whether inter-carrier telecom bills are correct. While there’s certainly a transactional invoice-processing aspect to this, the business seems to hinge mainly around doing calculations to figure out correct charges.
- I was talking with Pervasive about Pervasive Datarush, a beta product that lets you do super-fast analytics on data even if you never load it into a DBMS in the first place. I challenged them for use cases. One user turns out to be an insurance claims rule-checking outsourcer.
- One of Infobright’s references is a French CRM analytics outsourcer, 1024 Degres.
- 1010data has built up a client base of 50-60, including a number of financial and retail blue-chippers, with a soup-to-nuts BI/analysis/columnar database stack.
- I haven’t heard much about Verix in a while, but their niche was combining internal sales figures with external point-of-sale/prescription data to assess retail (especially pharma) microtrends.
To a first approximation, here’s what I think is going on.
Privacy laws force some outsourcing. It’s often OK to use credit data to decide what you’ll market at whom, even when it’s not OK to actually see the credit data itself. What’s more, in some cases data can’t leave a country, so if you don’t have critical business mass in that particular country, it’s natural to use an outsourcer who does.
Privacy even aside, owners of proprietary data are natural analytics outsourcers. Either you ship your data to your customers to do with as they please — and impose on them the expense of managing it — or you manage it for them.
Analytic “secret sauce” software providers also are natural outsourcers. Most proprietary analytic rules are pretty simple-minded. Outsourcing preserves mystique and pricing power.
The usual benefits of SaaS apply. Fast set-up, no fixed costs, etc. are all goodness, just as they are in the transactional world.
With that as background, the big change in the analytics outsourcing market is the same as the one sweeping the rest of the analytics world: interactive access to detail data is finally becoming affordable. If you just run weekly or monthly reports, there may be no reason to distinguish between analytic and transactional processing. But if you want to allow ad-hoc query, unlimited drilldown, or live dashboards, then you’re talking a serious data mart technology stack.
And I do mean “data mart”. Outsourcing an enterprise data warehouse, with all of your proprietary transactional data, doesn’t make much sense unless you’re a complete SaaS shop already outsourcing that data in the first place.
System Center “Cross Platform Extension”: too many distractions
I was hoping that by the time MMS was over there would be more clarity about the “Cross Platform Extension” to System Center that Microsoft announced there. But most of the comments I have seen have focused on two non-technical aspects: Microsoft is interested in heterogeneous management and Microsoft makes use of open source. That’s also the focus of Coté’s coverage.
So what? Is it still that exciting, in 2008, to learn that Microsoft recognizes that Linux and OSS are major players in enterprise computing? If Steve Ballmer eventually gets hold of Yahoo, do you think his first priority will be to move all the servers to Windows or to build up its search and advertising audience? It has now been 10 years since the Halloween documents came out. They can be seen as the start of Microsoft’s realization that Linux/OSS are here for good. It is not surprising to see that one of their main authors is now the driving force behind WS-Management, an effort that illustrates the acceptance of heterogeneity and the need to deal with it (on Microsoft’s terms if possible, of course). The WS-Management effort started years ago and it was a clear sign that Microsoft knew it had to tackle heterogeneous management (despite the reassuring talk that “it’s all about making Windows the most manageable platform” to HP and others). Basically, Microsoft is using WS-Management to support heterogeneity without having to do too much work: by creating an industry standard that everyone writes to and that Microsoft uses internally. Heterogeneous management is intrinsic to DSI if DSI is to be anything more than a demo.
But all of this was known before MMS 2008 to anyone who was paying attention. Instead of all this Microsoft/OSS/heterogeneous talk, I am a lot more interested in the technical aspects of the “Cross Platform Extension”.
OpenPegasus has been around for a long time, as a C++ CIMOM with a bunch of associated providers and CIM-XML interoperability over HTTP with CIM clients. I don’t know where WS-Management support was on the OpenPegasus development timeline, but even without Microsoft getting involved it would have eventually happened. And this should have been sufficient for System Center to access the CIMOM (BTW, does System Center not support CIM-XML when WS-Management is not present and if it does then what is different in practice with WS-Management?).
I can see how Microsoft would bring some extra (and much welcome) development resources for the WS-Management implementation (BTW the guys at Intel already have an open-source C implementation of WS-Management) as well as some extra marketing/visibility/distribution. Nice, but not earth-shattering. Do they bring anything else to OpenPegasus?
And what else is in the “Cross Platform Extension” in addition to an OpenPegasus WS-Management-capable CIMOM? Is there any extra modeling capability beyond CIM? Any Microsoft-specific classes? Any discovery/reconciliation capability? How much actual configuration management versus just monitoring? Security? Health models? Desired state management? Or is it just a WS-Management CIMOM? Any pointer to specific information is welcome.
Of course the underlying question is whether anyone other than Microsoft can manage resources that have an OpenPegasus-based System Center management pack on them. The Open Management Consortium guys have talked about an open management agent. Could, against all expectations, Microsoft be the one delivering it?
In the IT management world, there are the big 4 (HP, BMC, CA and IBM), the little 4 (Zenoss, Hyperic, GroundWorks and openQRM) and the mighty 3 (Oracle, Microsoft and EMC). Sorry John, I am reclaiming the use of the “mighty” term: your “mighty 2” (or 2.5) are really still the “little 2” (or 2.5). At least for now.
The interesting thing is that in that industry configuration there are topics on which the little ones and the mighty ones share common interests. For example, the big 4 have a lot more management packs for all kinds of resources, built up over the years. Some standard-based mechanism that partially resets the stage helps the little ones and the mighty ones better compete against the big 4. Even better if it has an attractive (and extensible) implementation ready in the form of an agent. But let’s be clear that it takes more than a CIMOM to make a management pack. You need domain-specific expertise in the form of health models, deployment/configuration scripts and/or descriptors, configuration validation, role management etc. Thus my questions about what else (beyond CIM over WS-Management) Microsoft is bringing to the table. SML and CML are supposed to address this space, but I didn’t hear them mentioned once in the MMS coverage.
[UPDATED on 2008/5/7: Another perspective on Microsoft and open source: Microsoft Ex-Pats Developing Open Source Software Outside of Redmond]
[UPDATED 2008/5/7: I got an answer to the question about System Center support for CIM-XML: it doesn't have it. So indeed it's either WS-Management or WMI. If you're a Linux box, that means it's WS-Management.]
Truviso and EnterpriseDB blend event processing with ordinary database management
Truviso and EnterpriseDB announced today that there’s a Truviso “blade” for Postgres Plus. By email, EnterpriseDB’s Bob Zurek endorsed my tentative summary of what this means technically, namely:
- There’s data being managed transactionally by EnterpriseDB.
- Truviso’s DML has all along included ways to talk to a persistent Postgres data store.
- If, in addition, one wants to do stream processing things on the same data, that’s now possible, using Truviso’s usual DML.
Note: Extended-relational DBMS like Postgres, Oracle, DB2, and Informix/Illustra have long offered the ability to add blades/cartridges. It’s easy to understand what these do when they simply add native management for a new datatype, and extend the parser, optimizer, and data access methods accordingly. But blades are used in other ways as well, and I’ve always found that somewhat confusing. A little bit of that appears to be going on in this case.
Bob added that there have been a lot of inquiries about the announcement today, without specifying from whom. Truviso marketing chief Roman Bukary, late of SAP, sent over some generic use cases, which pretty much boil down to my first two bullet points above. (More precisely, they agree if you replace “transactionally” with “persistently”; Roman also foresees data warehousing uses.)
I like this announcement. With one probable exception, it’s a good fit for every major use of event processing; the exception is super-low-latency apps, where no extraneous overhead is tolerable. (Those are found mainly in algorithmic trading, but could arise in security and network management as well.) But then, Truviso is being positioned away from its initial currency trading focus anyway.
Super-low-latency aside, the other big current use case for event processing is data reduction. I.e., you have a lot of incoming data – e.g., via satellite telemetry or intelligence intercepts or network monitoring sensors, or monitoring character movement in an MMO (Massively Multiplayer Online) game. You try to grab all the “interesting” stuff, while disregarding or even throwing away the rest. But the “throwing away” part is a little worrisome. So if instead you can seamlessly persist everything, even for a short period of time (e.g., measured in days), that’s goodness. Even if you can’t keep it all even for a short while – well, if the point of data reduction is to retain only a fraction of the incoming data, this scheme could make it easier to persist the keepers.
Another current use case for event processing is rules engines. Progress Apama has a rules paradigm all the way down, while Coral8 tells happily of a customer who uses event processing for all kinds of rules-based real-time CRM. But the Coral8 example is closely integrated with conventional persistent data stores, and the same is likely for other similar applications. Business activity monitoring (BAM) would be a special case of this.
As you know, my ultimate dream for business intelligence/analytic uses of event processing goes beyond BAM. I think many individuals in an enterprise should each track many different (but related) KPIs (Key Performance Indicators). Current query loads for reporting, dashboards, ad hoc query, etc. could easily go up by 2-3 orders of magnitude. When that happens, you want to consider different ways of doing things, specifically memory-centric ones. Normal memory-centric data processing might get the job done, but I have a suspicion that the right architecture will wind up looking a lot like event processing.
Once again, that’s a use for event processing that naturally integrates tightly with a persistent database.
Related links:
- An earlier press release declaring Truviso’s love for PostgreSQL
Oracle/BEA, WS-Management and MMS: announcements of the day
A few announcements came out today.
The good news: Oracle’s acquisition of BEA closes. Unobstructed technical work can start.
The conveniently-timed news: WS-Management officially a standard.
Speaking of MMS 2008, any announcement there? Not much so far, as explained by Ian Blyth. If I parse the cross-platform part of the press release correctly, it says that management of non-Windows resources by Operations Manager is based on WS-Management, but WS-Management alone is not enough so Microsoft is providing a development kit for several non-Microsoft operating systems. It will be interesting to see what exactly is produced by these management packs. Can they be called on by management tools other than Operations Manager or is the stuff that rides on top of WS-Management too proprietary to allow this? No word on SML/CML.
By the end of the week we may have a clearer picture, including what’s going on with the previously-announced reset on System Center Service Manager. Coté is on the scene and will undoubtedly share his thoughts.
As a side note, the way the MMS main page loads betrays the fact that, in 2008, Microsoft (or more likely its event marketing contractor) is using the same clueless HTML design approach that I first saw in 1995 and recently wrote about. All the text in the center of the MMS home page is contained in one large picture (available here). They didn’t even bother with an “ALT” attribute, so good luck to blind users. The part that says “Registration Overview Page” was made blue and underlined to suggest that it is a link, but it is just a part of the picture. Which, presumably, was supposed to be turned into a link using an image map. Well, turns out they can’t even get that right.
They tried to use a client-side image map (not available in 1995) but somehow the actual map code is commented out in the HTML source:
<!--<map name=Map> <area shape=RECT coords=18,549,210,572 href="registrationoverview.aspx"> <area shape=RECT coords=17,596,222,634 href="registrationoverview.aspx"> </map>-->
As a result, the single most prominent link on the home page is dead. And there is no server-side image map mechanism as a backup (which I remember used to be best practice when client support for client-side image maps was spotty).
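A quick way to confirm this kind of breakage without reading the source by eye is to let an HTML parser report what it sees. The sketch below uses Python's standard html.parser. The PAGE fragment is a hypothetical reduction of the page, modelled on the snippet quoted above, and the image file name is made up.

```python
from html.parser import HTMLParser


class MapFinder(HTMLParser):
    """Count <map> elements seen as live markup versus inside comments."""

    def __init__(self):
        super().__init__()
        self.live_maps = 0
        self.commented_maps = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        if tag == "map":
            self.live_maps += 1

    def handle_comment(self, data):
        # Comment content is passed as raw text and is never parsed as tags.
        if "<map" in data.lower():
            self.commented_maps += 1


def count_maps(html):
    """Return (live map count, commented-out map count) for an HTML string."""
    finder = MapFinder()
    finder.feed(html)
    finder.close()
    return finder.live_maps, finder.commented_maps


# Hypothetical page fragment; only the commented-out map mirrors the real page.
PAGE = (
    '<img src="home.gif" usemap="#Map">'
    '<!--<map name=Map> <area shape=RECT coords=18,549,210,572 '
    'href="registrationoverview.aspx"> </map>-->'
)
print(count_maps(PAGE))
```

The image still points at "#Map", but the parser, like a browser, never sees a map element, only a comment that happens to contain one.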
Looking at the HTML source also reveals that tables are over-used. That’s the kind of HTML I can write, and I don’t mean that as a compliment.
[UPDATED 2008/5/5: As expected/hoped, Coté did share his thoughts on this "cross-platform" move from the MMS floor.]
The Mark Logic story in XML database management
Mark Logic* has an interesting, complex story. They sell a technology stack based on an XML DBMS with text search designed in from the get go. They usually want to be known as a “content” technology provider rather than a DBMS vendor, but not quite always.
*Note: Product name = MarkLogic, company name = Mark Logic.
I’ve agreed to do a white paper and webcast for Mark Logic (sponsored, of course). But before I start serious work on those, I want to blog based on what I know. As always, feedback is warmly encouraged.
Some of the big differences between MarkLogic and other DBMS are:
- MarkLogic’s primary DML/DDL (Data Manipulation/Description Language) is XQuery. Indeed, Mark Logic is in many ways the chief standard-bearer for pure XQuery, as opposed to SQL/XQuery hybrids.
- MarkLogic’s XML processing is much faster than many alternatives. A client told me last year that – in an application that had nothing to do with MarkLogic’s traditional strength of text search – MarkLogic’s performance beat IBM DB2/Viper’s by “an order of magnitude.” And I think they were using the phrase correctly (i.e., 10X or so).
- MarkLogic indexes all kinds of entities and facts, automagically, without any schema-prebuilding. (Nor, I gather, do they depend on individual documents carrying proper DTDs.) So there actually isn’t a lot of DDL. (Mark Logic claims in one test MarkLogic had more or less 0 DDL, vs. 20,000 lines in DB2/Viper.) What MarkLogic indexes includes, as Mark Logic puts it:
  - Every word
  - Every piece of structure
  - Every parent-child relationship
  - Every value.
- As opposed to most extended-relational DBMS, MarkLogic indexes all kinds of information in a single, tightly integrated index. Mark Logic claims this is part of the reason for MarkLogic’s good performance, and asserts that competitors’ lack of full integration often causes overhead and/or gets in the way of optimal query plans. (For example, Mark Logic claims that Microsoft SQL Server’s optimizer is so FUBARed that it always does the text part of a search first.) Interestingly, InterSystems’ object-oriented Caché does pretty much the same thing.
- MarkLogic is proud of its text search extensions to XQuery. I’ve neglected to ask how that relates to the XQuery standards process. (For example, text search wasn’t integrated into the SQL standard until SQL3.)
Other architectural highlights include:
- MarkLogic uses timestamps and appends for updates, rather than updates-in-place, much like Netezza or Illustra. Cleanup is done in the background. As long as your volume of changes (as opposed to inserts or reads) is sufficiently low, this can be more efficient than traditional approaches. Timestamping also makes it easy to write certain application functionality in publishing (“go live” times for content is a current use) and compliance (a possible future).
- MarkLogic is ACID-compliant. Thus, you can read data as soon as it’s inserted, without a separate re-indexing step. Other native XML systems may not have that property (e.g., Mark Logic asserts DB2 Viper doesn’t).
- Mark Logic claims MarkLogic has relatively efficient (optional) range indexes. (This was in response to a question; details are secret.) Inverted-list DBMS like ADABAS and Model 204 have been doing decently efficient range queries for 30 years, so this claim is both credible and not terribly important.
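For readers unfamiliar with the timestamp-and-append approach in the first bullet above, here is a toy Python sketch of the general technique. It is not MarkLogic's implementation, whose details I don't know; the class, its methods and the logical clock are all invented for illustration.

```python
import itertools


class AppendOnlyStore:
    """Toy versioned store: updates append new versions, never overwrite."""

    def __init__(self):
        self._clock = itertools.count(1)  # logical timestamps 1, 2, 3, ...
        self._versions = {}               # key -> list of (timestamp, value)

    def put(self, key, value):
        """Append a new version of `key` and return its timestamp."""
        ts = next(self._clock)
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def get(self, key, as_of=None):
        """Return the newest value with timestamp <= as_of (default: latest)."""
        for ts, value in reversed(self._versions.get(key, [])):
            if as_of is None or ts <= as_of:
                return value
        return None

    def cleanup(self, oldest_needed):
        """Drop versions that no reader at or after `oldest_needed` can see."""
        for key, versions in self._versions.items():
            visible = [i for i, (ts, _) in enumerate(versions) if ts <= oldest_needed]
            if visible:
                # Keep the version current at `oldest_needed` and everything newer.
                self._versions[key] = versions[visible[-1]:]


store = AppendOnlyStore()
t1 = store.put("doc", "draft")
t2 = store.put("doc", "final")
print(store.get("doc"), store.get("doc", as_of=t1))
```

Reads never wait on a rewrite, and reading "as of" a timestamp is the same mechanism that makes "go live" times easy. The cost is the superseded versions that pile up until cleanup runs, which is why this suits workloads with few changes relative to inserts and reads.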
Related links:
- A companion post over on Text Technologies takes a text search view of MarkLogic.
- One of the leading sites on text analytics and general enterprise software marketing, Dave Kellogg’s Mark Logic CEO Blog.
Unhealthy fun with IP aspects of optionality in specifications
The previous blog post has reawakened the spec lawyer in me (on the hobby glamor scale, spec lawyering ranks just below collecting dead bugs). Which brought back to my mind a peculiar aspect of the “Microsoft Open Specification Promise“.
The promise was published to address fears some people had that adopting Microsoft-created specifications (especially non-standard ones) would put them at risk of patent claims from Microsoft. The core of the promise is only two paragraphs long. The first one contains this section:
“To clarify, ‘Microsoft Necessary Claims’ are those claims of Microsoft-owned or Microsoft-controlled patents that are necessary to implement only the required portions of the Covered Specification that are described in detail and not merely referenced in such Specification.”
That seems to pretty clearly state that only the required portions of a specification are covered by this promise. Which is a very significant limitation, as specifications often tend to (over-) use optional features. But if you read further, the list of “Covered Specifications” (those to which the promise applies), contains this statement:
“this Promise also applies to the required elements of optional portions of such specifications.”
I find this very puzzling because it seems to contradict the previous statement. And more importantly, it’s hard to understand what it really means. That’s where the fun starts:
For example, if my spec defines a document <a> with an optional element <b> that itself has an optional sub-element <c>, as in:
<a>
  ...
  <b>
    ...
    <c>...</c>
  </b>
</a>
The <b> element is a required part of the “b” optional portion of the spec (the portion of the spec that defines that element), so I guess it is covered, but is <c>? That’s an optional element of an optional portion (the “b” portion) of the spec, so it isn’t. Unless you consider the portion of the spec that defines <c> (the “c” portion of the spec) to be an optional portion of the spec itself. In which case the <c> element is covered.
But if you take that second line of reasoning, then everything in the spec is covered because for any feature, no matter how “optional” it is, there is a portion (optional or not) of the specification that describes this feature. And if you are implementing that portion, for example the portion that defines element <foo>, by definition element <foo> is required for it (how can an element not be a required part of its own definition?). But if Microsoft intended to cover all parts of the specification, why not say so rather than this recursion-inducing “required elements of optional portions” statement? And if not, why do they choose to only cover optional elements that are one degree removed from the base of the specification?
Wouldn’t it be fun to see a court of law deal with a suit that hinges on this statement (provided that you’re not a party in the suit, of course)?
When a real spec lawyer took a look at this promise, he didn’t comment on the second statement, the one that raises the most questions in my mind.
[UPDATED 2008/4/29: The "promise" has seen many updates. The original (which is the one Andy Updegrove reviewed at the previous link) came out on 2006/9/12. The one I reviewed is dated 2008/3/25. There is no change history on the Microsoft site, but the Wayback machine has archived some older versions. The oldest one I can find is dated 2006/10/23 and it does not contain the sentence about "required elements of optional portions" that puzzles me. So it's likely that the version Andy reviewed didn't include this either and as such was clearly limited to required portions of the specifications (something that Andy pointed out).]
WS-Transfer, its WSDL and its WS-I compliance: the art of engineered uselessness
Several years ago, Chris Ferris wrote a blog entry in which he explains that WS-Transfer is not WS-I Basic Profile (BP) compliant.
Chris’ main point is correct: the WSDL document in appendix II of the WS-Transfer specification is not compliant with the WS-I Basic Profile. But what does this mean and why should one care?
If you search for the word “wsdl” in WS-Transfer, you first find it in the table that declares namespace prefixes used in the specification. But the prefix is not used in the specification, so it could just as well be removed from that table.
We see it next mentioned in the “compliance” boilerplate where it is declared to be the least authoritative of all information in the specification.
The next occurrence is all the way down in section 8, as a reference to the WSDL 1.1 W3C note. The only place where that reference is used, is further below, in Appendix II.
In short, for all practical purposes there is no mention of WSDL in WS-Transfer except for this one appendix that contains a WSDL document. Since there is no MUST or REQUIRED statement that refers to it, it is at best a testing tool that one can use to validate WS-Transfer messages produced. There is no requirement at all that the implementation produces that WSDL (e.g. as a response to a WS-MeX request) or consumes it.
And if you look at the content of the WSDL, it is mostly XML gymnastics aimed at creating “empty” and “any” types to express almost nothing useful about the messages sent and received.
You don’t have to take my statement that the WS-Transfer WSDL is useless at face value. Here are two other proofs:
- Chris doesn’t just point out the WS-I BP violation in the WS-Transfer WSDL, he also proposes a way to fix it. He writes: “I actually think that a more appropriate approach to handling WS-Transfer’s ‘Get’ would be to specify the output message as you would any doc-literal operation and merely annotate the operation with the appropriate wsa:Action attribute values” (he also provides an example). And he is perfectly right. If you really want a WSDL for your WS-Transfer operations, create one that is specific to the resource type (server, toaster…) that you are dealing with. By definition that WSDL can’t be baked into the model-agnostic WS-Transfer specification. While Chris doesn’t say it, the natural conclusion of his remark is that there is no point in having a WSDL in WS-Transfer (because any resource-agnostic WSDL is useless).
- The WS-Transfer XSD and WSDL have been modified, sometimes in backward-incompatible ways, without changing the target namespace. From the original version to the first W3C submission, some minor changes (message names, introduction of WS-Addressing). From the first W3C submission to the current submission, some potentially backward-incompatible changes (the GET input can now be non-empty, the CREATE response can now contain anything as a result of trying to support different versions of WS-Addressing). On top of that, all these XSD and WSDL documents embedded in various versions of the spec are “non-normative”. The normative versions are said to be the ones at xmlsoap.org (XSD, WSDL). Those have not changed, which means that both versions on the W3C web site contain an incorrect version of the XSD/WSDL in the spec. Shouldn’t that lack of XML hygiene be a big deal for a specification that is implemented (via WS-Management, which references the W3C submission) in resources with long product development cycles, such as servers from Dell, HP and others that have WS-Management support directly on the motherboard? It would, if the XSD and WSDL had any relevance for the implementers. The fact that there was no outcry is yet another proof that the WS-Transfer XSD and the WSDL are irrelevant.
So yes, Chris is right that the WS-Transfer WSDL (BTW all versions have the problem that Chris describes even though it could have been fixed in a backward-compatible way when the WSDL was altered) is not WS-I BP compliant. But since that WSDL is useless anyway, this shouldn’t keep anyone up at night. The WS-Transfer WSDL serves no purpose other than to annoy people who like things to be WS-I BP compliant.
But is it just the WS-Transfer WSDL that’s useless, or is it all of WS-Transfer?
I am not planning to go into WS-* vs. REST territory here. To those who are confused by the similarity between the names of WS-Transfer operations and HTTP methods and see WS-Transfer as a way to do “REST over SOAP” I’ll just point out that WS-Transfer is rarely used on its own but rather in conjunction with many other SOAP messages (like those defined by WS-Eventing and WS-Enumeration, plus countless custom operations). So much for uniform interfaces. WS-Transfer, at least as it is used today, is not about REST.
Rather, the reasons why I question the usefulness of WS-Transfer are more pragmatic than architectural. I can think of three potential justifications to carve out WS-Transfer as a separate specification, none of which is really convincing at this point in time.
The first reason is simply to avoid repeating the same text over and over again. If many specifications are going to describe the same SOAP message, just describe it once and refer to that description. Sounds good. But I know of three specifications that use WS-Transfer: WS-Management, WS-MeX and the Devices Profile for Web Services.
WS-MeX and the Devices Profile only use the GET operation. Which means that the only specification text that they can re-use from WS-Transfer is something like “send an empty get request and get something back”. WS-Transfer can’t say what that something is, only the domain-specific specifications can. As a result, you are spending as much time referencing WS-Transfer as would be spent defining a simple GET operation. For all practical purposes, you can implement WS-MeX and the Devices Profile without ever reading WS-Transfer.
The second potential reason is to provide a stand-alone piece of functionality that can be implemented once (e.g. as a library/module) and re-used for different purposes. Something that automatically kicks in when a WS-Transfer wsa:Action is detected. Think of a stand-alone encryption/decryption library for example, that looks for specific SOAP headers. Or WS-Eventing, for which a library can take over the task of managing the subscription lifecycle. Except WS-Transfer defines so little that it’s not clear what a stand-alone WS-Transfer implementation would do. Receive messages and do what with them? It is so tied to the back-end that there isn’t much you can do in a general fashion. Unless you are creating a library for a database product and you see WS-Transfer as a query interface for your database. But this only makes sense if you want to provide more fine-grained access to the XML content, which WS-Transfer does not do.
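To make the point concrete, here is a minimal sketch of what such a stand-alone module could look like. The `TransferDispatcher` class and the toaster handler are hypothetical, and the SOAP 1.2 and 2004/08 WS-Addressing namespaces are assumptions chosen for illustration. Note how little the generic part does: it reads wsa:Action and routes, and everything of substance sits in the back-end-specific handler.

```python
import xml.etree.ElementTree as ET

# Namespace choices are assumptions: SOAP 1.2 and the 2004/08 WS-Addressing version.
SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"
WSA_NS = "http://schemas.xmlsoap.org/ws/2004/08/addressing"
WXF_NS = "http://schemas.xmlsoap.org/ws/2004/09/transfer"


class TransferDispatcher:
    """Hypothetical generic layer: all it can do is route on wsa:Action."""

    def __init__(self):
        self.handlers = {}

    def register(self, operation, handler):
        # Key handlers by the full action URI, e.g. <WXF_NS>/Get.
        self.handlers[f"{WXF_NS}/{operation}"] = handler

    def dispatch(self, soap_xml):
        root = ET.fromstring(soap_xml)
        action = root.findtext(f"{{{SOAP_NS}}}Header/{{{WSA_NS}}}Action")
        handler = self.handlers.get((action or "").strip())
        # Unknown action: not a WS-Transfer message, leave it to other modules.
        return handler(root) if handler else None


GET_REQUEST = f"""<s:Envelope xmlns:s="{SOAP_NS}" xmlns:wsa="{WSA_NS}">
  <s:Header><wsa:Action>{WXF_NS}/Get</wsa:Action></s:Header>
  <s:Body/>
</s:Envelope>"""

dispatcher = TransferDispatcher()
# The resource representation is made up; only the back-end knows what to return.
dispatcher.register("Get", lambda envelope: "<toaster><state>idle</state></toaster>")
print(dispatcher.dispatch(GET_REQUEST))
```

The dispatcher has no way to know what a "Get" should return without being handed the answer by the resource-specific code.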
Which takes us to the third potential value of WS-Transfer, as a foundational specification on which to build extensions. Of the three this is the only one that I believed in at some point. WS-ResourceTransfer (WS-RT) was the main attempt at doing this. Any service that uses WS-Transfer could, via the magic of the SOAP processing model, offer a more precise/powerful access to the resources. But while this was possible in theory it hasn’t really panned out in practice for many reasons:
- Some people (hints: Armonk; Blue) pushed hard to put WS-RT instructions in the body rather than in headers, seriously compromising its ability to seamlessly compose with existing SOAP messages.
- WS-MeX and the Devices Profile typically deal with documents small enough that manipulating them as a whole is rarely a problem. This only leaves WS-Management which has its own “fragment transfer” mechanism so it doesn’t really need a stand-alone mechanism.
- XQuery is now developing support for an update capability.
What then is left, in the Spring of 2008, to justify the need for WS-Transfer as a separate layer, rather than considering it an integral part of WS-Management? Not much. WS-MeX, in an earlier version, used to define its own GET operation and it wouldn’t be any worse off if it had stayed that way (or returned to it). Ditto for the Devices Profile. At this point, it’s mostly a matter of pragmatically cleaning up the mess without creating another one.
In retrospect (color me partially guilty), maybe one shouldn’t use the same architectural rules when attempting to design an interoperable standard stack for an industry as when refactoring a software project. Maybe one should resist the urge to refactor the “code” (or rather the PowerPoint stack) every time one detects the smallest conceptual redundancy. There is a cost in constant changes. There is a cost in specification cross-dependencies. WSDM experienced it first hand with the different versions of WS-Addressing (another dependency that didn’t need to be). WS-Management is seeing it from the perspective of standardization.
ParAccel pricing
I made a round of queries about data warehouse software or appliance pricing, and am posting the results as I get them. Earlier installments featured Teradata and Netezza. Now ParAccel is up.
ParAccel’s software license fees are actually very simple — $50K per server or $100K per terabyte, whichever is less. (If you’re wondering how the per-TB fee can ever be the smaller one, please recall that ParAccel offers a memory-centric approach to sub-TB databases.)
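As a quick illustration of the "whichever is less" rule, here is a small sketch. The function name and the sample configurations are hypothetical, and it assumes that fractional terabytes are billed pro rata, which the pricing statement does not say.

```python
def paraccel_license_fee(servers: int, terabytes: float) -> float:
    """Return the lesser of the per-server and per-terabyte fee, in US dollars."""
    per_server_total = 50_000 * servers
    # Assumption: sub-TB databases are charged proportionally.
    per_terabyte_total = 100_000 * terabytes
    return min(per_server_total, per_terabyte_total)


# Hypothetical configurations, chosen only to exercise both branches.
print(paraccel_license_fee(servers=4, terabytes=10))   # per-server fee is lower
print(paraccel_license_fee(servers=4, terabytes=0.5))  # per-TB fee is lower
```

The per-terabyte branch only wins when the database holds less than half a terabyte per server, which is the memory-centric case mentioned above.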
Details about how much data fits on a node are hard to come by, as is clarity about maintenance costs. Even so, pricing turns out to be one of the rare subjects on which ParAccel is more forthcoming than most competitors.
Yet another data warehouse database and appliance overview
For a recent project, it seemed best to recapitulate my thoughts on the overall data warehouse specialty DBMS and appliance marketplace. While what resulted is highly redundant with what I’ve posted in this blog before, I’m sharing anyway, in case somebody finds this integrated presentation more useful. The original is excerpted to remove confidential parts.
… This is a crowded market, with a lot of subsegments, and blurry, shifting borders among the subsegments.
… Everybody starts out selling consumer marketing and telecom call-detail-record apps. …
Oracle and similar products are optimized for updates above everything else. That is, short rows of data are banged into tables. The main indexing scheme is the “b-tree,” which is optimized for finding specific rows of data as needed, and also for being updated quickly in lockstep with updates to the data itself.
By way of contrast, an analytic DBMS is optimized for some or all of:
- Small numbers of bulk updates, not large numbers of single-row updates.
- Queries that may involve examining or returning lots of data, rather than finding single records on a pinpoint basis.
- Doing arithmetic calculations – commonly simple arithmetic, sorts, etc. – on the data.
Database and/or DBMS design techniques that have been applied to analytic uses include:
- “Denormalizing” the database, by pre-joining tables. This makes queries cheaper, but updates more costly. It’s implicit in single-fact-table designs.
- “Star indexes”, which capture the benefits of denormalization. But they are large, and costly to update.
- “Materialized views”, which precompute query results (joins and/or aggregations). These obviously accelerate queries that use those results, but you have to pay the cost of continually updating them as data changes.
- “Range partitioning”, in which data in (say) certain date ranges is clustered together on disk for more efficient processing.
- “Hypercubes”, aka “MOLAP” (Multi-Dimensional OnLine Analytic Processing). The costs and benefits are extreme forms of those I’ve already cited. At least, the costs are; the benefits aren’t seeming so extreme any more, causing the technology to be increasingly outmoded.
- “Bit-mapped indexes.” This is another approach to indexing that is fast on queries, at the cost of making updates slower. In its pure form, it’s well-suited for columns with low “cardinality” – i.e., a small number of values. (E.g., colors, sizes, etc.) But it can be extended to cover higher-cardinality cases.
- Database administration tools to help with the complex choices involved in writing SQL, selecting indexes, etc.
- Recommended hardware configurations, because the right mix of disks, processors, etc. might otherwise be non-obvious.
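To illustrate the bit-mapped index idea from the list above, here is a toy sketch in which each distinct value of a low-cardinality column gets one bitmap, stored as a Python integer. The function names and sample data are made up, and real products use compressed bitmap formats rather than plain integers.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Expand a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two low-cardinality columns of a hypothetical six-row table.
colors = ["red", "blue", "red", "green", "blue", "red"]
sizes = ["S", "M", "L", "S", "M", "M"]
color_idx = build_bitmap_index(colors)
size_idx = build_bitmap_index(sizes)

# WHERE color = 'red' AND size = 'M' becomes a single bitwise AND.
print(matching_rows(color_idx["red"] & size_idx["M"]))
# WHERE color = 'blue' OR color = 'green' becomes a bitwise OR.
print(matching_rows(color_idx["blue"] | color_idx["green"]))
```

Queries reduce to cheap bitwise operations, while every inserted or changed row has to touch the bitmaps of the affected values, which is the query/update trade-off described above.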
That’s pretty much the list of techniques used in general-purpose DBMS products such as Oracle and Microsoft SQL Server. But if you put them all together, you’re still left with the problems:
- The techniques that greatly accelerate queries also greatly slow down updates.
- You use a lot of extra disk space for all those indexes.
- There’s a tremendous amount of labor involved in getting it all right.
- Because of these drawbacks, you’re likely to optimize only for certain subsets of the queries you’d really like to run. Indeed, you may not make all of your data available for analytic querying.
Specialty analytic DBMS can do a lot better than general-purpose DBMS because:
- They can run on “shared-nothing” MPP (Massively Parallel Processing) architectures. Most vendors make this choice, because:
  - Using larger numbers of smaller parts is fundamentally cheaper, if you don’t have a lot of MPP overhead. Most of the vendors have figured out clever ways to avoid that overhead.
  - For larger databases, I/O becomes an absolute bottleneck. But in a shared-nothing DBMS, you can do I/O truly in parallel.
- If you simplify your software sufficiently, you may be able to get great compression, which has myriad benefits – most obviously to disk costs and I/O, but it can go further than that. Most contenders post-Netezza are good to great at compression. Netezza is playing catch-up. Teradata isn’t really better than Oracle, et al.
- Disks spin slowly. The fastest disk drive you can buy spins at 15,000 RPM, vs. the 1,200 RPM that hard disk technology was introduced with in 1956. (Most systems use 7,200 or 10,000 RPM.) So random-access disk reads have become the single greatest bottleneck to analytic processing. One solution is to optimize your DBMS for table scans or other sequential reads – i.e., read more bytes of data, but at a much higher per-byte rate. To varying degrees, the analytic DBMS with row-based architectures are optimized for sequential reads. I published two white papers focusing on this point in 2007, sponsored by DATAllegro. http://www.monash.com/whitepapers.html
- You also can break rows apart, and organize data by columns. Columnar architectures have tremendous advantages if you only ever want to retrieve a small fraction of a row. They also can help with compression and general query speed. They are hard to update, however. Vertica has some very clever techniques to beat the update speed problem. ParAccel argues that this cleverness isn’t needed, and more straightforward techniques suffice.
- You can have specialized hardware designs or optimizations, even beyond the shared-nothing MPP. Netezza has an FPGA, which is almost a custom chip. Kickfire has some kind of custom chip. Calpont keeps trying and failing with a custom chip. Teradata is a lot like standard hardware, but they have their own switching system. DATAllegro and other vendors do use standard hardware, but rely on more inter-node communication than might otherwise be there. Columnar vendors, however, tend to be fairly hardware-agnostic.
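The "disks spin slowly" point above can be put into numbers with a back-of-the-envelope calculation. The sketch below assumes that the average rotational wait is half a revolution and ignores seek time and transfer time entirely; the function name is made up for illustration.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the target sector: half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm


for rpm in (1_200, 10_000, 15_000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.1f} ms")

speedup = avg_rotational_latency_ms(1_200) / avg_rotational_latency_ms(15_000)
print(f"improvement since 1956: {speedup:.1f}x")
```

A 12.5x gain in rotational speed over five decades is tiny next to the growth in capacity and CPU speed over the same period, which is why sequential reads matter so much.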
…
Beyond raw database size, characteristics of the database and workload that affect which analytic DBMS works best include:
- Do you have to do any significant volume of low-latency updates at all? If so, how low? (15 minute latency is a common but still minority data warehousing requirement, both in cases where there’s a legitimate business benefit and in cases where there is not. Most products meet that requirement, some more gracefully than others.)
- Are your result sets likely to be huge? (E.g., inputs into SAS data mining software). Fairly large? Single-row? Columnar systems are bad at single-row result sets.
- How many queries are likely to be running at once? The ability to handle concurrency well is a function of product maturity even more than basic architecture. Each time Netezza or DATAllegro has a major release, they tell me that now their concurrency is great and confess it wasn’t so hot in the prior version. Very high concurrency is a call center or feeding a website’s personalization. Medium concurrency is reporting and dashboards for a large but not huge enterprise. Low concurrency is serving a few specialized data analysts in a department.
- What absolute response time do you need? (Are you serving a call center? A personalized web site? A user who doesn’t mind tapping her fingers for a few minutes, but doesn’t want to wait a few hours? A user who wants a response within a few seconds?) Different DBMS are optimized a bit differently. But frankly, if a system has great price/performance, it usually will be good in any scenario.
- How much are you doing in the way of arithmetic calculations? An application very light on data volume and heavy on arithmetic is sometimes a genuine excuse for using MOLAP. Otherwise, it’s nice to have good flexibility with a feature called “user defined functions”.
- When you bring back a row, do you typically want the whole row, or are many of the columns of that row just wasted I/O? If it’s the latter, columnar systems shine. This is particularly common in consumer marketing/targeting types of applications, where you may start with 1000 or more columns of data.
- Are you basically querying a single large “fact table” across many “dimensions”, or a small group of closely-related fact tables? Or is the database schema significantly more complicated than that? Vertica only allows one fact table. At the other extreme, Teradata has for decades been optimized for any kind of schema. Most systems let you use any kind of schema you want, but that doesn’t mean they perform well in all scenarios.
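The "wasted I/O" question in the list above is easy to see with a toy model. The sketch below stores the same made-up table row-wise and column-wise and counts how many cells a single-column aggregate has to read. It is a simplification under stated assumptions: real systems read disk blocks rather than cells, and the table and function names are hypothetical.

```python
# Toy data: 3 rows, 4 columns. Consumer-marketing tables may have 1000+ columns.
rows = [
    {"id": 1, "age": 34, "zip": "02139", "spend": 120.0},
    {"id": 2, "age": 51, "zip": "94105", "spend": 80.5},
    {"id": 3, "age": 27, "zip": "60614", "spend": 43.0},
]

# Row layout: one record per row, so reading any column touches whole rows.
row_store = rows

# Column layout: one list per column, so a query reads only the columns it names.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def cells_read_row_store(store, wanted):
    """Every cell of every row is read, whichever columns are wanted."""
    return sum(len(record) for record in store)


def cells_read_column_store(store, wanted):
    """Only the lists for the wanted columns are read."""
    return sum(len(store[name]) for name in wanted)


wanted = ["spend"]  # e.g. SELECT SUM(spend) FROM customers
print(cells_read_row_store(row_store, wanted))        # 12
print(cells_read_column_store(column_store, wanted))  # 3
print(sum(column_store["spend"]))                     # 243.5
```

The flip side is the single-row case: fetching one complete record from the column layout means visiting every column list, which is why columnar systems do poorly on single-row result sets.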
Optimizing WordPress database usage
There’s an amazingly long comment thread on Coding Horror about WordPress optimization. Key points and debates include:
- WordPress makes scads of database calls on every page. (20 is the supposed default number. That sounds a little high to me, but not wholly incredible.)
- Therefore one should use a caching plug-in. WP-Cache is the preferred one. WP-Super-Cache gets some votes as perhaps being even better.
- In theory the database cache should handle most of the problem. (After all, many of those database queries are the same for every page.) In practice, it often doesn’t, even if you use dedicated (as opposed to shared) web hosting.
- LAMP vs. Microsoft stack (uh-oh).
- Drupal vs. WordPress vs. Movable Type vs. Joomla vs. do-it-yourself (uh-oh too).
Another theme is — well, it’s WordPress “theme” design. Do you really need all those calls? The most dramatic example I can think of is one I experienced soon after I started this blog. Some themes have the cool feature that, in the category list on the sidebar, there’s a count of the number of posts in the category. Each category. I love that feature, but its performance consequences are not pretty.
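A minimal sketch of why per-category counts get expensive, using SQLite and a made-up two-table schema (this is not WordPress’s actual schema, and the function names are hypothetical). The first function issues one COUNT query per category, the second gets the same sidebar from a single grouped query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'DBMS'), (2, 'Appliances'), (3, 'Pricing');
    INSERT INTO posts (category_id) VALUES (1), (1), (2), (1), (3), (2);
""")


def counts_one_query_per_category(conn):
    """The slow pattern: one COUNT query for each category in the sidebar."""
    categories = conn.execute("SELECT id, name FROM categories ORDER BY id").fetchall()
    queries = 1  # the category listing itself
    result = {}
    for cat_id, name in categories:
        (n,) = conn.execute(
            "SELECT COUNT(*) FROM posts WHERE category_id = ?", (cat_id,)
        ).fetchone()
        result[name] = n
        queries += 1
    return result, queries


def counts_single_query(conn):
    """Same sidebar, computed with one grouped query."""
    rows = conn.execute("""
        SELECT c.name, COUNT(p.id)
        FROM categories c LEFT JOIN posts p ON p.category_id = c.id
        GROUP BY c.id ORDER BY c.id
    """).fetchall()
    return dict(rows), 1


print(counts_one_query_per_category(conn))
print(counts_single_query(conn))
```

With three categories the difference is four queries versus one; with dozens of categories on every page view, the per-category version adds up quickly unless a cache absorbs it.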
As previously noted, we’ll be doing an emergency site upgrade ASAP. Once we’re upgraded to WordPress 2.5, I hope to deploy a rich set of back-end plug-ins. One of the caching ones will be among them.
Please subscribe to our feed!
Windows XP Service Pack 3
Microsoft announced SP3 for Windows XP today. This white paper gives an overview of its content. It will ship on April 29th through Windows Update. Many of the updates are related to improved management, which makes sense at this stage of the game for the OS. It also makes sense as an attempt to position the OS against the rising desktop Linux threat. I wanted to see what specific management-related updates were contained. They are:
- MMC (Microsoft Management Console) 3.0
- MSI (Microsoft Windows Installer) 3.1 v2
- BITS (Background Intelligent Transfer Service) 2.5
- NAP (Network Access Protection)
- Some changes in default security settings for System Center Essentials policies
This is all good but it seems to take a very System Center-centric view of Windows management. There may be some more third-party-friendly improvements in the complete list of updates contained, but the link provided by the white paper (Knowledge Base article 936929) doesn’t seem to work at this time.
This was announced by Chris Keroack, the release manager for SP3, on this forum.
I can’t resist the temptation to translate into common English a few selected sentences from the white paper:
“For customers with existing Windows XP installations, Windows XP SP3 fills gaps in the updates they might have missed—for example, by declining individual updates when using Automatic Updates…”
Translation: You obviously did not know what you were doing when you refused that update last year so we will now force it on you inside a bundle.
“Developing service packs for operating systems like Windows XP, which is nearing its end-of-sales period, is a standard practice, and Microsoft does this for the convenience of its customers and partners.”
Translation: Don’t get too excited, you will still eventually have to move to Vista.
“With few exceptions, Microsoft is not adding Windows Vista features to Windows XP through SP3. As noted earlier, one exception is the addition of NAP to Windows XP to help organizations running Windows XP to take advantage of new features in Windows Server 2008.”
Translation: We are not going to let this cut into the sales of Vista, except when not doing it cuts into the sales of Windows Server 2008.
[UPDATED 2008/4/29: Turns out it's not shipping today on Windows Update, as previously announced, because SP3 breaks Microsoft's own Retail Management System (RMS) application. As does Vista SP1. I wonder if other vendors can also ask Microsoft to hold a service pack if it breaks their application...]
DATAllegro finally has a blog
It took a lot of patient nagging, but DATAllegro finally has a blog. Based on the first post, I predict:
- DATAllegro’s blog will live up to CEO Stuart Frost’s talent for clear, interesting writing.
- Like a number of other vendor blogs — e.g., Netezza’s — DATAllegro’s will have infrequent but usually long posts.
The crunchiest part of the first post is probably
Another very important aspect of performance is ensuring sequential reads under a complex workload. Traditional databases do not do a good job in this area - even though some of the management tools might tell you that they are! What we typically see is that the combination of RAID arrays and intervening storage infrastructure conspires to break even large reads by the database into very small reads against each disk. The end result is that most large DW installations have very large arrays of expensive, high-speed disks behind them - and still suffer from poor performance.
I’ve pounded the table about sequential reads multiple times — including in a (DATAllegro-sponsored) white paper — but the point about misleading management tools is new to me.
Now if I could just get a production DATAllegro reference, I’d be completely happy …
Netezza pricing
In connection with the announcement of the Teradata 2500, I asked some Teradata competitors about pricing. Netezza’s response amounted to “We don’t disclose list pricing, but our cheapest system handles about 3 1/4 TB and sells for under $200K.” So Netezza’s actual pricing is well below the list price of the Teradata 2500.
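For what it is worth, the comparison can be made explicit with a little arithmetic. The sketch below takes Netezza’s two quoted figures at face value and compares them with the $125K per terabyte list price reported for the Teradata 2500. It assumes that both vendors count terabytes of user data the same way, which may not be the case.

```python
netezza_price_ceiling = 200_000  # "under $200K"
netezza_capacity_tb = 3.25       # "about 3 1/4 TB"
teradata_list_per_tb = 125_000   # Teradata 2500 list price per TB of user data

# Upper bound on Netezza's price per terabyte for its cheapest system.
netezza_per_tb = netezza_price_ceiling / netezza_capacity_tb
print(f"Netezza: under ${netezza_per_tb:,.0f} per TB")
print(f"Ratio to Teradata 2500 list: {netezza_per_tb / teradata_list_per_tb:.2f}")
```

That puts the Netezza entry system at roughly half the Teradata 2500 list price per terabyte, before any discounting on either side.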
Teradata introduces lower-cost appliances
After months of leaks, Teradata has unveiled its new lines of data warehouse appliances, raising the total number either from 1 to 3 (my view) or 0 to 2 (what you believe if you think Teradata wasn’t previously an appliance vendor). Most significant is the new Teradata 2500 series, meant to compete directly with the smaller data warehouse specialists. Highlights include:
- An oddly precise estimated capacity of “6.12 terabytes”/node (user data). This estimate is based on 30% compression, which is low by industry standards, and surely explains part of the price umbrella the Teradata 2500 is offering other vendors.
- $125K/TB of user data. Obviously, list pricing and actual pricing aren’t the same thing, and many vendors don’t even bother to disclose official price lists. But the Teradata 2500 seems more expensive than most smaller-vendor alternatives.
- Scalability up to 24 nodes (>140 TB).
- Full Teradata application-facing functionality. Some of Teradata’s rivals are still working on getting all of their certifications with tier-1 and tier-2 business intelligence tools. Teradata has a rich application ecosystem.
- What will be controversial performance, until customer-benchmark trends clearly emerge.
The Teradata 2500 is coming out of the chute with two customers – a new-customer retailer buying a single cabinet (i.e., 6.12 TB), and an existing customer for whom fewer details seem available. So far as I can tell, the sales force has had the product since late January, although the first leaks I got incorrectly suggested the system would only scale to a limited number of nodes.
Other products in the announcement included:
- The Teradata 5550, a routine annual upgrade to the Teradata 5500.
- The Teradata 550. This is a low-end, single-server SMP box introduced 9 or so months ago, originally meant for application development and testing. But some customers have been using it for deployment, and Teradata is now officially acknowledging that. It only scales to 2-3 TB of user data.
The Teradata 2500’s performance should be below the Teradata 5550’s for three reasons:
- More disk per node.
- Less CPU per node (2 cores vs. 4).
- The removal of some “workload management” performance features found in the 5500 series.
The same considerations apply to a comparison between the Teradata 2500 and the older Teradata 5000, but in that case they’re offset by a year of Moore’s Law benefit.
Teradata’s performance claims for the 2500, in essence, are:
- The 2500 is focused on decision-support applications, where all that workload-management stuff doesn’t matter as much.
- Although we can do additional things well our competitors can’t, we also rival them in performance in their sweet area, namely sequential/table-scan-oriented decision support.
- In fact, we beat them on lots of customer benchmarks.
- By the way, even the simplified workload management capability gives good concurrency when compared with what the little guys offer.
Teradata competitors’ stories are along the lines of:
- We clobber Teradata in customer benchmarks.
- Now they’re offering a system a lot slower than the ones we already beat.
DATAllegro offers a detailed critique of the Teradata 2500 based on pre-release information, both on functionality and the numbers. (E.g., they argue that 6.12 TB of user data counted the Teradata way isn’t as much as it sounds like; I’m checking on that.)
So what does this all mean? If the Teradata 2500 were as aggressively priced as I originally thought (my bad – I simply misheard their per-terabyte prices for absolute figures), this announcement would be a huge event. As matters stand – well, DBMS and other enterprise vendors’ “crippled” products don’t have a stellar history. I wouldn’t be surprised if, a year from now, we saw an upgraded Teradata 2500 series, with more aggressive pricing and features.
Alternatively: In the initial release, Teradata has chosen not to have any interoperability between the 5500, 2500, and 550 series. I think that should and perhaps will change, with the 55xx and 25xx working together in a hub/spoke manner. Otherwise, missing-features arguments like the one DATAllegro makes will be too compelling. For that matter, I wouldn’t be surprised if Teradata bought a smaller rival, in which case heterogeneous hub/spoke synchronization would be a really good idea as soon as they could implement it.
If hub/spoke integration is one feature I’d recommend Teradata get cracking on, the other – and even bigger – one is compression. All CPU/disk trade-offs notwithstanding, better compression is an obvious and big price/performance win.
It is now safe to steal my identity
Note to whoever stole the laptop of a Fidelity employee two years ago, with personal information (SSN and more) for everyone enrolled in HP’s retirement plan: it is now safe to make use of the information. Congratulations on being patient.
I received an email telling me that the “credit watch” service in which all affected HP employees (and ex-employees) were enrolled for free has expired. Of course, we are invited to start paying Equifax to keep it running. $65 per year (and that’s supposedly a discounted rate, mind you, half the “normal” price) to run a DB query once a week on my behalf. Not bad. I should be in that business.
In what ways is the lost data less dangerous two years later? The “1 or 2 years of free credit watch” offer that is typical after events such as security violations is obviously just a PR move to allow the guilty party to look like they are taking responsibility for their embarrassing display of incompetence. And it probably costs them very little, if anything, to provide this, considering how good a customer acquisition strategy it is for the “credit watch” department of the credit agencies. The fact that Fidelity and their peers don’t have to bear any real cost for this is the reason why it keeps happening.
If I sound a bit detached about this, it’s not that I am not worried about someone impersonating me by using my SSN and birth date. It’s just that I am not more worried about that specific laptop theft than I am about the hundreds of employees at medical offices, dental offices, insurance companies, banks, etc. that already have access to this information.
The solution is to publish every single SSN on a web site and stop pretending they can be used for authentication.
Kickfire kicks off
I chatted with Raj Cherabuddi and others on the Kickfire (formerly C2) team for over an hour on Monday, and now have a better sense of their story. There are some very basic questions I still don’t have answers to; I’ll fill those in when I can.
Highlights of what I have and haven’t figured out so far include:
- Kickfire’s technology has two main parts: A SQL co-processor chip and a MySQL storage engine.
- Kickfire makes a Type 0 appliance. If I understood correctly, it contains the chip, a couple of standard CPU cores, and 64 gigs of RAM. Or else it contains just the chip, and is meant to be hooked up to a 2U box with 64 gigs of RAM. I’m confused.
- The Kickfire box can handle up to 3 terabytes of user data. The disk required for that is 4-5 terabytes without redundancy, 2X with. Based on that formulation and other clues, I’m guessing Kickfire — unlike other appliance vendors — doesn’t build in storage itself.
- I don’t know whether the Kickfire chip is true custom silicon or an FPGA emulation.
- The essential idea of the chip is dataflow programming for SQL, with pipelining between operations. This eliminates the overhead of registers and context switching. I don’t know what the trade-offs are, if any.
- Kickfire’s database software is columnar, operating on compressed data even in RAM. In that, Kickfire’s story is most similar to Vertica’s, although I’m guessing Exasol may do something similar as well. Like Vertica, Kickfire uses multiple compression methods (they’re reluctant to give detail, but agreed it would be fair to say they use both something like dictionary/token and something like delta compression).
- Kickfire’s software is ACID-compliant. You can do incremental loads or trickle feeds. Bulk load speed is 100 Gb/hour. Kickfire’s solution for the traditional problem of updating column stores is called “snapshots.” Without giving details, they position that as similar to the Vertica solution.
- Like other MySQL storage engines, Kickfire inherits whatever data connectivity, stored procedure capabilities, user-defined functions ability, etc. that MySQL has.
- Kickfire has no paying customers, but does have a slide showing many logos of “prospects and beta customers.”
- Kickfire has no MPP capabilities at this time, but says adding those is “on the roadmap” and will be “easy.”
- Kickfire submitted a 100 GB TPC-H result, in which it beat the previous leaders — Exasol, ParAccel, and Microsoft – on price-performance, and lagged only Exasol and ParAccel on absolute performance. Kickfire is extremely proud of this. Indeed, I don’t recall another vendor ascribing that much weight to them in the entire history of TPCs.* Kickfire seems unfazed by the fact that its result is for a system listed with a ship date 6 months in the future (I’m guessing that’s the latest the TPC will allow), while the other results are for systems available today.
*Somebody (perhaps adman extraordinaire Rick Bennett?) may want to check my memory on this, but I think Oracle’s famed “Gentlemen, start your snails” ad in the early 1990s was about PC World tests, not TPCs. Oracle also had an ad about WW1-style planes nosediving, but I don’t think that one referenced TPCs either.
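For readers who haven’t met the two compression families named in the list above (“something like dictionary/token” and “something like delta”), here is a minimal Python sketch of the generic textbook versions. Kickfire gave no details, so nothing here describes their actual implementation; the function names and the sample column values are made up for illustration.

```python
from itertools import accumulate


def dictionary_encode(values):
    """Replace each value with a small integer token; return (dictionary, tokens)."""
    dictionary = []  # token -> original value
    token_of = {}    # original value -> token
    tokens = []
    for value in values:
        if value not in token_of:
            token_of[value] = len(dictionary)
            dictionary.append(value)
        tokens.append(token_of[value])
    return dictionary, tokens


def dictionary_decode(dictionary, tokens):
    """Look each token back up in the dictionary."""
    return [dictionary[t] for t in tokens]


def delta_encode(numbers):
    """Keep the first number, then store only each difference from the previous one."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(deltas):
    """A running sum of the deltas restores the original numbers."""
    return list(accumulate(deltas))


# Hypothetical column data, invented for this example.
states = ["CA", "NY", "CA", "CA", "TX", "NY"]
order_ids = [100017, 100018, 100021, 100025, 100026, 100030]

dictionary, tokens = dictionary_encode(states)
deltas = delta_encode(order_ids)
print(dictionary, tokens)  # ['CA', 'NY', 'TX'] [0, 1, 0, 0, 2, 1]
print(deltas)              # [100017, 1, 3, 4, 1, 4]
```

One reason schemes like these suit a column store that works on compressed data: an equality filter such as “state = 'CA'” can be answered by comparing tokens against 0 without decoding the column. Whether Kickfire does anything of the sort is not something the conversation established.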
Less bloat, more oxygen
I follow Coté for his coverage of the IT management market. He also covers the so-called RIA (“Rich Internet Application”) playground, so through his blog (e.g. this post today) I involuntarily get news and comments about Flash, AIR, Silverlight and other I-hate-the-Web technologies. And I keep thinking “I hope they won’t mess up the Web too much for the rest of us on their way down to failure”.
Every time I run into a “no Flash, no service” site, I have a flashback (if you think the pun is funny then consider it intended) to 1995. That’s when Jean-Michel Jarre (the French musician, of Oxygène fame) launched his first web site, jarre.net (now de-commissioned). As a pioneer of electronic music, it wasn’t surprising to see him be one of the first artists to use the Web. As someone who likes to illuminate entire cities with laser beams, it wasn’t surprising to see him use overkill technology. So his Photoshop-wielding consultant created an entire site where each page was just one big image, with embedded text. It took forever to load and the stupidity of the approach shocked me so much that I remember it 13 years later. All the links were based on server-side image maps (the x/y coordinates of the pixel that you clicked on get sent to the server where a map links these coordinates to a target URL). The way HTML was at the time, you couldn’t use fancy fonts, colored text and elaborate wrapping (but you could blink!). And we all know that you simply can’t provide dates and locations of upcoming concerts without colored text, twisted fonts and a fancy layout.
The Internet Archive doesn’t have a copy of this original Jarre site, and I don’t know if it has survived anywhere other than in my scarred-for-life brain. And if you go to JM Jarre’s current site, guess what? It is a Flash-only site. With my non-Flash Firefox all I get is a black page with a sentence (in French only, and not even grammatically correct) pointing me to the Flash download page. Looking at it with my Flash-enabled IE confirms (after a long wait for the Flash content to download) what I expected: other than a few videos (which could indeed use a simple Flash player embedded in the HTML page), there is no value whatsoever in using Flash for this site. The photos of his ’80s haircut would look just as good/bad in HTML.
Just like there are some uses for which image maps are appropriate, there are some for which Flash and friends are the right tool. But if they were only used where they belong, there wouldn’t be nearly as much hype around them. Poor Coté would have to spend more time with boring IT management geeks and less with Flash hipsters.



