DBMS2

Choices in data management and analysis

I’m having issues with comment spam

Wed, 2016-05-18 15:12

My blogs are having a bad time with comment spam. While Akismet and other safeguards are intercepting almost all of the ~5000 attempted spam comments per day, the small fraction that gets through still adds up to a large absolute number to deal with.

There’s some danger I’ll need to restrict comments here to combat it. (At the moment they’ve been turned off almost entirely on Text Technologies, which may be awkward if I want to put a post up there rather than here.) If I do, I’ll say so in a separate post. I apologize in advance for any inconvenience.


Some checklists for making technical choices

Mon, 2016-02-15 10:27

Whenever somebody asks for my help on application technology strategy, I start by trying to ascertain three things. The absolute first is actually a prerequisite to almost any kind of useful conversation, which is to ascertain in general terms what the hell it is that we are talking about. :)

My second goal is to ascertain technology constraints. Three common types are:

  • Compatible with legacy systems and/or enterprise standards.
  • Cheap, free and/or open source.
  • Proven, vetted by sufficiently many references, and/or generally having an “enterprise-y” reputation.

That’s often a short and straightforward discussion, except in those awkward situations when all three of my bullet points above are applicable at once.

The third item is usually more interesting. I try to figure out what is to be accomplished. That’s usually not a simple matter, because the initial list of goals and requirements is almost never accurate. It’s actually more common that I have to tell somebody to be more ambitious than that I need to rein them in.

Commonly overlooked needs include:

  • If you want to sell something and have happy users, you need a good UI.
  • You will also soon need tools and a UI for administration.
  • Customers demand low-latency/fresh data. Your explanation of why they don’t really need it doesn’t contradict the fact that they want it.
  • Providing data access and saying “You can hook up any BI tool you want and build charts” is not generally regarded as offering a good UI.
  • When “adding analytics” to something previously focused on short-request processing, it is common to underestimate the variety of things users will soon want to do. (One common reason for this under-estimate is that after years of being told it can’t be done, they’ve learned not to ask.)

And if you take one thing away from this post, then take this:

  • If you “know” exactly which features are or aren’t helpful to users, …
  • … and if you supply only what you “know” they should use, …
  • … then you will discover that what you “knew” wasn’t really accurate.

I guarantee it.

So far what I’ve said can be summarized as “Figure out what you’re trying to do, and what constraints there are on your choices for doing it.” The natural next step is to list the better-thought-of choices that meet your constraints, and — voila! — you have a short list. That’s basically correct, but there’s one significant complication.

Speaking of complications, what I’m portraying as a kind of linear/waterfall decision process of course usually involves lots of iteration, meandering around and general wheel-spinning. Real life is messy.

Simply put, there are many different kinds of application project. Other folks’ experience may not be as applicable to your case as you hope, because your case is different. So the rest of this post contains a checklist of distinctions among various different kinds of application project.

For starters, there are at least two major kinds of software development.

  • Many projects fit the traditional development model, elements of which are:
    • You — and this is very much a plural “you” — code something up more or less from scratch, using whatever language(s) and/or framework(s) you think make sense.
    • You break the main project into pieces in obvious ways (e.g. server back end vs. mobile front end), and then into further pieces for manageability.
    • There may also be database designs, test harnesses, connectors to other apps and so on.
  • But there are many other projects in which smaller bits of configuration and/or scripting are the essence of what you do.
    • This is particularly common in analytics, where there might be business intelligence tools, ETL tools, scripts running against Hadoop and so on. The original building of a data warehouse/hub/lake/reservoir may also fit this model.
    • It’s also what you do to get a major purchased packaged application into actual production.
    • It also is often what happens for websites that serve “content”.

Other significant distinctions include:

  • In-house vs. software-for-resale. If the developing organization is handing code to somebody else, then we’re probably talking about a more traditional kind of project. But if the whole thing is growing organically in-house, the script-spaghetti alternative may well be viable (in those projects for which it seems appropriate). Important subsidiary distinctions start with:
    • (If in-house) Truly in-house vs. out-sourced.
    • (If for resale) On-premises vs. SaaS. Or maybe not.
  • Kind(s) of analytics, if any. Technologies and development processes used can be very different depending upon whether the application features:
    • Business intelligence (not particularly real-time) as its essence.
    • Reporting or other BI as added functionality to an essentially operational app.
    • Low-latency BI, perhaps supported by (other) short-request processing.
    • Predictive model scoring.
  • The role(s) of the user(s). This influences how appealing and easy the UI needs to be.* Requirements are very different, for example, among:
    • Classic consumer-facing websites, with recommenders and so on.
    • Marketing websites targeted at a small group of business-to-business customers.
    • Data-sharing websites for existing consumer stakeholders.
    • Cheery benefits-information websites that the HR department wants employees to look at.
    • Purely internal apps meant to be used by (self-)important executives.
    • Internal apps meant to be used by line workers who will be given substantial training on them.
  • Certain kinds of application project stand almost separately from the rest of these considerations, because their starting point is legacy apps. Examples may be found among:
    • Migration/consolidation projects.
    • Refactoring projects.
    • Addition of incremental functionality.

*It also influences security, all good practices for securing internal apps notwithstanding.

Much also depends on the size and sophistication of the organization. What the “organization” is depends a bit on context:

  • In the case of software products, SaaS (Software as a Service) or other internet services, it is primarily the vendor. However …
  • … in B2B cases the sophistication of the customer organizations can also matter.
  • In the case of in-house enterprise development, there’s only one enterprise involved (duh). However …
  • … the “department” vs. “IT” distinction may be very important.

Specific considerations of this kind start with:

  • Is me-too functionality enough, or does the enterprise seek competitive advantage through technology?
  • What kinds of technical risk does it seem prudent and desirable to take?

And that, in a nutshell, is why strategizing about application technology is often more complicated than it first appears.


Kafka and more

Mon, 2016-01-25 05:28

In a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:

  • The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
  • “Schema Management”
    • Is used to define fields and so on.
    • Is not equivalent to what is ordinarily meant by schema validation, in that …
    • … it allows schemas to change, but puts constraints on which changes are allowed (a compatibility check is sketched just after this list).
    • Is done in plug-ins that live with the producer or consumer of data.
    • Is based on the Hadoop-oriented file format Avro.
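
To make the constrained-evolution idea concrete, here is a minimal sketch using Avro’s own compatibility checker. The record layout and field names are hypothetical; an actual schema registry applies rules of this general kind before accepting a new schema version.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;

    public class SchemaEvolutionSketch {
        public static void main(String[] args) {
            // Version 1 of a hypothetical "click" record.
            Schema v1 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
                + "{\"name\":\"url\",\"type\":\"string\"}]}");

            // Version 2 adds a field WITH a default value -- the kind of
            // constrained, backward-compatible change such tooling allows.
            Schema v2 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Click\",\"fields\":["
                + "{\"name\":\"url\",\"type\":\"string\"},"
                + "{\"name\":\"referrer\",\"type\":\"string\",\"default\":\"\"}]}");

            // Avro itself can verify that v2 readers still decode v1 data.
            SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
            System.out.println(result.getType()); // prints COMPATIBLE
        }
    }

Dropping a field without a default, by contrast, would come back INCOMPATIBLE, which is exactly the sort of change the plug-ins are there to refuse.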

Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products; a small example follows the list below.

  • Per Confluent/Kafka honcho Jay Kreps, the companion is generally Spark Streaming, Storm or Samza, in declining order of popularity, with Samza running a distant third.
  • Jay estimates that there’s such a companion product at around 50% of Kafka installations.
  • Conversely, Jay estimates that around 80% of Spark Streaming, Storm or Samza users also use Kafka. On the one hand, that sounds high to me; on the other, I can’t quickly name a counterexample, unless Storm originator Twitter is one such.
  • Jay’s views on the Storm/Spark comparison include:
    • Storm is more mature than Spark Streaming, which makes sense given their histories.
    • Storm’s distributed processing capabilities are more questionable than Spark Streaming’s.
    • Spark Streaming is generally used by folks in the heavily overlapping categories of:
      • Spark users.
      • Analytics types.
      • People who need to share stuff between the batch and stream processing worlds.
    • Storm is generally used by people coding up more operational apps.

Even recognizing that Jay’s interests are obviously streaming-centric, this distinction maps pretty well to the three use cases Cloudera recently called out.

Complicating this discussion further is Confluent 2.1, which is expected late this quarter. Confluent 2.1 will include, among other things, a stream processing layer that works differently from any of the alternatives I cited, in that:

  • It’s a library running in client applications that can interrogate the core Kafka server, rather than …
  • … a separate thing running on a separate cluster.

The library will do joins, aggregations and so on, while relying on core Kafka for information about process health and the like. Jay sees this as more of a competitor to Storm in operational use cases than to Spark Streaming in analytic ones.
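
As a purely illustrative sketch of the library-rather-than-cluster idea (this layer hadn’t shipped at the time of writing, and the code below follows what eventually shipped as Kafka Streams rather than anything then announced), here is what an embedded aggregation looks like; the topic names and word-count logic are my own assumptions.

    import java.util.Arrays;
    import java.util.Properties;

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class EmbeddedStreamSketch {
        public static void main(String[] args) {
            // Ordinary client-side configuration; there is no separate cluster.
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "embedded-stream-sketch");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            // An aggregation expressed in the client library, relying on core
            // Kafka for partitioning, offsets and fault tolerance.
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> lines = builder.stream("text-input"); // assumed topic
            KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .groupBy((key, word) -> word)
                .count();
            counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }

Note that there is no cluster submission step: this is an ordinary Java program, and its parallelism comes from Kafka’s own partitioning rather than from a separate processing cluster.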

We didn’t discuss other Confluent 2.1 features much, and frankly they all sounded to me like items from the “You mean you didn’t have that already??” list any young product has.

