DBMS2

Choices in data management and analysis

Challenges in anomaly management

Sun, 2016-06-05 12:35

As I observed yet again last week, much of analytics is concerned with anomaly detection, analysis and response. I don’t think anybody understands the full consequences of that fact,* but let’s start with some basics.

*me included

An anomaly, for our purposes, is a data point or more likely a data aggregate that is notably different from the trend or norm. If I may oversimplify, there are three kinds of anomalies:

  • Important signals. Something is going on, and it matters. Somebody — or perhaps just an automated system — needs to know about it. Time may be of the essence.
  • Unimportant signals. Something is going on, but so what?
  • Pure noise. Even a fair coin flip can have long streaks of coming up “heads”.
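
To make the "pure noise" point concrete, here is a minimal sketch that computes how likely a fair coin is to produce a streak of heads somewhere in a sequence of flips. The function name and parameters are my own illustrative choices, not anything from a particular product.

```python
def prob_streak(n_flips, streak_len):
    """Probability that n_flips fair coin flips contain at least one
    run of streak_len consecutive heads."""
    # state[j] = probability that no qualifying streak has occurred yet
    # and the sequence currently ends in exactly j heads.
    state = [0.0] * streak_len
    state[0] = 1.0
    for _ in range(n_flips):
        new_state = [0.0] * streak_len
        for run, p in enumerate(state):
            new_state[0] += p * 0.5          # tails resets the run
            if run + 1 < streak_len:
                new_state[run + 1] += p * 0.5  # heads extends the run
            # otherwise heads completes a streak and the mass drops out
        state = new_state
    return 1.0 - sum(state)


if __name__ == "__main__":
    print(prob_streak(100, 5))
```

For 100 flips and a streak of 5, the result is roughly 0.8, so a "surprising" streak is in fact the expected case. A detector that fires on every such streak is mostly reporting noise.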

Two major considerations are:

  • Whether the recipient of a signal can do something valuable with the information.
  • How “costly” it is for the recipient to receive an unimportant signal or other false positive.

What I mean by the latter point is:

  • Something that sets a cell phone buzzing had better be important, to the phone’s owner personally.
  • But it may be OK if something unimportant changes one small part of a busy screen display.
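
One way to picture that cost difference is a routing rule in which more interruptive channels demand a higher importance score. This is only a sketch under assumptions: the `Alert` class, the channel names and the thresholds are all hypothetical, and it presumes some upstream process has already produced an importance score.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    message: str
    importance: float  # hypothetical score in [0, 1] from an upstream model


# Hypothetical channels, ordered from most to least interruptive,
# each with the minimum importance it should require.
CHANNELS = [
    ("phone_buzz", 0.9),
    ("dashboard_tile", 0.3),
]


def route(alert):
    """Return the most interruptive channel the alert qualifies for,
    or None if it should be suppressed entirely."""
    for channel, min_importance in CHANNELS:
        if alert.importance >= min_importance:
            return channel
    return None
```

The hard part, of course, is the score itself. The routing is trivial once you trust it.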

Anyhow, the Holy Grail* of anomaly management is a system that sends the right alerts to the right people, and never sends them wrong ones. And the quest seems about as hard as that for the Holy Grail, although this one uses more venture capital and fewer horses.

*The Holy Grail, in legend, was found by 1-3 knights: Sir Galahad (in most stories), Sir Percival (in many), and Sir Bors (in some). Leading vendors right now are perhaps around the level of Sir Kay.

Difficulties in anomaly management technology include:

  • Performance is a major challenge. Ideally, you’re running statistical tests on all data — at least on all fresh data — at all times.
  • User experiences are held to high standards.
    • False negatives are very bad.
    • False positives can be very annoying.
    • Robust role-based alert selection is often needed.
    • So are robust visualization and drilldown.
  • Data quality problems can look like anomalies. In some cases, bad data screws up anomaly detection, by causing false positives. In others, it’s just another kind of anomaly to detect.
  • Anomalies are inherently surprising. We don’t know in advance what they’ll be.

Consequences of the last point include:

  • It’s hard to tune performance when one doesn’t know exactly how the system will be used.
  • It’s hard to set up role-based alerting if one doesn’t know exactly what kinds of alerts there will be.
  • It’s hard to choose models for the machine learning part of the system.
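
For a sense of what "running statistical tests on fresh data" can mean at its simplest, here is a sketch that compares each new value against a trailing window using a z-score. It is one basic technique among many, not a description of any vendor's method. The window size and threshold are arbitrary assumptions.

```python
from collections import deque
from statistics import mean, stdev


def flag_anomalies(values, window=20, threshold=3.0):
    """Return indexes of values that lie more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip the test when the window is perfectly flat.
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                flagged.append(i)
        history.append(x)  # the new value joins the baseline either way
    return flagged
```

Note a design consequence: once an outlier enters the window, it inflates the standard deviation and can mask later anomalies until it ages out. That is a small instance of the model-choice difficulty listed above.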

Donald Rumsfeld’s distinction between “known unknowns” and “unknown unknowns” is relevant here, although it feels wrong to mention Rumsfeld and Sir Galahad in the same post.

And so a reasonable summary of my views might be:

Anomaly management is an important and difficult problem. So far, vendors have done a questionable job of solving it.

But there’s a lot of activity, which I look forward to writing about in considerable detail.

Related link

  • The most directly relevant companies I’ve written about are probably Rocana and Splunk.
Categories: Other

Adversarial analytics and other topics

Mon, 2016-05-30 05:15

Five years ago, in a taxonomy of analytic business benefits, I wrote:

A large fraction of all analytic efforts ultimately serve one or more of three purposes:

  • Marketing
  • Problem and anomaly detection and diagnosis
  • Planning and optimization

That continues to be true today. Now let’s add a bit of spin.

1. A large fraction of analytics is adversarial. In particular:

  • Many of the analytics companies I talk with tell me that they have important use cases in security, anti-fraud or both.
  • Click fraud steals a large fraction of the revenue in online advertising and other promotion. Combating it is a major application need.
  • Spam is another huge, ongoing fight.
    • When Google et al. fight web spammers — which is of course a great part of what web search engine developers do — they’re engaged in adversarial information retrieval.
    • Blog comment spam is still a problem, even though the vast majority of instances can now be caught.
    • Ditto for email.
  • There’s an adversarial aspect to algorithmic trading. You’re trying to beat other investors. What’s more, they’re trying to identify your trading activity, so you’re trying to obscure it. Etc.
  • Unfortunately, unfree countries can deploy analytics to identify attempts to evade censorship. I plan to post much more on that point soon.
  • Similarly, de-anonymization can be adversarial.
  • Analytics supporting national security often have an adversarial aspect.
  • Banks deploy analytics to combat money-laundering.

Adversarial analytics are inherently difficult, because your adversary actively wants you to get the wrong answer. Approaches to overcome the difficulties include:

  • Deploying lots of data. Email spam was only defeated by large providers who processed lots of email and hence could see when substantially the same email was sent to many victims at once. (By the way, that’s why “spear-phishing” still works. Malicious email sent to only one or a few victims still can’t be stopped.)
  • Using unusual analytic approaches. For example, graph analytics are used heavily in adversarial situations, even though they have lighter adoption otherwise.
  • Using many analytic tests. For example, Google famously has 100s (at least) of sub-algorithms contributing to its search rankings. The idea here is that even the cleverest adversary might find it hard to perfectly simulate innocent behavior.
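
The "lots of data" point about email can be sketched as follows: fingerprint each message body, then look for fingerprints that reach many distinct recipients. This is a deliberately naive illustration. The function names and the `min_recipients` threshold are my assumptions, and real providers would need fuzzier matching than an exact hash of normalized text.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(body):
    """Hash a message body after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", body.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_bulk_messages(messages, min_recipients=3):
    """messages: iterable of (recipient, body) pairs.
    Return {fingerprint: set_of_recipients} for bodies seen by at
    least min_recipients distinct recipients."""
    recipients_by_print = defaultdict(set)
    for recipient, body in messages:
        recipients_by_print[fingerprint(body)].add(recipient)
    return {
        fp: rcpts
        for fp, rcpts in recipients_by_print.items()
        if len(rcpts) >= min_recipients
    }
```

A message sent to a single victim never crosses the threshold, which is the spear-phishing problem in miniature.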

2. I was long a skeptic of “real-time” analytics, although I always made exceptions for a few use cases. (Indeed, I actually used a form of real-time business intelligence when I entered the private sector in 1981, namely stock quote machines.) Recently, however, the stuff has gotten more-or-less real. And so, in a post focused on data models, I highlighted some use cases, including:

  • It is increasingly common for predictive decisions to be made at [real-timeish] speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
  • The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
  • … most such analysis should look at historical data as well.
  • Streaming technology is supplying ever more fresh data.

Let’s now tie those comments into the analytic use case trichotomy above. From the standpoint of mainstream (or early-life/future-mainstream) analytic technologies, I think much of the low-latency action is in two areas:

  • Recommenders/personalizers.
  • Monitoring and troubleshooting networked equipment. This is generally an exercise in anomaly detection and interpretation.

Beyond that:

  • At sufficiently large online companies, there’s a role for low-latency marketing decision support.
  • Low-latency marketing-oriented BI can also help highlight system malfunctions.
  • Investments/trading has a huge low-latency aspect, but that’s somewhat apart from the analytic mainstream. (And it doesn’t fit well into my trichotomy anyway.)
  • Also not in the analytic mainstream are the use cases for low-latency (re)planning and optimization.

Related links

My April, 2015 post Which analytic technology problems are important to solve for whom? has a round-up of possibly relevant links.

Categories: Other

Surveillance data in ordinary law enforcement

Wed, 2016-05-18 22:45

One of the most important issues in privacy and surveillance is also one of the least-discussed — the use of new surveillance technologies in ordinary law enforcement. Reasons for this neglect surely include:

  • Governments, including in the US, lie about this subject a lot. Indeed, most of the reporting we do have is exposure of the lies.
  • There’s no obvious technology industry ox being gored. What I wrote in another post about Apple, Microsoft et al. upholding their customers’ rights doesn’t have a close analogue here.

One major thread in the United States is:

  • The NSA (National Security Agency) collects information on US citizens. It turns a bunch of this over to the “Special Operations Division” (SOD) of the Drug Enforcement Administration (DEA).
  • The SOD has also long collected its own clandestine intelligence.
  • The SOD turns over information to the DEA, FBI (Federal Bureau of Investigation), IRS (Internal Revenue Service) and perhaps also other law enforcement agencies.
  • The SOD mandates that the recipient agencies lie about the source of the information, even in trials and court filings. This is called “parallel construction”, in that the nature of the lie is to create another supposed source for the original information, which has the dual virtues of:
    • Making it look like the information was obtained by allowable means.
    • Protecting confidentiality of the information’s true source.
  • There is a new initiative to allow the NSA to share more surveillance information on US citizens with other agencies openly, thus reducing the “need” to lie, and hopefully gaining efficiency/effectiveness in information-sharing as well.

Similarly, StingRay devices that intercept cell phone calls (and thus potentially degrade service) are used by local police departments, who then engage in “parallel construction” for several reasons, one simply being an NDA with manufacturer Harris Corporation.

Links about these and other surveillance practices are below.

At this point we should note the distinction between intelligence/leads and admissible evidence.

  • Intelligence (or leads) is any information that can be used to point law enforcement or security forces at people who either plan to do or already have done unlawful and/or very harmful things.
  • Admissible evidence is information that can legally be used to convict people of crimes or otherwise bring down penalties and sanctions upon them.

I won’t get into the minutiae of warrants, subpoenas, probable cause and all that, but let’s just say:

  • In theory there’s a semi-bright line between intelligence and admissible evidence; i.e., there’s some blurring, but in most cases the line can be pretty easily seen.
  • In practice there’s a lot of blurring. Parallel construction is only one of the ways the semi-bright line gets scuffed over.
  • Even so, this distinction has great value. The number of people who have been badly harmed in the US by inappropriate use of inadmissible intelligence isn’t very high …
  • … yet.

“Yet” is the key word. My core message in this post is that — despite the lack of catastrophe to date — the blurring of the intelligence/evidence line needs to be greatly reversed:

Going forward, the line between intelligence and admissible evidence needs to be established and maintained in a super-bright state.

As you may recall, I’ve said that for years, in a variety of different phrasings. Still, it’s a big enough deal that I feel I should pound the table about it from time to time — especially now, when public policy in other aspects of surveillance is going pretty well, but this area is headed for disaster. My argument for this view can be summarized in two bullet points:

  • Massive surveillance is inevitable.
  • Unless the uses of the resulting information are VERY limited, freedoms will be chilled into oblivion.

I recapitulate the chilling effects argument frequently, so for the rest of this post let’s focus on the first bullet point. Massive surveillance will be a fact of life for reasons including:

  • As a practical political matter, domestic surveillance will be used at least for anti-terrorism. If you doubt that — please just consider the number of people who support Donald Trump.
  • Actually, the constituency for anti-terrorism surveillance is much more than just the paranoid idiots. Indeed — and notwithstanding the great excesses of anti-terrorism propaganda around the world — that constituency includes me. :) My reasons start:
    • In a country of well over 300 million people, there probably are a few who are both crazy and smart enough to launch Really Bad Attacks. Stopping them before they act is a Very Good Idea.
    • The alternative is security — or more likely security theater — measures that are intrusive across the board. I like unfettered freedom of movement, for example. But I can barely stand the TSA (Transportation Security Administration).
  • Commercial “surveillance” is intense. And it’s essential to the internet economy.

And so I return to the point I’ve been making for years: Surveillance WILL happen. So the use of surveillance information needs to be tightly limited.

Related links:

  • Reason’s recent rant about parallel construction contains a huge number of links. Ditto a calmer Radley Balko blog for the Washington Post. (March, 2016)
  • Reuters gave details of the SOD’s thou-shalt-lie mandates in August, 2013.
  • If you have a clearance and work in the civilian sector, you may be subject to 24/7 surveillance, aka continuous evaluation, for fear that you might be the next Ed Snowden. (March, 2016)
  • License plate scanning databases are already a big deal in law enforcement. (October, 2015)
  • StingRay-type devices are powerful, and have been for quite a few years. They’re really powerful. Procedures related to StingRay surveillance are in flux. (2015)
  • Chilling effects are real. (April, 2016)
  • At least one federal court has decided that tracking URLs visited without a warrant is an illegal wiretap. Other courts think your URL visits, shopping history, etc. are fair game. (November, 2015)
  • Pakistan in effect bugged citizens’ cell phones to track their movements and force polio vaccines on them. (November, 2015)
  • This is not totally on-topic, but it does support worries about what the government can do with surveillance-based analytics — law enforcement can wildly exaggerate the significance of its “scientific” evidence, and gain bogus convictions as a result. (2015-2016).
  • The Electronic Frontier Foundation offers a dated but fact-filled overview of NSA domestic spying (2012-2013).
Categories: Other

Governments vs. tech companies — it’s complicated

Wed, 2016-05-18 22:42

Numerous tussles fit the template:

  • A government wants access to data contained in one or more devices (mobile/personal or server as the case may be).
  • The computer’s manufacturer or operator doesn’t want to provide it, for reasons including:
    • That’s what customers prefer.
    • That’s what other governments require.
    • Being pro-liberty is the right and moral choice. (Yes, right and wrong do sometimes actually come into play. :) )

As a general rule, what’s best for any kind of company is — pricing and so on aside — whatever is best or most pleasing for their customers or users. This would suggest that it is in tech companies’ best interest to favor privacy, but there are two important quasi-exceptions:

  • Recommendation/personalization. E-commerce and related businesses rely heavily on customer analysis and tracking.
  • When the customer is the surveiller. Governments pay well for technology that is used to watch over their citizens.

I used the “quasi-” prefix because screwing the public is risky, especially in the long term.

Something that is not even a quasi-exception to the tech industry’s actual or potential pro-privacy bias is governmental mandates to let their users be watched. In many cases, governments compel privacy violations, by threat of severe commercial or criminal penalties. Tech companies should and often do resist these mandates as vigorously as they can, in the courts and/or via lobbying as the case may be. Yes, companies have to comply with the law. However, it’s against their interests for the law to compel privacy violations, because those make their products and services less appealing.

The most visible example of all this right now is the FBI/Apple kerfuffle. To borrow a phrase — it’s complicated. Among other aspects:

  • Syed Rizwan Farook, one of the San Bernardino terrorist murderers, had 3 cell phones. He carefully destroyed his 2 personal phones before his attack, but didn’t bother with his iPhone from work.
  • Notwithstanding this clue that the surviving phone contained nothing of interest, the FBI wanted to unlock it. It needed technical help to do so.
  • The FBI got a court order commanding Apple’s help. Apple refused and appealed the order.
  • The FBI eventually hired a third party to unlock Farook’s phone, for a price that was undisclosed but >$1.3 million.
  • Nothing of interest was found on the phone.
  • Stories popped up of the FBI asking for Apple’s help unlocking numerous other iPhones. The courts backed Apple or not depending on how they interpreted the All Writs Act. The All Writs Act was passed in the first-ever session of the US Congress, in 1789, and can reasonably be assumed to reflect all the knowledge that the Founders possessed about mobile telephony.
  • It’s widely assumed that the NSA could have unlocked the phones for the FBI — but it didn’t.

Russell Brandom of The Verge collected links explaining most of the points above.

With that as illustration, let’s go to some vendor examples:

All of these cases seem consistent with my comments about vendors’ privacy interests above.

Bottom line: The technology industry is correct to resist government anti-privacy mandates by all means possible.

Categories: Other

Privacy and surveillance require our attention

Wed, 2016-05-18 22:41

This year, privacy and surveillance issues have been all over the news. The most important, in my opinion, deal with the tension among:

  • Personal privacy.
  • Anti-terrorism.
  • General law enforcement.

More precisely, I’d say that those are the most important in Western democracies. The biggest deal worldwide may be China’s movement towards an ever-more-Orwellian surveillance state.

The main examples on my mind — each covered in a companion post — are:

Legislators’ thinking about these issues, at least in the US, seems to be confused but relatively nonpartisan. Support for these assertions includes:

I do think we are in for a spate of law- and rule-making, especially in the US. Bounds on the possible outcomes likely include:

  • Governments will retain broad powers for anti-terrorism. If there was any remaining doubt, the ISIS/ISIL/Daesh-inspired threats guarantee that surveillance will be intense.
  • Little will happen in the US to clip the wings of internet personalization/recommendation. To a lesser extent, that’s probably true in other developed countries as well.
  • Non-English-speaking countries will maintain data sovereignty safeguards, both out of genuine fear of (especially) US snooping and as a pretext to support their local internet/cloud service providers.

As always, I think that the eventual success or failure of surveillance regulation will depend greatly on the extent to which it accounts for chilling effects. The gravity of surveillance’s longer-term dangers is hard to overstate, yet they still seem broadly overlooked. So please allow me to reiterate what I wrote in 2013: surveillance + analytics can lead to very chilling effects.

When government — or an organization such as your employer, your insurer, etc. — watches you closely, it can be dangerous to deviate from the norm. Even the slightest non-conformity could have serious consequences.

And that would be a horrific outcome.

So I stand by my privacy policy observations and prescriptions from the same year:

… direct controls on surveillance … are very weak; government has access to all kinds of information. … And they’re going to stay weak. … Consequently, the indirect controls on surveillance need to be very strong, for they are what stands between us and a grim authoritarian future. In particular:

  • Governmental use of private information needs to be carefully circumscribed, including in most aspects of law enforcement.
  • Business discrimination based on private information needs in most cases to be proscribed as well.

The politics of all this is hard to predict. But I’ll note that in the US:

  • There’s an emerging consensus that the criminal justice system is seriously flawed, on the side of harshness. However …
  • … criminal justice reform is typically very slow.
  • The libertarian movement (Ron Paul, Rand Paul, aspects of the Tea Party folks, etc.) seems to have lost steam.
  • The courts cannot be relied upon to be consistent. Questions about Supreme Court appointments even aside, Fourth Amendment jurisprudence in the US has long been confusing and confused.
  • Few legislators understand technology.

Realistically, then, the main plausible path to a good outcome is that the technology industry successfully pushes for one. That’s why I keep writing about this subject in what is otherwise a pretty pure technology blog.

Bottom line: The technology industry needs to drive privacy/surveillance public policy in directions that protect individual liberties. If it doesn’t, we’re all screwed.

Categories: Other

I’m having issues with comment spam

Wed, 2016-05-18 15:12

My blogs are having a bad time with comment spam. While Akismet and other safeguards are intercepting almost all of the ~5000 attempted spam comments per day, the small fraction that get through are still a large absolute number to deal with.

There’s some danger I’ll need to restrict comments here to combat it. (At the moment they’ve been turned off almost entirely on Text Technologies, which may be awkward if I want to put a post up there rather than here.) If I do, I’ll say so in a separate post. I apologize in advance for any inconvenience.

Categories: Other

Some checklists for making technical choices

Mon, 2016-02-15 10:27

Whenever somebody asks for my help on application technology strategy, I start by trying to ascertain three things. The absolute first is actually a prerequisite to almost any kind of useful conversation, which is to ascertain in general terms what the hell it is that we are talking about. :)

My second goal is to ascertain technology constraints. Three common types are:

  • Compatible with legacy systems and/or enterprise standards.
  • Cheap, free and/or open source.
  • Proven, vetted by sufficiently many references, and/or generally having an “enterprise-y” reputation.

That’s often a short and straightforward discussion, except in those awkward situations when all three of my bullet points above are applicable at once.

The third item is usually more interesting. I try to figure out what is to be accomplished. That’s usually not a simple matter, because the initial list of goals and requirements is almost never accurate. It’s actually more common that I have to tell somebody to be more ambitious than that I need to rein them in.

Commonly overlooked needs include:

  • If you want to sell something and have happy users, you need a good UI.
  • You will also soon need tools and a UI for administration.
  • Customers demand low-latency/fresh data. Your explanation of why they don’t really need it doesn’t contradict the fact that they want it.
  • Providing data access and saying “You can hook up any BI tool you want and build charts” is not generally regarded as offering a good UI.
  • When “adding analytics” to something previously focused on short-request processing, it is common to underestimate the variety of things users will soon want to do. (One common reason for this under-estimate is that after years of being told it can’t be done, they’ve learned not to ask.)

And if you take one thing away from this post, then take this:

  • If you “know” exactly which features are or aren’t helpful to users, …
  • … and if you supply only what you “know” they should use, …
  • … then you will discover that what you “knew” wasn’t really accurate.

I guarantee it.

So far what I’ve said can be summarized as “Figure out what you’re trying to do, and what constraints there are on your choices for doing it.” The natural next step is to list the better-thought-of choices that meet your constraints, and — voila! — you have a short list. That’s basically correct, but there’s one significant complication.

Speaking of complications, what I’m portraying as a kind of linear/waterfall decision process of course usually involves lots of iteration, meandering around and general wheel-spinning. Real life is messy.

Simply put, there are many different kinds of application project. Other folks’ experience may not be as applicable to your case as you hope, because your case is different. So the rest of this post contains a checklist of distinctions among various different kinds of application project.

For starters, there are at least two major kinds of software development.

  • Many projects fit the traditional development model, elements of which are:
    • You — and this is very much a plural “you” — code something up more or less from scratch, using whatever language(s) and/or framework(s) you think make sense.
    • You break the main project into pieces in obvious ways (e.g. server back end vs. mobile front), and then into further pieces for manageability.
    • There may also be database designs, test harnesses, connectors to other apps and so on.
  • But there are many other projects in which smaller bits of configuration and/or scripting are the essence of what you do.
    • This is particularly common in analytics, where there might be business intelligence tools, ETL tools, scripts running against Hadoop and so on. The original building of a data warehouse/hub/lake/reservoir may also fit this model.
    • It’s also what you do to get a major purchased packaged application into actual production.
    • It also is often what happens for websites that serve “content”.

Other significant distinctions include:

  • In-house vs. software-for-resale. If the developing organization is handing code to somebody else, then we’re probably talking about a more traditional kind of project. But if the whole thing is growing organically in-house, the script-spaghetti alternative may well be viable (in those projects for which it seems appropriate). Important subsidiary distinctions start with:
    • (If in-house) Truly in-house vs. out-sourced.
    • (If for resale) On-premises vs. SaaS. Or maybe not.
  • Kind(s) of analytics, if any. Technologies and development processes used can be very different depending upon whether the application features:
    • Business intelligence (not particularly real-time) as its essence.
    • Reporting or other BI as added functionality to an essentially operational app.
    • Low-latency BI, perhaps supported by (other) short-request processing.
    • Predictive model scoring.
  • The role(s) of the user(s). This influences how appealing and easy the UI needs to be.* Requirements are very different, for example, among:
    • Classic consumer-facing websites, with recommenders and so on.
    • Marketing websites targeted at a small group of business-to-business customers.
    • Data-sharing websites for existing consumer stakeholders.
    • Cheery benefits-information websites that the HR department wants employees to look at.
    • Purely internal apps meant to be used by (self-)important executives.
    • Internal apps meant to be used by line workers who will be given substantial training on them.
  • Certain kinds of application project stand almost separately from the rest of these considerations, because their starting point is legacy apps. Examples may be found among:
    • Migration/consolidation projects.
    • Refactoring projects.
    • Addition of incremental functionality.

*It also influences security, all good practices for securing internal apps notwithstanding.

Much also depends on the size and sophistication of the organization. What the “organization” is depends a bit on context:

  • In the case of software products, SaaS (Software as a Service) or other internet services, it is primarily the vendor. However …
  • … in B2B cases the sophistication of the customer organizations can also matter.
  • In the case of in-house enterprise development, there’s only one enterprise involved (duh). However …
  • … the “department” vs. “IT” distinction may be very important.

Specific considerations of this kind start:

  • Is me-too functionality enough, or does the enterprise seek competitive advantage through technology?
  • What kinds of technical risk does it seem prudent and desirable to take?

And that, in a nutshell, is why strategizing about application technology is often more complicated than it first appears.

Related links

Categories: Other