data Artisans and Flink basics start:
- Flink is an Apache project sponsored by the Berlin-based company data Artisans.
- Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
- Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.
Like many open source projects, Flink seems to have been partly inspired by a Google paper.
To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:
- The first line of Flink code dates back to 2010.
- data Artisans and the Flink open source project both started in 2014.
- When I met him in late June, Kostas told me that Data Artisans had raised $7 million and had 15 employees.
- Flink’s current SQL support is very minor.
Per Kostas, about half of Flink committers are at Data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).
The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:
- The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
- In this case, the Flink folks feel that modularity supports efficiency.
- Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
- Windowing and so on operate separately from basic “transport” as well.
- The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
- Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- Alot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.
The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.
The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:
- “Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
- Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)
We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.
Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.
As for Flink use cases, they’re about what you’d expect:
- Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
- Plenty of stream processing.
But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.
Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:
- What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
- Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
- Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.
Ali also walked me through customer use cases and adoption in wonderful detail. In general:
- A large majority of Databricks customers have machine learning use cases.
- Predicting and preventing user/customer churn is a huge issue across multiple market sectors.
The story on those sectors, per Ali, is:
- First, Databricks penetrated ad-tech, for use cases such as ad selection.
- Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
- Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
- Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
- Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.
And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first that “breaks”/changes the APIs; hence the release number. It is now the case that:
- There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/DataSets. In this API …
- … everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
- Based on this, Ali thinks Spark 2.0 is now really a streaming engine.
Other tidbits included:
- Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
- They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
- Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
- In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.
I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.
On the customer side:
- DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
- This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.
Customers in large numbers want cloud capabilities, as a potential future if not a current need.
One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.
Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are:
- Graph capabilities.
- Cassandra 3.0, which includes a complete storage engine rewrite.
- Tiered storage/ILM (Information Lifecycle Management).
- Policy-based replication.
Some of that terminology is mine, but perhaps my clients at DataStax will adopt it too.
We didn’t go into as much technical detail as I ordinarily might, but a few notes on that tiered storage/ILM bit are:
- It’s a way to have some storage that’s more expensive (e.g. flash) and some that’s cheaper (e.g. spinning disk). Duh.
- Since Cassandra has a strong time-series orientation, it’s easy to imagine how those policies might be specified.
- Technologically, this is tightly integrated with Cassandra’s compaction strategy.
DataStax Enterprise 5 also introduced policy-based replication features, not all of which are in open source Cassandra. Data sovereignty/geo-compliance is improved, which is of particular importance in financial services. There’s also hub/spoke replication now, which seems to be of particular value in intermittently-connected use cases. DataStax said the motivating use case in that area was oilfield operations, where presumably there are Cassandra-capable servers at all ends of the wide-area network.
During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:
- Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
- A lot of this is via a focus on simplified APIs, based on
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
- There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
- SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-S queries.
The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:
- Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
- Databricks has been working on security, and even on the associated certifications.
Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.
- Some of the reason is that SparkSQL is too immature to be great.
- Some is annoyance that Databricks isn’t putting everything it has into open source.
- Some is that everything has its architectural trade-offs.
To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.
I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:
- Spark is indeed the replacement for Hadoop MapReduce.
- Spark is becoming the default platform for machine learning.
- SparkSQL (nee’ Shark) is puttering along predictably.
- Databricks reports good success in its core business of cloud-based machine learning support.
- Spark Streaming has strong adoption, but its position is at risk.
- Databricks, the original authority on Spark, is not keeping a tight grip on that role.
I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.
Spark is the replacement for Hadoop MapReduce.
This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.
The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy:
- Data-transformation-only use cases are important, but they don’t dominate.
- Most other use cases typically have a data transformation element as well …
- … which has to be started before any other work can be done.
And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.
Spark is becoming the default platform for machine learning.
Largely, this is a corollary of:
- The previous point.
- The fact that Spark was originally designed with machine learning as its principal use case.
To do machine learning you need two things in your software:
- A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
- Support for machine learning workflows. That’s where Spark evidently stands alone.
And thus I have conversations like:
- “Are you doing anything with Spark?”
- “We’ve gotten more serious about machine learning, so yes.”
SparkSQL (nee’ Shark) is puttering along.
SparkSQL is pretty much following the Hive trajectory.
- Useful from Day One as an adjunct to other kinds of processing.
- A tease and occasionally useful as a SQL engine for its own sake, but really not very good, pending years to mature.
Databricks reports good success in its core business of cloud-based machine learning support.
Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:
- As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
- Databricks customers typically already have their data in the Amazon cloud.
- Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in the segments:
- Financial services (especially but not only insurance)
- Health care/pharma
- The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.
Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.
Spark Streaming’s long-term success is not assured.
To a first approximation, things look good for Spark Streaming.
- Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
- The “traditional” alternatives of Storm and Samza are pretty much done.
- Newer alternatives from Twitter, Confluent and Flink aren’t yet established.
- Cloudera is a big fan of Spark Streaming.
- Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
- Cool new Spark Streaming technology is coming out.
But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?
Databricks is not keeping a tight grip on Spark leadership.
- Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
- Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed it would be a big revenue opportunity for Databricks.
At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:
- If you want the story on where Spark is going, you do what I did — you ask Databricks.
- Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
- Databricks employs ~1/3 of Spark committers.
- Databricks organizes the Spark Summit.
But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.
Starting with my introduction to Spark, previous overview posts include those in:
I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:
- “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.
- “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.
- I raised concerns about the “data science” term 4 years ago.
In this video blog, Fishbowl Solutions’ Technical Project Manager, Justin Ames, and Marketing Team Lead, Jason Lamon, discuss Fishbowl’s Agile (like) approach to managing Oracle WebCenter portal projects. Justin shares an overview of what Agile and Scrum mean, how it is applied to portal development, and the customer benefits of applying Agile to an overall portal project.
“This is my first large project being managed with an Agile-like approach, and it has made a believer out of me. The Sprints and Scrum meetings led by the Fishbowl Solutions team enable us to focus on producing working portal features that can be quickly validated. And because it is an iterative build process, we can quickly make changes. This has lead to the desired functionality we are looking for within our new employee portal based on Oracle WebCenter.”
Staff VP, Compensation and HRIS
Large Health Insurance Provider
The post Fishbowl’s Agile (like) Approach to Oracle WebCenter Portal Projects appeared first on Fishbowl Solutions' C4 Blog.
Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.
1. The most basic form of lock-in is:
- You do application development for a target set of platform technologies.
- Your applications can’t run without those platforms underneath.
- Hence, you’re locked into those platforms.
2. Enterprise vendor standardization is closely associated with lock-in. The basic idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:
- That simplifies your environment, requiring less integration and interoperability.
- That simplifies your staffing; the same skill sets apply to multiple needs and projects.
- That simplifies your vendor support relationships; there’s “one throat to choke”.
- That simplifies your price negotiation.
3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:
- For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
- A few years later, the price is negotiated, based on current levels of usage.
Thus, doing an additional project using ELAed products may appear low-cost.
- Incremental license and maintenance fees may be zero in the short-term.
- Incremental personnel costs may be controlled because the needed skills are already in-house.
Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.
4. Subscriptions are closely associated with lock-in.
- Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
- Cloud lock-in has rapidly become a big deal.
- The open source vendors meeting lock-in resistance, noted above, have subscription business models.
Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.
5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.
6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:
- Oppose standardization.
- Like thick technology stacks.
Thus, departmental influence on IT both encourages and discourages lock-in.
7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on don’t really contradict that. IBM’s business models is to squeeze serve its still-large number of strongly loyal customers as well as it can.
8. Microsoft’s business model over the decades has also greatly depended on lock-in.
- Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
- Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
- Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.
Yet sometimes, Microsoft is more free and open.
- Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue for per Mac to what it got for each Windows PC.)
- Visual Studio is useful for writing apps to run against multiple DBMS.
- Just recently, Microsoft SQL Server was ported to Linux.
9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.
10. And with that as background, we can finally get what led me to finally write this post. Multiple clients have complaints that may be paraphrased as:
- Customers are locked into expensive traditional DBMS such as Oracle.
- Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
- Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. . Redshift or other Amazon offerings), creating whole new stacks of lock-in.
So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.
I agree with them that enterprises who feel this way are getting it wrong. Indeed:
- The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
- Serious users need support.
- Support and management tools happen to be synergistic with each other.
This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.
General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.
Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:
- Facebook has more and better engineers than you do.
- Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
- Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?
And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.
- I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
- The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
- I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
- I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.
A running list of recent posts is:
- As a companion to this post, I’m publishing a very long one on vendor lock-in.
Subjects I’d like to add to that list include:
- Spark (it’s prospering).
- Databricks (ditto, appearances to the contrary notwithstanding).
- Flink (it’s interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need).
- DataStax, MemSQL, Zoomdata, and Neo Technology (also prospering).
- Cloudera (multiple topics, as usual).
- Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
- Enterprises’ inconsistent views about vendor lock-in.
- Microsoft’s reinvention (it feels real).
- Metadata (it’s ever more of a thing).
- Machine learning (it’s going to be a big portion of my research going forward).
- Transitions to the cloud — this subject affects almost everything else.
I’ll edit these lists as appropriate when further posts go up.
Let’s cover some other subjects right here.
1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.
- Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
- I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
- If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
- Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.
And of course Flink is hoping to blow everybody else in the space away.
*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.
As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.
2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.
- Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
- Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.
I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.
3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.
4. Platfora insisted on meeting circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.
5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.
This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.
What is RequireJS and why is it important?
Require was born out of the need to reduce this code complexity. As such, it improves the speed and quality of our code. At its heart, RequireJS was designed to encourage and support modular development.
What is modular development?
In this module, we call define with an array of the dependencies needed. The dependencies are passed into the factory function as arguments. Importantly, the function is only executed once the required dependencies are loaded.
What does Require look like in Oracle JET
In an Oracle JET application, RequireJS is set up in the main.js (aka “bootstrap”) file. First we need to configure the paths to the various scripts/libraries needed for the app. Here is an example of the RequireJS configuration in the main.js file of the Oracle JET QuickStart template. It establishes the names and paths to all of the various libraries necessary to run the application:
Next we have the top-level “require” call which “starts”our application. It follows the AMD API method of encapsulating the module with the require, and passing in dependencies as an array of string values, then executing the callback function once the dependencies have loaded.
Here we are requiring any scripts and modules needed to load the application, and subsequently calling the function that creates the initial view. Any other code which is used in the initial view of the application is also written here (routing, for example). Note, we only pass in the dependencies that we need to load the initial application, saving valuable resources.
Using RequireJS in other modules/viewModels
As I observed yet again last week, much of analytics is concerned with anomaly detection, analysis and response. I don’t think anybody understands the full consequences of that fact,* but let’s start with some basics.
An anomaly, for our purposes, is a data point or more likely a data aggregate that is notably different from the trend or norm. If I may oversimplify, there are three kinds of anomalies:
- Important signals. Something is going on, and it matters. Somebody — or perhaps just an automated system — needs to know about it. Time may be of the essence.
- Unimportant signals. Something is going on, but so what?
- Pure noise. Even a fair coin flip can have long streaks of coming up “heads”.
Two major considerations are:
- Whether the recipient of a signal can do something valuable with the information.
- How “costly” it is for the recipient to receive an unimportant signal or other false positive.
What I mean by the latter point is:
- Something that sets a cell phone buzzing had better be important, to the phone’s owner personally.
- But it may be OK if something unimportant changes one small part of a busy screen display.
Anyhow, the Holy Grail* of anomaly management is a system that sends the right alerts to the right people, and never sends them wrong ones. And the quest seems about as hard as that for the Holy Grail, although this one uses more venture capital and fewer horses.
*The Holy Grail, in legend, was found by 1-3 knights: Sir Galahad (in most stories), Sir Percival (in many), and Sir Bors (in some). Leading vendors right now are perhaps around the level of Sir Kay.
Difficulties in anomaly management technology include:
- Performance is a major challenge. Ideally, you’re running statistical tests on all data — at least on all fresh data — at all times.
- User experiences are held to high standards.
- False negatives are very bad.
- False positives can be very annoying.
- Robust role-based alert selection is often needed.
- So are robust visualization and drilldown.
- Data quality problems can look like anomalies. In some cases, bad data screws up anomaly detection, by causing false positives. In others, it’s just another kind of anomaly to detect.
- Anomalies are inherently surprising. We don’t know in advance what they’ll be.
Consequences of the last point include:
- It’s hard to tune performance when one doesn’t know exactly how the system will be used.
- It’s hard to set up role-based alerting if one doesn’t know exactly what kinds of alerts there will be.
- It’s hard to choose models for the machine learning part of the system.
Donald Rumsfeld’s distinction between “known unknowns” and “unknown unknowns” is relevant here, although it feels wrong to mention Rumsfeld and Sir Galahad in the same post.
And so a reasonable summary of my views might be:
Anomaly management is an important and difficult problem. So far, vendors have done a questionable job of solving it.
But there’s a lot of activity, which I look forward to writing about in considerable detail.
- The most directly relevant companies I’ve written about are probably Rocana and Splunk.
Five years ago, in a taxonomy of analytic business benefits, I wrote:
A large fraction of all analytic efforts ultimately serve one or more of three purposes:
- Problem and anomaly detection and diagnosis
- Planning and optimization
That continues to be true today. Now let’s add a bit of spin.
1. A large fraction of analytics is adversarial. In particular:
- Many of the analytics companies I talk with tell me that they have important use cases in security, anti-fraud or both.
- Click fraud steals a large fraction of the revenue in online advertising and other promotion. Combating it is a major application need.
- Spam is another huge, ongoing fight.
- There’s an adversarial aspect to algorithmic trading. You’re trying to beat other investors. What’s more, they’re trying to identify your trading activity, so you’re trying to obscure it. Etc.
- Unfortunately, unfree countries can deploy analytics to identify attempts to evade censorship. I plan to post much more on that point soon.
- Similarly, de-anonymization can be adversarial.
- Analytics supporting national security often have an adversarial aspect.
- Banks deploy analytics to combat money-laundering.
Adversarial analytics are inherently difficult, because your adversary actively wants you to get the wrong answer. Approaches to overcome the difficulties include:
- Deploying lots of data. Email spam was only defeated by large providers who processed lots of email and hence could see when substantially the same email was sent to many victims at once. (By the way, that’s why “spear-phishing” still works. Malicious email sent to only one or a few victims still can’t be stopped.)
- Using unusual analytic approaches. For example, graph analytics are used heavily in adversarial situations, even though they have lighter adoption otherwise.
- Using many analytic tests. For example, Google famously has 100s (at least) of sub-algorithms contributing to its search rankings. The idea here is that even the cleverest adversary might find it hard to perfectly simulate innocent behavior.
2. I was long a skeptic of “real-time” analytics, although I always made exceptions for a few use cases. (Indeed, I actually used a form of real-time business intelligence when I entered the private sector in 1981, namely stock quote machines.) Recently, however, the stuff has gotten more-or-less real. And so, in a post focused on data models, I highlighted some use cases, including:
- It is increasingly common for predictive decisions to be made at [real-timeish] speeds. (That’s what recommenders and personalizers do.) Ideally, such decisions can be based on fresh and historical data alike.
- The long-standing desire for business intelligence to operate on super-fresh data is, increasingly, making sense, as we get ever more stuff to monitor. However …
- … most such analysis should look at historical data as well.
- Streaming technology is supplying ever more fresh data.
Let’s now tie those comments into the analytic use case trichotomy above. From the standpoint of mainstream (or early-life/future-mainstream) analytic technologies, I think much of the low-latency action is in two areas:
- Monitoring and troubleshooting networked equipment. This is generally an exercise in anomaly detection and interpretation.
- At sufficiently large online companies, there’s a role for low-latency marketing decision support.
- Low-latency marketing-oriented BI can also help highlight system malfunctions.
- Investments/trading has a huge low-latency aspect, but that’s somewhat apart from the analytic mainstream. (And it doesn’t fit well into my trichotomy anyway.)
- Also not in the analytic mainstream are the use cases for low-latency (re)planning and optimization.
My April, 2015 post Which analytic technology problems are important to solve for whom? has a round-up of possibly relevant links.
Taxonomy is a Sleeper. The reasons from A to ZZZs that taxonomy hasn’t been a part of your most important projects—but should be!
I’m a taxonomy practitioner at Fishbowl Solutions who has worked with many companies to implement simple to sophisticated document management systems. I’ve noticed over the years the large number of obstacles that have prevented companies from establishing taxonomy frameworks to support effective document management. I won’t review an exhaustive alphabetic list of obstacles, in fact, there are probably far more than 26, but I’ll highlight the top culprits that have turned even the best, most sophisticated companies away from taxonomy. Don’t fall asleep. Don’t hit snooze. Make sure you don’t miss one of the most important parts of a document management software project–taxonomy. Taxonomy is a necessity to deliver effective document management solutions in Oracle WebCenter Content, SharePoint, or any other enterprise content management solution. You’ll get the most out of the software and your users.
Authority. Who owns taxonomy? Does IT own the taxonomy or a Quality Management Department or all departments own a piece? Determining decision-makers and authority to sign off on taxonomy frameworks can be difficult. After all, taxonomies are best when they are enterprise-wide solutions. Then, users have a familiar context when working with documents for all business purposes. Don’t let challenges with authority prevent you from establishing taxonomy for your project. Plan on establishing a governance team to own the taxonomy practice for the current project and in the future.
Bright. Shiny. Object. Taxonomy is not a bright shiny object. It’s not as fancy as the user interface of the new software. It doesn’t have the “bells and whistles” that hardware and devices have either. So, too often document management projects end up focusing on the software and not the necessary taxonomy that makes that software a rock star. Don’t be blinded. If you want users to have a great experience, work with documents effectively, and generally adopt your new document management software, you must ensure you define a taxonomy. Otherwise, your bright shiny object may easily be replaced by the next one as it loses appeal.
Complicated. I often hear from customers that a business taxonomy is complicated. It can seem insurmountable to sift through existing taxonomy frameworks (or identify new ones), synthesize frameworks, identify new requirements, and really come up with something comprehensive. Regardless, it’s necessary. If a taxonomy effort is complicated, think of how complicated managing and searching documents is for your users. Help your users by including taxonomy in your next project to simplify their experience. It’s the foundation for browsing, searching, contribution, workflows, interface design, and more.
Glamour. Unfortunately, taxonomy is not glamorous. It’s hard, investigative work. It entails identifying stakeholders; meeting with stakeholders to really understand documentation, process, and users; generating consensus; and documenting, documenting, documenting. On top of that, it’s invisible. Users often don’t even notice taxonomies, especially if they’re good. But if a taxonomy is non-existent or poorly designed, your users will notice the taxonomy for all the wrong reasons—unintuitive naming, missing categories, illogical hierarchies, and more. Even though taxonomy is not glamorous, it demands an investment to ensure your project is successful, at launch and thereafter.
Time. It’s common to hear in projects that there is just not enough time. Customers may say “We need to complete X with the project by date Y.” Or, “The management team really needs to see something.” Frequently, the most important milestones for projects are software-related, causing taxonomy to lose focus. The good thing about taxonomy is that projects can work concurrently on the software build out as they work on taxonomy frameworks. You can do both and do them well. Resist the urge to scope out taxonomy in your next project and consider creative ways to plan in taxonomy.
What? Yes, taxonomy has been around for a long time, but still often in projects I see that it’s just something that people are not aware of. It’s existed for years in the biological and library sciences fields and has had application in IT and many other fields, but often it is just not understood for document management projects. If you’re not familiar with taxonomy, see my previous blog post “Taxonomy isn’t just for frogs anymore.” and consider hiring a reputable company that can guide you through the practice for your next project.
ZZZs. It’s often perceived as a boring practice with tasks that are in the weeds, but some of us do love it. Actually, we even find it rewarding to solve the puzzle of the perfect categorization that works for the project and the customer. If you’re new to taxonomy, you may find that you like it too. If not, find a resource for your project who has a passion for taxonomy because a good taxonomy is so important to successful document management projects.
It’s time to have your eyes wide open. If you’re considering a document management software or improvement project, consider how important the underlying taxonomy is for your project and plan taxonomy analysis and development as a required effort. Your users will appreciate it and your business will see increased software utilization. Remember the old adage, “Technology cannot solve your business problems?” It can’t. But technology + taxonomy can.
This blog is one in a series discussing taxonomy topics. Watch for the next blog coming soon.
Carrie McCollor is a Business Solutions Architect at Fishbowl Solutions. Fishbowl Solutions was founded in 1999. Their areas of expertise include Oracle WebCenter, PTC’s Product Development System (PDS), and enterprise search solutions using the Google Search Appliance. Check out our site to learn more about what we do.
One of the most important issues in privacy and surveillance is also one of the least-discussed — the use of new surveillance technologies in ordinary law enforcement. Reasons for this neglect surely include:
- Governments, including in the US, lie about this subject a lot. Indeed, most of the reporting we do have is exposure of the lies.
- There’s no obvious technology industry ox being gored. What I wrote in another post about Apple, Microsoft et al. upholding their customers’ rights doesn’t have a close analogue here.
One major thread in the United States is:
- The NSA (National Security Agency) collects information on US citizens. It turns a bunch of this over to the “Special Operations Division” (SOD) of the Drug Enforcement Administration (NSA).
- The SOD has also long collected its own clandestine intelligence.
- The SOD turns over information to the DEA, FBI (Federal Bureau of Investigation), IRS (Internal Revenue Service) and perhaps also other law enforcement agencies.
- The SOD mandates that the recipient agencies lie about the source of the information, even in trials and court filings. This is called “parallel construction”, in that the nature of the lie is to create another supposed source for the original information, which has the dual virtues of:
- Making it look like the information was obtained by allowable means.
- Protecting confidentiality of the information’s true source.
- There is a new initiative to allow the NSA to share more surveillance information on US citizens with other agencies openly, thus reducing the “need” to lie, and hopefully gaining efficiency/effectiveness in information-sharing as well.
Similarly, StingRay devices that intercept cell phone calls (and thus potentially degrade service) are used by local police departments, who then engage in “parallel construction” for several reasons, one simply being an NDA with manufacturer Harris Corporation.
Links about these and other surveillance practices are below.
At this point we should note the distinction between intelligence/leads and admissible evidence.
- Intelligence (or leads) is any information that can be used to point law enforcement or security forces at people who either plan to do or already have done unlawful and/or very harmful things.
- Admissible evidence is information that can legally be used to convict people of crimes or otherwise bring down penalties and sanctions upon then.
I won’t get into the minutiae of warrants, subpoenas, probable cause and all that, but let’s just say:
- In theory there’s a semi-bright line between intelligence and admissible evidence; i.e., there’s some blurring, but in most cases the line can be pretty easily seen.
- In practice there’s a lot of blurring. Parallel construction is only one of the ways the semi-bright line gets scuffed over.
- Even so, this distinction has great value. The number of people who have been badly harmed in the US by inappropriate use of inadmissible intelligence isn’t very high …
- … yet.
“Yet” is the key word. My core message in this post is that — despite the lack of catastrophe to date — the blurring of the intelligence/evidence line needs to be greatly reversed:
Going forward, the line between intelligence and admissible evidence needs to be established and maintained in a super-bright state.
As you may recall, I’ve said that for years, in a variety of different phrasings. Still, it’s a big enough deal that I feel I should pound the table about it from time to time — especially now, when public policy in other aspects of surveillance is going pretty well, but this area is headed for disaster. My argument for this view can be summarized in two bullet points:
- Massive surveillance is inevitable.
- Unless the uses of the resulting information are VERY limited, freedoms will be chilled into oblivion.
I recapitulate the chilling effects argument frequently, so for the rest of this post let’s focus on the first bullet point. Massive surveillance will be a fact of life for reasons including:
- As a practical political matter, domestic surveillance will be used at least for anti-terrorism. If you doubt that — please just consider the number of people who support Donald Trump.
- Actually, the constituency for anti-terrorism surveillance is much more than just the paranoid idiots. Indeed — and notwithstanding the great excesses of anti-terrorism propaganda around the world — that constituency includes me. My reasons start:
- In a country of well over 300 million people, there probably are a few who are both crazy and smart enough to launch Really Bad Attacks. Stopping them before they act is a Very Good Idea.
- The alternative is security — or more likely security theater — measures that are intrusive across the board. I like unfettered freedom of movement, for example. But I can barely stand the TSA (Transportation Security Administration).
- Commercial “surveillance” is intense. And it’s essential to the internet economy.
And so I return to the point I’ve been making for years: Surveillance WILL happen. So the use of surveillance information needs to be tightly limited.
- Reason’s recent rant about parallel construction contains a huge number of links. Ditto a calmer Rodney Balko blog for the Washington Post. (March, 2016).
- Reuters gave details of the SOD’s thou-shalt-lie mandates in August, 2013.
- If you have a clearance and work in the civilian sector, you may be subject to 24/7 surveillance, aka continuous evaluation, for fear that you might be the next Ed Snowden. (March, 2016)
- License plate scanning databases are already a big deal in law enforcement. (October, 2015)
- StingRay-type devices are powerful, and have been for quite a few years. They’re really powerful. Procedures related to StingRay surveillance are in flux. (2015)
- Chilling effects are real. (April, 2016)
- At least one federal court has decided that tracking URLs visited without a warrant is an illegal wiretap. Other courts think your URL visits, shopping history, etc. are fair game. (November, 2015)
- Pakistan in effect bugged citizens’ cell phones to track their movements and force polio vaccines on them. (November, 2015)
- This is not totally on-topic, but it does support worries about what the government can do with surveillance-based analytics — law enforcement can wildly exaggerate the significance of its “scientific” evidence, and gain bogus convictions as a result. (2015-2016).
- The Electronic Frontier Foundation offers a dated but fact-filled overview of NSA domestic spying (2012-2013).
Numerous tussles fit the template:
- A government wants access to data contained in one or more devices (mobile/personal or server as the case may be).
- The computer’s manufacturer or operator doesn’t want to provide it, for reasons including:
- That’s what customers prefer.
- That’s what other governments require.
- Being pro-liberty is the right and moral choice. (Yes, right and wrong do sometimes actually come into play. )
As a general rule, what’s best for any kind of company is — pricing and so on aside — whatever is best or most pleasing for their customers or users. This would suggest that it is in tech companies’ best interest to favor privacy, but there are two important quasi-exceptions:
- Recommendation/personalization. E-commerce and related businesses rely heavily on customer analysis and tracking.
- When the customer is the surveiller. Governments pay well for technology that is used to watch over their citizens.
I used the “quasi-” prefix because screwing the public is risky, especially in the long term.
Something that is not even a quasi-exception to the tech industry’s actual or potential pro-privacy bias is governmental mandates to let their users be watched. In many cases, governments compel privacy violations, by threat of severe commercial or criminal penalties. Tech companies should and often do resist these mandates as vigorously as they can, in the courts and/or via lobbying as the case may be. Yes, companies have to comply with the law. However, it’s against their interests for the law to compel privacy violations, because those make their products and services less appealing.
The most visible example of all this right now is the FBI/Apple kerfuffle. To borrow a phrase — it’s complicated. Among other aspects:
- Syed Rizwan Farook, one of the San Bernardino terrorist murderers, had 3 cell phones. He carefully destroyed his 2 personal phones before his attack, but didn’t bother with his iPhone from work.
- Notwithstanding this clue that the surviving phone contained nothing of interest, the FBI wanted to unlock it. It needed technical help to do so.
- The FBI got a court order commanding Apple’s help. Apple refused and appealed the order.
- The FBI eventually hired a third party to unlock Farook’s phone, for a price that was undisclosed but >$1.3 million.
- Nothing of interest was found on the phone.
- Stories popped up of the FBI asking for Apple’s help unlocking numerous other iPhones. The courts backed Apple or not depending on how they interpreted the All Writs Act. The All Writs Act was passed in the first-ever session of the US Congress, in 1789, and can reasonably be assumed to reflect all the knowledge that the Founders possessed about mobile telephony.
- It’s widely assumed that the NSA could have unlocked the phones for the FBI — but it didn’t.
Russell Brandom of The Verge collected links explaining most of the points above.
With that as illustration, let’s go to some vendor examples:
- Apple — which sells devices much more than advertising — has clearly decided that being (seen as) pro-privacy is its preferred course.
- Microsoft — all rumors about Skype backdoors and the like notwithstanding — has made a similar choice. Notably, it is struggling to keep data hosted on its European servers out of US subpoena reach.
- Amazon and Google, by way of contrast, whose core consumer businesses depend on recommendation/personalization, have not been so visible about protecting the privacy of their cloud services’ data.
- Blackberry, meanwhile, seems to split the difference, being pro-privacy in its enterprise server business but acquiescing to surveillance in its consumer operations.
All of these cases seem consistent with my comments about vendors’ privacy interests above.
Bottom line: The technology industry is correct to resist government anti-privacy mandates by all means possible.
This year, privacy and surveillance issues have been all over the news. The most important, in my opinion, deal with the tension among:
- Personal privacy.
- General law enforcement.
More precisely, I’d say that those are the most important in Western democracies. The biggest deal worldwide may be China’s movement towards an ever-more-Orwellian surveillance state.
The main examples on my mind — each covered in a companion post — are:
- The Apple/FBI conflict(s) about locked iPhones.
- The NSA’s propensity to share data with civilian law enforcement.
Legislators’ thinking about these issues, at least in the US, seems to be confused but relatively nonpartisan. Support for these assertions includes:
- The recent unanimous passage in the US House of Representatives of a law restricting police access to email.
- An absurd anti-encryption bill proposed in the US Senate.
- The infrequent mention of privacy/surveillance issues in the current election campaign.
I do think we are in for a spate of law- and rule-making, especially in the US. Bounds on the possible outcomes likely include:
- Governments will retrain broad powers for anti-terrorism If there was any remaining doubt, the ISIS/ISIL/Daesh-inspired threats guarantees that surveillance will be intense.
- Little will happen in the US to clip the wings of internet personalization/recommendation. To a lesser extent, that’s probably true in other developed countries as well.
- Non-English-speaking countries will maintain data sovereignty safeguards, both out of genuine fear of (especially) US snooping and as a pretext to support their local internet/cloud service providers.
As always, I think that the eventual success or failure of surveillance regulation will depend greatly on the extent to which it accounts for chilling effects. The gravity of surveillance’s longer-term dangers is hard to overstate, yet they still seem broadly overlooked. So please allow me to reiterate what I wrote in 2013 — surveillance + analytics can lead to very chilling effects.
When government — or an organization such as your employer, your insurer, etc. — watches you closely, it can be dangerous to deviate from the norm. Even the slightest non-conformity could have serious consequences.
And that would be a horrific outcome.
… direct controls on surveillance … are very weak; government has access to all kinds of information. … And they’re going to stay weak. … Consequently, the indirect controls on surveillance need to be very strong, for they are what stands between us and a grim authoritarian future. In particular:
- Governmental use of private information needs to be carefully circumscribed, including in most aspects of law enforcement.
- Business discrimination based on private information needs in most cases to be proscribed as well.
The politics of all this is hard to predict. But I’ll note that in the US:
- There’s an emerging consensus that the criminal justice system is seriously flawed, on the side of harshness. However …
- … criminal justice reform is typically very slow.
- The libertarian movement (Ron Paul, Rand Paul, aspects of the Tea Party folks, etc.) seems to have lost steam.
- The courts cannot be relied upon to be consistent. Questions about Supreme Court appointments even aside, Fourth Amendment jurisprudence in the US has long been confusing and confused.
- Few legislators understand technology.
Realistically, then, the main plausible path to a good outcome is that the technology industry successfully pushes for one. That’s why I keep writing about this subject in what is otherwise a pretty pure technology blog.
Bottom line: The technology industry needs to drive privacy/ surveillance public policy in directions that protect individual liberties. If it doesn’t, we’re all screwed.
My blogs are having a bad time with comment spam. While Akismet and other safeguards are intercepting almost all of the ~5000 attempted spam comments per day, the small fraction that get through are still a large absolute number to deal with.
There’s some danger I’ll need to restrict comments here to combat it. (At the moment they’ve been turned off almost entirely on Text Technologies, which may be awkward if I want to put a post up there rather than here.) If I do, I’ll say so in a separate post. I apologize in advance for any inconvenience.
Taxonomy can be a nebulous term. It has existed for years, having probably its most common roots in the sciences, but has blossomed to apply its practices to a plethora of other fields. The wide application of taxonomy shows how useful and effective it is, yet its meaning can be unclear due to its diversity. We identify with taxonomy in library sciences with the Dewey Decimal System and we identify with taxonomy in the scientific use when we talk about animals (Kingdom: Animalia; Phylum: Chordata; Class: Amphibia; Clade: Salientia; Order: Anura (frog)). These are familiar uses to us. We learned of them early on in school. We’ve seen them around for years—even if we didn’t identify them as taxonomies. But what is taxonomy when we talk about subjects, like documents and data, that aren’t so tangible? As a Business Solutions Architect at Fishbowl Solutions, I encounter this question quite a bit when working on Oracle WebCenter Content document management projects with customers.
The historical Greek term taxonomy means “arrangement law.” Taxonomy is the practice in which things, in this case documents, are arranged and classified to provide order for users. When it comes to documents, we give this order by identifying field names, field values, and business rules and requirements for tagging documents with these fields. These fields then describe the document so that we can order the document, know more about it, and do more with it.
- Document Type: Policy
- Document Status: Active
- Document Owner: Administrator
- Lifecycle: Approved
- Folder: HR
- Sub-Folder: Employee Policies
- And so on…
Defining taxonomy for documents provides a host of business and user benefits for document management, such as:
- A classification and context for documents. It tells users how a document is classified and where it “fits in” with other documents. It gives the document a name and a place. When a document is named and placed, it enables easier searching and browsing for users to find documents, as well as an understanding of the relationship of one document to another. Users know where it will be and how to get it.
- A simplified experience. When we have order, we reduce clutter and chaos. No more abandoned or lost documents. Everything has a place. This simplifies and improves the user experience and can reduce frustration as well. Another bonus: document management and cleanup is a simple effort. Documents out of order are easy to identify and can be put in place. Documents that are ordered can be easily retrieved, for instance for an archiving process, and managed.
- An arrangement that makes sense for the business. Using taxonomy in a document management system like Oracle’s WebCenter Content allows a company to define its own arrangement for storing and managing documents that resonates with users. Implementing a taxonomy that is familiar to users will make the document management system exponentially more usable and easier to adopt. No more guessing or interpreting arrangement or terminology—users know what to expect, terms are common, they are in their element!
- A scalable framework. Utilizing a defined and maintained taxonomy will allow users to adopt the common taxonomy as they use the document management system, but will also allow for business growth as new scope (documents, processes, capabilities, etc.) is added. Adding in a new department with new documents? Got it. Your scalable taxonomy can be reused or built upon. Using a comprehensive taxonomy that is scalable allows for an enterprise approach to document management where customizations and one-offs are minimized, allowing for a common experience for users across the business.
- A fully-enabled document management system. Lastly, defining a taxonomy will allow for full utilization of your OracleWebCenter Content, or other, document management system. Defining a taxonomy and integrating it with your document management system will enable building out:
- logical folder structures,
- effective browse and search capabilities,
- detailed profiles and filters,
- advanced security,
- sophisticated user interfaces and more.
Clearly, a taxonomy is the solution to providing necessary order and classification to documents. It creates a common arrangement and vocabulary to empower your users, and your document management system, to work the best for you. Now hop to it!
This blog is the first in a series discussing taxonomy topics. Watch for the next blog entitled “Taxonomy is a Sleeper. The reasons from A to ZZZs that taxonomy hasn’t been a part of your most important projects—but should be!”
Carrie McCollor is a Business Solutions Architect at Fishbowl Solutions. Fishbowl Solutions was founded in 1999. Their areas of expertise include Oracle WebCenter, PTC’s Product Development System (PDS), and enterprise search solutions using the Google Search Appliance. Check out our site to learn more about what we do.
The post Taxonomy isn’t just for frogs anymore. What taxonomy means in document management. appeared first on Fishbowl Solutions' C4 Blog.
This post comes from Fishbowl’s president, Tim Gruidl. One of Tim’s biggest passions is technology innovation, and not only does he encourage others to innovate, he participates and helps drive this where he can. Tim likes to say “we innovate to help customers dominate”. Tim summarizes Fishbowl’s Hackathon event, held last Friday and Saturday at Fishbowl Solutions, in the post below.
What a great event to learn, build the team, interact with others and compete. We also created some innovative solutions that I’m sure at some point will be options to help our customers innovate and extend their WebCenter investments. This year, we had 3 teams that designed and coded the following solutions:
- InSight Image Processing – Greg Bollom and Kim Negaard
They leveraged the Google Vision API to enable the submission of images to Oracle WebCenter and then leveraged Google Vision to pull metadata back and populate fields within the system. They also added the ability to pull in GPS coordinates from photos (taken from cameras, etc.) and have that metadata and EXIF data populate WebCenter Content.
- Slack Integation with WebCenter Portal and Content – Andy Weaver, Dan Haugen, Jason Lamon and Jayme Smith
Team collaboration is a key driver for many of our portals, and Slack is one of the most popular collaboration tools. In fact, it is currently valued at $3.6 billion, and there seems to be a rapidly growing market for what they do. The team did some crazy innovation and integration to link Slack to both WebCenter Portal and WebCenter Content. I think the technical learning and sophistication of what they did was probably the most involved and required the most pre-work and effort at the event, and it was so cool to see it actually working.
- Oracle WebCenter Email Notes – John Sim (Oracle ACE) Lauren Beatty and me
Valuable corporate content is stored in email, and more value can be obtained from those emails if the content can be tagged and context added in a content management system – Oracle WebCenter. John and Lauren did an awesome job of taking a forwarded email, checking it into WebCenter Content to a workspace, and using related content to build relationships. You can then view the relationships in a graphical way for context. They also created a mobile app to allow you to tag the content on the go and release it for the value of the org.
Participants voted on the competing solutions, and it ended up being a tie between the Google Insight team and the Email Notes team, but all the solutions truly showed some innovation, sophistication, and completeness of vision. A key aspect of the event for me was how it supported all of Fishbowl’s company values:
Customer First – the solutions we build were based on real-life scenarios our customers have discussed, so this will help us be a better partner for them.
Teamwork – the groups not only worked within their teams, but there was cross team collaboration – Andy Weaver helped John Sim solve an issue he was having, for example.
Intellectual Agility – this goes without saying.
Ambition – people worked late and on the weekend – to learn more, work with the team and have fun.
Continuous Learning – we learned a lot about Slack, cloud, email, etc.
Overall, the annual Hackathon is a unique event that differentiates Fishbowl on so many fronts. From the team building, to the innovation keeping us ahead of the technology curve, to all the learnings – Hackathons truly are a great example of what Fishbowl is all about.
Thanks to all that participated, and remember, let’s continue to innovate so our customers can dominate.
Hackathon weekend at Fishbowl Solutions – Google Vision, Slack, and Email Integrations with Oracle WebCenter
It’s hackathon weekend at Fishbowl Solutions. Fishbowl’s consulting and development teams – the hackers – along with members of the sales and marketing teams join forces to collaborate on and develop new software applications. While the overall goal of the hackathon may be to produce usable software, the event also is a great learning opportunity for participants and results in a lot of fun.
This is Fishbowl’s 4th annual hackathon and previous events have produced “beta” software that eventually evolved into shippable software components that benefited customers. Here are recaps on the 2012 and 2014 events.
This year there were over 16 different ideas, and out of those 3 teams were formed to develop the following:
- Oracle WebCenter Portal and Slack integration – Slack is a popular collaboration tool for the enterprise that enables members to communicate across channels (specific topics), send direct messages, and drag and drop files for sharing. Integrating Slack with WebCenter Portal brings its popular features and ease of use directly in context of a user’s portal session, ensuring that collaboration is easy and reducing the amount of switching between applications to communicate with others – leaving the portal to send an email, for example.
- Oracle WebCenter Content and Google Vision integration – This integration would enable the tagging of images upon check-in. The Google Vision API enables applications to understand the content of images by encapsulating machine learning models in an easy to use REST API. Using this technology, images are auto-classified into thousands of categories (e.g., “sailboat”, “lion”, “Eiffel Tower”). For example, you might check in a picture of a knit hat and it would be tagged with xKeywords of “hat”, “knit hat”, and “fashion accessories” without any human tagging. To further automate image discovery, the GSA can be used to map related terms so that searches for “beanie”, “stocking cap”, or “winter hat”, could also return the image. This tagging automation would have great implications for Oracle WebCenter customers that are using it for Digital Asset Management.
- Oracle WebCenter Content Email Check in - This integration would enable emails with attachments to be checked in to WebCenter Content automatically. Instead of the user having to check in the email itself, and then relating each attachment to the associated email, which results in additional check in steps, the emails and attachments would be parsed out and sent to a user workspaces in WebCenter. From there, users can tag and validate that the email should be checked in with the appropriate attachments – either from their desktops or mobile device.
The hacking commenced at 3 PM today and will continue until 4 PM on Saturday, April 16th. Each team will then present their developed integration/component, and the other Fishbowl team members will vote on their favorite finished product. Check back on this blog next week to see who won.