In which I observe that Tim Cook and the EFF, while thankfully on the right track, haven’t gone nearly far enough.
Traditionally, the term “chilling effect” referred specifically to inhibitions on what in the US are regarded as First Amendment rights — the freedoms of speech, the press, and in some cases public assembly. Similarly, when the term “chilling effect” is used in a surveillance/privacy context, it usually refers to the fear that what you write or post online can later be held against you. This concern has been expressed by, among others, Tim Cook of Apple, Laura Poitras, and the Electronic Frontier Foundation, and several research studies have supported the point.
But that’s only part of the story. As I wrote in July, 2013,
… with the new data collection and analytic technologies, pretty much ANY action could have legal or financial consequences. And so, unless something is done, “big data” privacy-invading technologies can have a chilling effect on almost anything you want to do in life.
The reason, in simplest terms, is that your interests could be held against you. For example, models can estimate your future health, your propensity for risky hobbies, or your likelihood of changing your residence, career, or spouse. Any of these insights could be useful to employers or financial services firms, and not in a way that redounds to your benefit. And if you think enterprises (or governments) would never go that far, please consider an argument from the sequel to my first “chilling effects” post:
What makes these dangers so great is the confluence of two sets of factors:
- Some basic facts of human nature and organizational behavior — policies and procedures are biased against risk of “bad” outcomes, because people (and organizations) fear (being caught) making mistakes.
- Technological developments that make ever more precise judgments as to what constitutes risk, or deviation from “proven-safe” profiles.
A few people have figured at least some of these dangers out. ACLU policy analyst Jay Stanley got there before I did, as did a pair of European Law and Economics researchers. Natasha Lomas of TechCrunch seems to get it. But overall, the chilling effects discussion — although I’m thrilled that it’s gotten even this far — remains much too narrow.
In a tough economy, will the day come that people organize their whole lives to appear as prudent and risk-averse as possible? As extreme as it sounds, that danger should not be overlooked. Plenty of societies have been conformist with much weaker mechanisms for surveillance (i.e., little beyond the eyes and ears of nosy neighbors).
And so I return yet again to my privacy mantra — we need to regulate information use, not just information collection and retention. To quote a third post from that July, 2013 flurry:
- Governmental use of private information needs to be carefully circumscribed, including in most aspects of law enforcement.
- Business discrimination based on private information needs in most cases to be proscribed as well.
As for exactly what those regulations should be — that, of course, is a complex subject in itself.
Occasionally I talk with an astute reporter — there are still a few left — and get led toward angles I hadn’t considered before, or at least hadn’t written up. A blog post may then ensue. This is one such post.
There is a group of questions going around that includes:
- Is Hadoop overhyped?
- Has Hadoop adoption stalled?
- Is Hadoop adoption being delayed by skills shortages?
- What is Hadoop really good for anyway?
- Which adoption curves for previous technologies are the best analogies for Hadoop?
To a first approximation, my responses are:
- The Hadoop hype is generally justified, but …
- … what exactly constitutes “Hadoop” is trickier than one might think, in at least two ways:
- Hadoop is much more than just a few core projects.
- Even the core of Hadoop is repeatedly re-imagined.
- RDBMS are a good analogy for Hadoop.
- As a general rule, Hadoop adoption is happening earlier for new applications, rather than in replacement or rehosting of old ones. That kind of thing is standard for any comparable technology, both because enabling new applications can be valuable and because migration is a pain.
- Data transformation, as pre-processing for analytic RDBMS use, is an exception to that general rule. That said …
- … it’s been adopted quickly because it saves costs. But of course, a use case that’s only about cost savings may not generate a lot of revenue.
- Dumping data into a Hadoop-centric “data lake” is a smart decision, even if you haven’t figured out yet what to do with it. But of course, …
- … even if zero-application adoption makes sense, it isn’t exactly a high-value proposition.
- I’m generally a skeptic about market numbers. Specific to Hadoop, I note that:
- The most reliable numbers about Hadoop adoption come from Hortonworks, since it is the only pure-play public company in the market. (Compare, for example, the negligible amounts of information put out by MapR.) But Hortonworks’ experiences are not necessarily identical to those of other vendors, who may compete more on the basis of value-added service and technology rather than on open source purity or price.
- Hadoop (and the same is true of NoSQL) is most widely adopted at digital companies rather than at traditional enterprises.
- That said, while all traditional enterprises have some kind of digital presence, not all have ones of the scope that would mandate a heavy investment in internet technologies. Large consumer-oriented companies probably do, but companies with more limited customer bases might not be there yet.
- Concerns about skill shortages are exaggerated.
- The point of distributed processing frameworks such as Spark or MapReduce is to make distributed analytic or application programming not much harder than any other kind. (A minimal sketch follows this list.)
- If a new programming language or framework needs to be adopted — well, programmers nowadays love learning that kind of stuff.
- The industry is moving quickly to make distributed systems easier to administer. Any skill shortages in operations should prove quite temporary.
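Here’s that sketch; the file paths, record layout, and field positions are invented, and this uses PySpark’s plain RDD API rather than anything vendor-specific.

```python
# A minimal PySpark sketch of the point above: the framework handles the
# distribution, so the analytic code reads much like ordinary single-machine
# Python. Paths and the record layout are made up for illustration.
from pyspark import SparkContext

sc = SparkContext(appName="clickstream-rollup")

daily_page_views = (
    sc.textFile("hdfs:///data/clickstream/2015/06/*")    # hypothetical input path
      .map(lambda line: line.split("\t"))                 # assumed tab-delimited: date, page, event_type
      .filter(lambda fields: fields[2] == "page_view")
      .map(lambda fields: ((fields[0], fields[1]), 1))
      .reduceByKey(lambda a, b: a + b)                    # this step runs across the cluster
)
daily_page_views.saveAsTextFile("hdfs:///data/rollups/page_views_daily")
```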
Did your organization recently purchase Oracle WebCenter Content? Are you the new admin? Consider these 4 tools to successfully manage and administer the system
Congratulations! Your organization has made an investment in a leading enterprise content management and portal system, and even better, you get to manage and administer the system. Lucky you, right? As long as users can access the system, find what they need, and receive important, system-generated notifications that are relevant to them, they will generally be happy and leave you alone, right?
Unfortunately, a complete state of end-user bliss doesn’t exist – if it did, there might not be a need for system administrators. The reason for this is there will be many different personas that will access and use the system, and each will have their own set of likes, dislikes and complaints. For example, some users won’t like the interface (most popular complaint). Others will complain about the (perceived) lack of features and functionality. Regardless, as the system administrator you will not be able to satisfy all user requests, but with Fishbowl’s Administration Suite you will be able to ensure WebCenter users get the most out of inherent features and system functionality.
Administration (Admin) Suite brings together Fishbowl’s most popular components for WebCenter automation. They are popular because they help administrators perform the most common and repetitive tasks within Oracle WebCenter Content more efficiently, and because they provide additional functionality that delivers more business value. These include rule-based security mappings to provide users with the right level of access (read, read/write, read/write/delete, etc.), custom email notifications for content subscriptions, scheduling and off-loading of bulk content loading into WebCenter, and several workflow features that aid in workflow creation, review and auditing. Let’s take a closer look at each of these components.
Advanced User Security Mapping (AUSM)
- AUSM provides rule-based configuration to integrate external user sources (LDAP or Active Directory) with Oracle WebCenter. Rules can be created to assign aliases to users based on their directory information, and this information can be directly imported into WebCenter. AUSM also provides reporting capabilities to quickly audit user access and troubleshoot permission issues.
- Business Problems it solves:
- Decreases the time it takes for administrators to integrate an enterprise security model with Oracle WebCenter. No more creating many (sometimes hundreds of) mappings between LDAP groups and roles in WebCenter
- Enables administrators to quickly troubleshoot user access issues
- Accelerates new user access to content in the system by not having to wait until users log in
Subscription Notifier
- Subscription Notifier provides an intuitive interface for administrators to create subscriptions for tracking and managing business content. This is done through queries that can be scheduled to run at various intervals. For example, you can create queries to notify a contract manager at 90, 60 and 30 days before a vendor contract expires. You can also create queries that notify content owners that content is X days old and should be reviewed.
- Business problem it solves:
- Ensures internal and external stakeholders have visibility into when high-value content (contracts, etc.) is set to expire, and enough time to respond
- Helps avoid duplication of effort by alerting teams when new content is available – sales teams get notified when new marketing content is checked in, for example
- Provides owners of web content with triggers to update, create new, or delete content – this can help keep site content fresh, which is important for SEO
Enterprise Batch Loader
- Enterprise Batch Loader provides a robust, standalone component for WebCenter administrators to quickly and efficiently load content into the system. This content can come from ERP, CRM, CAD and other business systems as Enterprise Batch Loader can be configured to “watch” folders where such data is output and then create a batch load file for loading into WebCenter. Metadata from these systems can also be mapped to fields existing in WebCenter.
- Business problems it solves:
- Helps organizations reduce content repositories and file shares by automating the process of checking content into Oracle WebCenter
- Ensures high-value data from ERP, CRM, CAD and other business systems gets loaded into WebCenter, providing a single location for users to access and find information
Workflow Solution Set
- Packed with 9 powerful features, Workflow Solution Set complements the out-of-the-box Oracle WebCenter workflows through detailed auditing, the ability to search for content in workflow, and the ability to customize email notifications and the workflow review pane. Workflow Solution Set makes it easier for users to interact with and fully leverage WebCenter Content workflows.
- Business problem it solves:
- Helps remove confusion from the workflow process by enabling explicit instructions to be included in the workflow review pane or email notifications
- Ensures the history of a workflow is fully retained – including rejection comments
- Provides visibility into content that is in a workflow – natively in WebCenter, items in a workflow are not included in search results
- Improves the performance of the review process by separating workflow items into pages instead of one long list
I will be covering more of the capabilities of these components that make up Fishbowl’s Administration Suite during a one-hour webinar on Thursday, June 11th. Come hear more about why, together, the components of Fishbowl’s Admin Suite provide the perfect tools for WebCenter admins.
You can register for the event here. I hope you will be able to join us.
At the highest level:
- Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
- Teradata is announcing support for Presto with a classic open source pricing model.
- Presto will also become, roughly speaking, Teradata’s replacement for Hive.
- Teradata’s Presto efforts are being conducted by the former Hadapt.
Now let’s make that all a little more precise.
Regarding Presto (and I got most of this from Teradata):
- To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
- … Presto queries other data stores too, such as various kinds of RDBMS, and federates query results. (See the sketch after this list.)
- Facebook created Hive and, more recently, Presto.
- Facebook started the Presto project in 2012 and now has 10 engineers on it.
- Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
- Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
- Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.
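Here’s that sketch: one Presto query joining an HDFS-resident Hive table to a table in an operational MySQL database, via Presto’s catalog.schema.table naming. The Python wrapper assumes the PyHive package’s Presto DBAPI; the host, catalogs, schemas, and table names are all invented.

```python
# Illustrative only -- the cluster, catalogs, and tables below do not exist.
from pyhive import presto

conn = presto.connect(host="presto-coordinator.example.com", port=8080)
cur = conn.cursor()
cur.execute("""
    SELECT c.region, sum(e.amount) AS total_purchases
    FROM hive.web.purchase_events e    -- lives in HDFS
    JOIN mysql.crm.customers c         -- lives in an operational MySQL database
      ON e.customer_id = c.customer_id
    GROUP BY c.region
""")
for region, total in cur.fetchall():
    print(region, total)
```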
Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project:
- Data is pipelined between operators, with no gratuitous writing to disk the way you might have in something MapReduce-based. This is different from the sense of “pipelining” in which one query might keep an intermediate result set hanging around because another query is known to need those results as well.
- Presto processing is vectorized; functions don’t need to be re-invoked a tuple at a time. This is different from the sense of vectorization in which several tuples are processed at once, exploiting SIMD (Single Instruction Multiple Data). Dan thinks SIMD is useful mainly for column stores, and Presto tries to be storage-architecture-agnostic.
- Presto query operators and hence query plans are dynamically compiled, down to byte code.
- Although it is generally written in Java, Presto uses direct memory management rather than relying on what Java provides. Dan believes that, despite being written in Java, Presto performs as if it were written in C.
More precisely, this is a checklist for interactive-speed parallel SQL. There are some query jobs long enough that Dan thinks you need the fault tolerance obtained from writing intermediate results to disk, a la HadoopDB (which was of course the MapReduce-based predecessor to Hadapt).
That said, Presto is a newish database technology effort, there’s lots of stuff missing from it, and there still will be lots of stuff missing from Presto years from now. Teradata has announced contribution plans to Presto for, give or take, the next year, in three phases:
- Phase 1 (released immediately, and hence in particular already done):
- An installer.
- More documentation, especially around installation.
- Command-line monitoring and management.
- Phase 2 (later in 2015)
- Integrations with YARN, Ambari and soon thereafter Cloudera Manager.
- Expanded SQL coverage.
- Phase 3 (some time in 2016)
- An ODBC driver, which is of course essential for business intelligence tool connectivity.
- Other connectors (e.g. more targets for query federation).
- Further SQL coverage.
Absent from any specific plans that were disclosed to me was anything about optimization or other performance hacks, and anything about workload management beyond what can be gotten from YARN. I also suspect that much SQL coverage will still be lacking after Phase 3.
Teradata’s basic business model for Presto is:
- Teradata is selling subscriptions, for which the principal benefit is support.
- Teradata reserves the right to make some of its Presto-enhancing code subscription-only, but has no immediate plans to do so.
- Teradata being Teradata, it would love to sell you Presto-related professional services. But you’re absolutely welcome to consume Presto on the basis of license-plus-routine-support-only.
And of course Presto is usurping Hive’s role wherever that makes sense in Teradata’s data connectivity story, e.g. Teradata QueryGrid.
Finally, since I was on the phone with Justin Borgman and Dan Abadi, discussing a project that involved 16 former Hadapt engineers, I asked about Hadapt’s status. That may be summarized as:
- There are currently no new Hadapt sales.
- Only a few large Hadapt customers are still being supported by Teradata.
- The former Hadapt folks would love Hadapt or Hadapt-like technology to be integrated with Presto, but no such plans have been finalized at this time.
It’s difficult to project the rate of IT change in health care, because:
- Health care is suffused with technology — IT, medical device and biotech alike — and hence has the potential for rapid change. However, it is also the case that …
- … health care is heavily bureaucratic, political and regulated.
Timing aside, it is clear that health care change will be drastic. The IT part of that starts with vastly comprehensive electronic health records, which will be accessible (in part or whole as the case may be) by patients, care givers, care payers and researchers alike. I expect elements of such records to include:
- The human-generated part of what’s in ordinary paper health records today, but across a patient’s entire lifetime. This of course includes notes created by doctors and other care-givers.
- Large amounts of machine-generated data, including:
- The results of clinical tests. Continued innovation can be expected in testing, for reasons that include:
- Most tests exploit electronic technology. Progress in electronics is intense.
- Biomedical research is itself intense.
- In particular, most research technologies (for example gene sequencing) can be made cheap enough over time to be affordable clinically.
- The output of consumer health-monitoring devices — e.g. Fitbit and its successors. The buzzword here is “quantified self”, but what it boils down to is that every moment of our lives will be measured and recorded.
The vastly greater amounts of data cited above will allow for greatly changed analytics.
- Right now, medical decisions are made based on research that looks at a few data points each for a specially-recruited sample of patients, then draws conclusions based on simplistic and questionable statistical methods.
- More sophisticated analytic methods are commonly used, but almost always still to aid in the discovery and formation of hypotheses that will then be validated, if at all, using the bad old analytic techniques.
- State-of-the-art predictive modeling, applied to vastly more data, will surely yield far better results.
And so I believe that health care itself will be revolutionized.
- Diagnosis will be much more accurate, pretty much across the board, except in those limited areas where it’s already excellent today.
- Medication regimens will be much more personalized. (Pharma manufacturing may have to change greatly as a result.) So will other treatments. So will diet/fitness regimens.
- The vulnerable (elderly, hospital patients) will be more accurately and comprehensively monitored. Also, their care will likely be aided by robotics.
- Some of the same things will be true of infants and toddlers. (In other cases, they get such close attention today that I can’t imagine how it could be greatly increased. )
I believe that this will all happen because I believe that it will make health care vastly more successful. And if I’m right about that, no obstacles will be able to prevent it from coming into play — not cost (which will keep going down in a quasi-Moore’s-Law way), not bureaucratic inertia (although that will continue to slow things greatly), and not privacy fears (despite the challenges cited below).
So what are the IT implications of all this?
- I already mentioned the need for new (or newly-used) kinds of predictive modeling.
- Probably in association with those, event detection — which in many but not all cases will amount to anomaly detection — will be huge. If one goal is to let the elderly and ailing live independently, but receive help when it’s needed — well, recognizing when that help is needed will be crucial. Similar dynamics will occur in hospitals. (A toy sketch of the idea follows this list.)
- And in support of that, there will be a great amount of monitoring, and hence strong demands upon sensors and recognition. Potentially, all five human senses will be mimicked, among others. These technologies will become even more important in health care if I’m right that robotics will play a big role.
- Data quality will be a major challenge, especially in the doctors’-notes parts of health records. Reasons start:
- Different medical professionals might evaluate the same situation differently; diagnosis is a craft rather than a dumb, repeatable skill.
- If entries are selected from a predefined set of options, none may be a perfect match to the doctor’s actual opinion.
- Doctors often say what’s needful to have their decisions (care, tests, etc.) approved, whether or not it precisely matches what they really think. Thus, there are significant incentives to enter bad data.
- Free-text data is more central to health care than to many other application areas, and text data is inherently dirty.
- Health records are decades behind many other application areas in moving from paper to IT.
- Data integration problems will also be and indeed already are huge, because different health care providers have addressed the tough challenges of record-keeping in different ways.
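Here’s that sketch: it flags a reading (say, resting heart rate) when it drifts far from a person’s own recent baseline. It is not a clinical algorithm; the window size and threshold are invented.

```python
# Purely illustrative anomaly detection; real monitoring would be far subtler.
from collections import deque
from statistics import mean, stdev

class BaselineMonitor:
    def __init__(self, window=60, threshold=3.0):
        self.readings = deque(maxlen=window)   # rolling personal baseline
        self.threshold = threshold             # "how many standard deviations is odd"

    def observe(self, value):
        """Record a reading; return True if it looks anomalous vs. recent history."""
        anomalous = False
        if len(self.readings) >= 10:
            mu, sigma = mean(self.readings), stdev(self.readings)
            anomalous = sigma > 0 and abs(value - mu) > self.threshold * sigma
        self.readings.append(value)
        return anomalous

monitor = BaselineMonitor()
for beat in [72, 70, 74, 71, 73, 69, 72, 70, 71, 73, 72, 118]:
    if monitor.observe(beat):
        print("possible event:", beat)   # fires on the 118
```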
As for data management — well, almost everything discussed in this blog could come into play.
- A person’s entire medical record resembles the kind of mess increasingly often dumped these days into NoSQL — typically MongoDB, Cassandra, or HBase.
- There are plenty of business-transaction records in the mix, of the kind that have long been managed by RDBMS.
- There are a whole lot of diverse machines in the mix, and managing the data to keep such a menagerie running is commonly the job of Splunk or streaming-enhanced Hadoop.
- There’s a lot of free text in medical records. Also images, video and so on.
- Since graph analytics is used in research today, it might at some point make its way into clinical use.
Finally, let me say:
- Data-driven medicine cannot live up to its potential unless researchers can investigate data sets comprising private information of large numbers of people.
- Researchers will not have the appropriate permissions unless privacy law moves toward a basis in data use, rather than exclusively regulating data possession.
- The New York Times and Hacker News discussed the benefits of using your own medical records a couple months ago.
- I wrote about the monitoring/early response aspects of health care in February, 2015.
- Perhaps my most recent survey of privacy issues was in September, 2014.
- A pretty good survey of the debate about statistical methods in medical research came out in December, 2013.
I talked with my clients at MemSQL about the release of MemSQL 4.0. Let’s start with the reminders:
- MemSQL started out as an in-memory OLTP (OnLine Transaction Processing) DBMS …
- … but quickly positioned with “We also do ‘real-time’ analytic processing” …
- … and backed that up by adding a flash-based column store option …
- … before Gartner ever got around to popularizing the term HTAP (Hybrid Transaction and Analytic Processing).
- There’s also a JSON option.
The main new aspects of MemSQL 4.0 are:
- Geospatial indexing. This is for me the most interesting part.
- A new optimizer and, I suppose, query planner …
- … which in particular allow for serious distributed joins.
- Some rather parallel-sounding connectors to Spark, Hadoop and Amazon S3.
- Usual-suspect stuff including:
- More SQL coverage (I forgot to ask for details).
- Some added or enhanced administrative/tuning/whatever tools (again, I forgot to ask for details).
- Surely some general Bottleneck Whack-A-Mole.
There’s also a new free MemSQL “Community Edition”. MemSQL hopes you’ll experiment with this but not use it in production. And MemSQL pricing is now wholly based on RAM usage, so the column store is quasi-free from a licensing standpoint as well.
Before MemSQL 4.0, distributed joins were restricted to the easy cases:
- Two tables are distributed (i.e. sharded) on the same key.
- One table is small enough to be broadcast to each node.
Now arbitrary tables can be joined, with data reshuffling as needed. Notes on MemSQL 4.0 joins include:
- Join algorithms are currently nested-loop and hash, and in “narrow cases” also merge.
- MemSQL fondly believes that its in-memory indexes work very well for nested-loop joins.
- The new optimizer is fully cost-based (but I didn’t get much clarity as to the cost estimators for JSON).
- MemSQL’s indexing scheme, skip lists, had histograms anyway, with the cutesy name skiplistogram.
- MemSQL’s queries have always been compiled, and of course have to be planned before compilation. However, there’s a little bit of plan flexibility built in based on the specific values queried for, aka “parameter-sensitive plans” or “run-time plan choosing”.
To understand the Spark/MemSQL connector, recall that MemSQL has “leaf” nodes, which store data, and “aggregator” nodes, which combine query results and ship them back to the requesting client. The Spark/MemSQL connector manages to skip the aggregation step, instead shipping data directly from the various MemSQL leaf nodes to a Spark cluster. In the other direction, a Spark RDD can be saved into MemSQL as a table. This is also somehow parallel, and can be configured either as a batch update or as an append; intermediate “conflict resolution” policies are possible as well.
In other connectivity notes:
- MemSQL’s idea of a lambda architecture involves a Kafka stream, with data likely being stored twice (in Hadoop and MemSQL).
- MemSQL likes and supports the Spark DataFrame API, and says financial trading firms are already using it.
Other application areas cited for streaming/lambda kinds of architectures are — you guessed it! — ad-tech and “anomaly detection”.
And now to the geospatial stuff. I thought I heard:
- A “point” is actually a square region less than 1 mm on a side.
- There are on the order of 2^30 such points on the surface of the Earth.
Given that Earth’s surface area is a little over 500,000,000 square kilometers, I’d think something like 2^70 would be a better figure, but fortunately that discrepancy doesn’t matter to the rest of the discussion.
Anyhow, the two popular alternatives for geospatial indexing are R-trees and space-filling curves, and MemSQL favors the latter. (One issue MemSQL sees with R-trees is concurrency.) Notes on space-filling curves start:
- In this context, a space-filling curve is a sequential numbering of points in a higher-dimensional space. (In MemSQL’s case, the dimension is two.)
- Hilbert curves seem to be in vogue, including at MemSQL.
- Nice properties of Hilbert space-filling curves include:
- Numbers near each other always correspond to points near each other.
- The converse is almost always true as well.*
- If you take a sequence of numbers that is simply the set of all possibilities with a particular prefix string, that will correspond to a square region. (The shorter the prefix, the larger the square.)
*You could say it’s true except in edge cases … but then you’d deserve to be punished.
Given all that, my understanding of the way MemSQL indexes geospatial stuff — specifically points and polygons — is:
- Points have numbers assigned to them by the space-filling curve; those are indexed in MemSQL’s usual way. (Skip lists.)
- A polygon is represented by its vertices. Take the longest prefix their curve numbers share. That could be used to index the polygon (you’d retrieve a square region that includes it). But actually …
- … a polygon is covered by a union of such special square regions, and indexed accordingly; I neglected to ask exactly how the covering set of squares was chosen. (A toy sketch of the underlying curve math follows.)
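This sketch is adapted from the well-known xy2d/d2xy routines and is illustrative only, not MemSQL’s implementation; it demonstrates the two properties above (consecutive curve positions are adjacent grid points, and positions sharing a base-4 prefix fall within one square).

```python
# Hilbert-curve numbering on an n x n grid (n a power of 2). Illustrative only.

def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curves are oriented consistently."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Map grid point (x, y) to its position d along the curve."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d

def d2xy(n, d):
    """Inverse mapping: curve position d back to grid point (x, y)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

n = 16
# Property 1: consecutive curve positions are adjacent grid points.
assert all(
    abs(d2xy(n, d)[0] - d2xy(n, d + 1)[0]) + abs(d2xy(n, d)[1] - d2xy(n, d + 1)[1]) == 1
    for d in range(n * n - 1)
)
# Property 2: positions sharing a base-4 prefix cover a square. Here, the 16
# positions 32..47 (a shared 2-digit base-4 prefix) fill one 4x4 block.
block = {d2xy(n, d) for d in range(32, 48)}
xs, ys = {p[0] for p in block}, {p[1] for p in block}
assert len(block) == 16 and max(xs) - min(xs) == 3 and max(ys) - min(ys) == 3
```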
As for company metrics — MemSQL cites >50 customers and >60 employees.
1. There are multiple ways in which analytics is inherently modular. For example:
- Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
- The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
- Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.
Also, analytics is inherently iterative.
- Everything I just called “modular” can reasonably be called “iterative” as well.
- So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”
If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.
2. In 2011, I wrote, in the context of agile predictive analytics, that
… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.
I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises.
3. Speaking of Gartner, Mark Beyer tweeted
In data management’s future “hybrid” becomes a useless term. Data management is mutable, location agnostic and services oriented.
And that’s why I launched DBMS2 a decade ago, for “DataBase Management System SERVICES”.
A post earlier this year offers a strong clue as to why Mark’s tweet was at least directionally correct: The best structures for writing data are the worst for query, and vice-versa.
4. The foregoing notwithstanding, I continue to believe that there’s a large place in the world for “full-stack” analytics. Of course, some stacks are fuller than others, with SaaS (Software as a Service) offerings probably being the only true complete-stack products.
5. Speaking of full-stack vendors, some of the thoughts in this post were sparked by a recent conversation with Platfora. Platfora, of course, is full-stack except for the Hadoop underneath. They’ve taken to saying “data lake” instead of Hadoop, because they believe:
- It’s a more benefits-oriented than geek-oriented term.
- It seems to be more popular than the roughly equivalent terms “data hub” or “data reservoir”.
6. Platfora is coy about metrics, but does boast of high growth, and had >100 employees earlier this year. However, they are refreshingly precise about competition, saying they primarily see four competitors — Tableau, SAS Visual Analytics, Datameer (“sometimes”), and Oracle Data Discovery (who they view as flatteringly imitative of them).
Platfora seems to have a classic BI “land-and-expand” kind of model, with initial installations commonly being a few servers and a few terabytes. Applications cited were the usual suspects — customer analytics, clickstream, and compliance/governance. But they do have some big customer/big database stories as well, including:
- 100s of terabytes or more (but with a “lens” typically being 5 TB or less).
- 4-5 customers who pressed them to break a previous cap of 2 billion discrete values.
7. Another full-stack vendor, ScalingData, has been renamed to Rocana, for “root cause analysis”. I’m hearing broader support for their ideas about BI/predictive modeling integration. For example, Platfora has something similar on its roadmap.
- I did a kind of analytics overview last month, which had a whole lot of links in it. This post is meant to be additive to that one.
I’m going to be out of sorts this week, due to a colonoscopy. (Between the prep, the procedure, and the recovery, that’s a multi-day disablement.) In the interim, here’s a collection of links, quick comments and the like.
1. Are you an engineer considering a start-up? This post is for you. It’s based on my long experience in and around such scenarios, and includes a section on “Deadly yet common mistakes”.
2. There seems to be a lot of confusion regarding the business model at my clients Databricks. Indeed, my own understanding of Databricks’ on-premises business has changed recently. There are no changes in my beliefs that:
- Databricks does not directly license or support on-premises Spark users. Rather …
- … it helps partner companies to do so, where:
- Examples of partner companies include usual-suspect Hadoop distribution vendors, and DataStax.
- “Help” commonly includes higher-level support.
However, I now get the impression that revenue from such relationships is a bigger deal to Databricks than I previously thought.
Databricks, by the way, has grown to >50 people.
3. DJ Patil and Ruslan Belkin apparently had a great session on lessons learned, covering a lot of ground. Many of the points are worth reading, but one in particular echoed something I’m hearing lots of places — “Data is super messy, and data cleanup will always be literally 80% of the work.” Actually, I’d replace the “always” by something like “very often”, and even that mainly for newish warehouses, data marts or datasets. But directionally the comment makes a whole lot of sense.
4. Of course, dirty data is a particular problem when the data is free-text.
5. In 2010 I wrote that the use of textual news information in investment algorithms had become “more common”. It’s become a bigger deal since. For example:
- It seems to be quite profitable to do automated options trading based on the parsing of tweets.
- In a funny example, Tesla Motors stock gyrated due to Tesla’s April Fool’s press release about a new wristwatch product.
6. Sometimes a post here gets a comment thread so rich it’s worth doubling back to see what other folks added. I think the recent geek-out on indexes is one such case. Good stuff was added by multiple smart people.
7. Finally, I’ve been banging the drum for electronic health records for a long time, arguing that the great difficulties should be solved due to the great benefits of doing so. The Hacker News/New York Times combo offers a good recent discussion of the subject.
Fishbowl Solutions recently held a webinar about our newest product, ControlCenter for Oracle WebCenter Content. Product manager Kim Negaard discussed ControlCenter’s unique features and advantages for controlled document management. If you missed the webinar, you can now view it on YouTube:
Below is a summary of questions Kim answered during the webinar. If you have any unanswered questions, or would just like to learn more, feel free to contact Fishbowl by emailing firstname.lastname@example.org.
Is this a custom component that I can apply to current documents and workflow processes, or is there additional customization that needs to be done?
ControlCenter is installed as a custom component and will work with current documents. You can identify which document types and workflows you want to be accessible through ControlCenter, and additional customizations are not necessary.
Does the metadata sync with the header information in both Microsoft Word and Adobe PDF documents?
The metadata synchronization works with PDF and Word documents.
Do you need to have a specific template in order to synchronize?
Not really. There are two ways you can insert metadata into a document:
- You can use the header or footer area and replace anything existing in the current header with a standard header.
- You can use properties fields: for example, anytime the Microsoft Office properties field “document ID” occurs in a document, the dDoc name value could be inserted into that field.
In either of these cases, the formatting standards wouldn’t be rigid, but you would ideally want to know before rolling this out which approach you’d like to take.
What version of WebCenter do you need for this?
This is supported on WebCenter Content version 11.1.8 or above.
Is it completely built on component architecture as a custom component?
Does ControlCenter require the Site Studio component to be enabled in order to work?
No, Site Studio doesn’t have to be enabled.
Does ControlCenter require Records Management to be installed on top of WCC?
Does the custom component require additional metadata specific to ControlCenter?
Not exactly. We’ve made it pretty flexible; for example, with the scheduled reviews, we don’t force you to create a field called “review date”. We allow you to pick any date field you want to use for the scheduled reviews, so that if you already have something in place you could use it.
Where do you put ControlCenter if you don’t already have an existing server?
You do need to have an Oracle WebCenter Content server to run ControlCenter. If you don’t have a server, you’ll need to purchase a license for WebCenter Content. However, you don’t need any additional servers besides your WebCenter Content server.
Does the notification have to go to a specific author, or could you send it to a group or list in case that author is no longer with the organization?
The notification system is very flexible in terms of who you can send documents to. You can send it to a group, like an entire department or group of team leads, or it can be configured to send to just one person, like the document author or owner.
How does this work with profiles?
ControlCenter fully supports profiles. When you view content information for a document, it will display the metadata using the profile. If you check in a document using a check-in profile, then all of the metadata and values from that profile will be respected and enforced within ControlCenter. I should also mention that ControlCenter does support DCLs, so if you’re using DCLs those will be respected, both from a check-in perspective and in the metadata on the left. So as you are creating browse navigation in ControlCenter, it will recognize your DCLs and allow you to filter with the proper relationships.
Does it integrate with or support OID (Oracle Internet Directory)/OAM (Oracle Access Manager)?
ControlCenter will use whatever authentication configuration you already have set up. So if you’re using OAM with WebCenter Content, then that’s what ControlCenter will use as well.
Does it support any custom metadata that has already been created?
Yes, if you have custom metadata fields that are already created, any of those can be exposed in ControlCenter.
Does it support any other customizations that have already been defined in the WebCenter Content instance?
It will depend on the nature of the customization. In general, if you have customized the WebCenter UI, those customizations would not show up in ControlCenter because ControlCenter has a separate UI; however, customizations on the back end, like workflow or security, would likely carry over into ControlCenter.
Does ControlCenter integrate with URM?
The ControlCenter interface isn’t specifically integrated with URM.
In case of a cluster environment, does ControlCenter need to be installed on both WebCenter Content servers?
Yes, if you have a clustered WebCenter Content environment, you would need to install ControlCenter on both/all nodes of the clustered environment.
Does it change anything within core WebCenter?
Not really. The only change to the core UI is an additional button in the Browse Content menu that will take you to the ControlCenter interface. But other than that, ControlCenter doesn’t change or prevent you from using the regular Content Server interface.
Can you customize the look and feel (icons, colors, etc.)?
Yes. We will work with you to customize the look and feel, widgets, etc. The architecture that we used when we created this supports customization.
Wanted to drop a quick note here that Google released the latest version of the Google Search Appliance software last week. That brings the most current version up to 7.4.0.G.72 and officially makes the 7.0 version of the appliance software end-of-life.
New features from the support site posting:
- Seamless Integration with Microsoft: we’re releasing our SharePoint and Active Directory 4.0 connectors out of beta. These connectors provide improved scalability, easier configuration and tighter integration with SharePoint. Additionally, the GSA now supports ADFS which improves our ability to integrate with Windows security.
- Strengthening GSA as a Platform: GSA 7.4 improves the overall quality of the GSA as a search platform. Examples of this include better monitoring through exposing new SNMP metrics, better administration of the internal group resolution repository (GroupDB) and more support for security standards through a generic SAML Identity Provider (e.g., OpenSAML).
- Improved Performance: this new GSA version provides better performance in 3 areas: crawling, serving and authorizing search results.
- Enhanced Connector Platform: we have released new features to our 4.0 connector platform like a new beta Database connector and support for SharePoint 2013 multi-tenant. We’ll continue to release new Connector functionality regularly.
I am particularly excited about the ability to see into the onboard GroupDB and manage it through the admin console.
Link to 7.4 documentation: https://support.google.com/gsa/answer/6187273?hl=en&ref_topic=2709671
Indexes are central to database management.
- My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
- … and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
- Recently, I’ve had numerous conversations in which indexing strategies played a central role.
Perhaps it’s time for a round-up post on indexing.
1. First, let’s review some basics. Classically:
- An index is a DBMS data structure that you probe to discover where to find the data you really want.
- Indexes make data retrieval much more selective and hence faster.
- While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
- Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.) A toy sketch of these basics appears after this list.
- A DBMS or other system can index data it doesn’t control.
- This is common in the case of text indexing, and not just in public search engines like Google. Performance design might speak against recopying text documents. So might security.
- This capability overlaps with but isn’t exactly the same thing as an “external tables” feature in an RDBMS.
- Indexes can be updated in batch mode, rather than real time.
- Most famously, this is why Google invented MapReduce.
- Indeed, in cases where you index external data, it’s almost mandatory.
- Indexes written in real-time are often cleaned up in batch, or at least asynchronously with the writes.
- The most famous example is probably the rebalancing of B-trees.
- Append-only index writes call for later clean-up as well.
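Here’s that sketch (the “table”, key names, and payloads are invented): a sorted in-memory index, maintained with Python’s bisect module, makes lookups selective, at the price of extra work on every write.

```python
# Illustrative only: a list-of-rows "table" plus a sorted "index" on order_id.
import bisect

table = []        # rows in arrival order: (order_id, payload)
index_keys = []   # sorted order_ids
index_rows = []   # row positions, aligned with index_keys

def insert(order_id, payload):
    table.append((order_id, payload))
    pos = bisect.bisect_left(index_keys, order_id)
    index_keys.insert(pos, order_id)          # the extra work every write now pays
    index_rows.insert(pos, len(table) - 1)

def lookup_scan(order_id):
    return [row for row in table if row[0] == order_id]       # touches every row

def lookup_indexed(order_id):
    pos = bisect.bisect_left(index_keys, order_id)            # probe the index ...
    out = []
    while pos < len(index_keys) and index_keys[pos] == order_id:
        out.append(table[index_rows[pos]])                    # ... then fetch only matches
        pos += 1
    return out

for i in range(10000):
    insert(i, "order-%d" % i)
assert lookup_indexed(4242) == lookup_scan(4242) == [(4242, "order-4242")]
```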
3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.
- The latest example is my former clients at Tokutek, who just got sold to Percona in a presumably small deal — regrettably without having yet paid me all the money I’m owed. (By the way, the press release for that acquisition highlights TokuDB’s advantages in compression much more than it mentions straight performance.)
- In a recent conversation with my clients at MemSQL, I basically heard from Nikita Shamgunov that:
- He felt that lockless indexes were essential to scale-out, and to that end …
- … he picked skip lists, not because they were the optimal lockless index, but because they were good enough and a lot easier to implement than the alternatives (a toy sketch of a skip list appears below). (Edit: Actually, see Nikita’s comment below.)
- Red-black trees are said to be better than B-trees. But they come up so rarely that I don’t really understand how they work.
- solidDB did something cool with Patricia tries years ago. McObject and ScaleDB tried them too. Few people noticed or cared.
I’ll try to explain this paradox below.
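Here’s the promised toy skip list sketch. Part of why skip lists count as “a lot easier to implement” is that a workable single-threaded version fits in a few dozen lines; this is illustrative only, and nothing like a production lockless implementation.

```python
# A toy single-threaded skip list: a sorted linked list with extra "express
# lanes" chosen by coin flips, giving expected O(log n) search and insert.
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):        # walk down the levels ...
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]             # ... skipping ahead where possible
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):                   # splice the new node in at each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

sl = SkipList()
for k in [30, 10, 50, 20, 40]:
    sl.insert(k)
assert sl.contains(20) and not sl.contains(25)
```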
4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although the columns themselves played some of the role of indexes, especially given the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.
Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans can be much faster from spinning disk than more selective retrieval, so table scans often outperformed index-driven retrieval.
5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.
- To avoid doing true full table scans, Netezza relies on “zone maps”. These are a prominent example of what is now often called data skipping.
- Bloom filters in essence hash data into a short string of bits. If there’s a hash collision, excess data is returned. (See the sketch after this list.)
- Geospatial queries often want to return data for regions that have no simple representation in the database. So instead they bring back data for a superset of the desired region, which the DBMS does know how to return.
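Here’s that sketch: a toy Bloom filter (illustrative only, not any particular vendor’s implementation). It shows why collisions mean “excess data”: the filter can say “maybe present” for keys it never saw, so whatever it lets through still has to be re-checked against the real data.

```python
# A tiny Bloom filter: a few bit positions per key, derived from one strong hash.
# Lookups can return false positives (hence the re-check) but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # an int used as a bit array

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
for customer_id in ("c-001", "c-002", "c-003"):
    bf.add(customer_id)

assert bf.might_contain("c-002")    # keys that were added always pass
print(bf.might_contain("c-999"))    # usually False; occasionally a false positive
```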
6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.
The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:
- For computational purposes, a geographic region is of course a lattice of points rather than a true 2-dimensional continuum.
- So you take a lattice — perhaps in the overall shape of a square — and arrange its points in a sequence, so that each point is adjacent in some way to its predecessor.
- Then regions on a plane are covered by subsequences (or unions of same).
The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.
7. And finally — mature DBMS use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when last I asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.
I chatted with the MariaDB folks on Tuesday. Let me start by noting:
- MariaDB, the product, is a MySQL fork.
- MariaDB, product and company alike, are essentially a reaction to Oracle’s acquisition of MySQL. A lot of the key players are previously from MySQL.
- MariaDB, the company, is the former SkySQL …
- … which acquired or is the surviving entity of a merger with The Monty Program, which originated MariaDB. According to Wikipedia, something called the MariaDB Foundation is also in the mix.
- I get the impression SkySQL mainly provided services around MySQL, especially remote DBA.
- It appears that a lot of MariaDB’s technical differentiation going forward is planned to be in a companion product called MaxScale, which reached Version 1.0 general availability earlier this year.
The numbers around MariaDB are a little vague. I was given the figure that there were ~500 customers total, but I couldn’t figure out what they were customers for. Remote DBA services? MariaDB support subscriptions? Something else? I presume there are some customers in each category, but I don’t know the mix. Other notes on MariaDB the company are:
- ~80 people in ~15 countries.
- 20-25 engineers, which hopefully doesn’t count a few field support people.
- “Tiny” headquarters in Helsinki.
- Business leadership growing in the US and especially the SF area.
MariaDB, the company, also has an OEM business. Part of their pitch is licensing for connectors — specifically LGPL — that hopefully gets around some of the legal headaches for MySQL engine suppliers.
MaxScale is a proxy, which starts out by intercepting and parsing MariaDB queries.
- As you might guess, MaxScale has a sharding story.
- All MaxScale sharding is transparent.
- Right now MaxScale sharding is “schema-based”, which I interpret to mean that different tables can be on different servers.
- Planned to come soon is “key-based” sharding, which I interpret to mean the kind of sharding that lets you scale a table across multiple servers without the application needing to know that is happening. (A toy sketch of the distinction follows this list.)
- I didn’t ask about join performance when tables are key-sharded.
- MaxScale includes a firewall.
- MaxScale has 5 “well-defined” APIs.
- I think MaxScale’s development schedule is “asynchronous” from that of the MariaDB product.
- Further, MaxScale has a “plug-in” architecture that is said to make it easy to extend.
- One plug-in on the roadmap is replication into Hadoop-based tables. (I think “into” is correct.)
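Here’s that sketch of the sharding distinction; it is not MaxScale’s actual logic, and the server and table names are made up.

```python
# "Schema-based" sharding pins whole tables to servers; "key-based" sharding
# spreads one table's rows across servers by a shard key, with the proxy (not
# the application) deciding where each row lives. Illustrative only.
SERVERS = ["mariadb-a", "mariadb-b", "mariadb-c"]

TABLE_MAP = {"customers": "mariadb-a", "orders": "mariadb-b", "clicks": "mariadb-c"}

def route_by_schema(table):
    return TABLE_MAP[table]                 # every row of the table goes to one server

def route_by_key(table, shard_key):
    # hash() stands in for a stable hash function; a real router would use one
    # that does not vary across processes.
    return SERVERS[hash((table, shard_key)) % len(SERVERS)]

print(route_by_schema("orders"))            # all 'orders' traffic -> mariadb-b
print(route_by_key("orders", 12345))        # 'orders' rows fan out across servers
```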
I had trouble figuring out the differences between MariaDB’s free and enterprise editions. Specifically, I thought I heard that there were no feature differences, but I also thought I heard examples of feature differences. Further, there are third-party products included, but plans to replace some of those with in-house developed products in the future.
A few more notes:
- MariaDB’s optimizer is rewritten vs. MySQL.
- Like other vendors before it, MariaDB has gotten bored with its old version numbering scheme and jumped to 10.0.
- One of the storage engines MariaDB ships is TokuDB. Surprisingly, TokuDB’s most appreciated benefit seems to be compression, not performance.
- As an example of significant outside code contributions, MariaDB cites Google contributing whole-database encryption into what will be MariaDB 10.1.
- Online schema change is on the roadmap.
- There’s ~$20 million of venture capital in the backstory.
- Engineering is mainly in Germany, Eastern Europe, and the US.
- MariaDB Power8 performance is reportedly great (2X Intel Sandy Bridge or a little better). Power8 sales are mainly in Europe.
I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.
In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks:
- Reporting for regulatory compliance (financial or otherwise). The purpose of this is to follow rules.
- This is non-innovative almost by design.
- Somebody probably originally issued the regulations for a reason, so the reports may be useful for monitoring purposes. Failing that, they probably are supported by the same infrastructure that also tries to do useful monitoring.
- Data governance is crucial. Submitting incorrect data to regulators can have dire consequences. That said, when we hear about poor governance of poly-structured data, I question whether that data is being used in the applications where strong governance is actually needed.
- Other routine, monitoring-oriented business intelligence. The purpose can be general monitoring or general communication. Sometimes the purpose is lost to history entirely. This is generally lame, at least technically, unless interesting requirements are added.
- Displaying it on mobile devices makes it snazzier, and in some cases more convenient. Whoop-de-do.
- Usually what makes it interesting these days is the desire to actually explore the data and gain new insights. More on that below.
- BI for inherently non-tabular data is definitely an unsolved problem.
- Integration of BI with enterprise apps continues to be an interesting subject, but one I haven’t learned anything new about recently.
- All that said, this is an area for some of the most demanding classical data warehouse installations, specifically ones that are demanding along dimensions such as concurrency or schema complexity. (Recall that the most complicated data warehouses are often not the largest ones.) Data governance can be important here as well.
- Investigation by business analysts or line-of-business executives. Much of the action is here, not least because …
- … it’s something of a catch-all category.
- “Business analyst” is a flexible job description, and business analysts can have a variety of goals.
- Alleged line-of-business executives doing business-analyst work are commonly delegating it to fuller-time business analysts.
- These folks can probably manage departmental analytic RDBMS if they need to (that was one of Netezza’s early value propositions), but a Hadoop cluster stretches them. So easy deployment and administration stories — e.g. “Hadoop with less strain”/”Spark with less strain” — can have merit. This could be true even if there’s a separate team of data wranglers pre-processing data that the analysts will then work with.
- Further, when it comes to business intelligence:
- Tableau and its predecessors have set a high bar for quality of user interface.
- The non-tabular BI challenges are present in spades.
- ETL reduction/elimination (Extract/Transform/Load) is a major need.
- Predictive modeling by business analysts is problematic from beginning to end; much progress needs to be made here.
- Investigation by data scientists. The “data scientist”/”business analyst” distinction is hardly precise. But for the purpose of this post, a business analyst may be presumed to excel at elementary mathematics — even stock analysts just use math at a high school level — and at using tabular databases, while data scientists (individuals or teams) have broader skill sets and address harder technical or mathematical problems.
- Rapid-response trouble-shooting. There are some folks — for example network operators — whose job includes monitoring things moment to moment and, when there’s a problem, reacting quickly.
- “Operationalization” of investigative results. This is a hot area, because doing something with insights — “insights” being a hot analytic buzzword these days — is more valuable than merely having them.
- This is where short-request kinds of data stores — NoSQL or otherwise — are often stressed, especially in the low-latency analytics they need to support.
- This is the big area for any kind of “closed loop” predictive modeling story, e.g. in experimentation.
- At least in theory, this is another big area for streaming.
And finally — across multiple kinds of user group and use case, there are some applications that will only be possible when sensors or other data sources improve.
Bottom line: Almost every interesting analytic technology problem is worth solving for some market, but please be careful about finding the right match.
- Where the innovation is (January, 2015)
- Various notes (November, 2014)
- “Freeing business analysts from IT” (August, 2014)
- Data integration as a business opportunity (July, 2014)
- Differentiation in BI usability (March, 2014)
- Analytic database distinctions (February, 2013)
- Juggling analytic databases (March, 2012)
- Applications of an analytic kind (February, 2012)
- Agile predictive analytics (November, 2011)
- Eight kinds of analytic database (July, 2011)
- Use cases for low-latency analytics (April, 2011)
- The three principal kinds of analytic business benefit (March, 2011)
In less than a week, Fishbowl’s WebCenter experts will be heading to sunny Las Vegas for Collaborate 2015! We have a wide range of activities planned, and are looking forward to meeting and learning from other WebCenter users. If you’d like to view a full list of what Fishbowl will be participating in at Collaborate, download our Focus on Fishbowl guide. IOUG also has detailed information about Collaborate on their website.
Exhibit Information | Booth #948
Stop by Fishbowl’s booth for demos and discussions of Google Search for WebCenter, next-generation portals and intranets, image-enabled accounts payable solutions, and our newest product, ControlCenter, which provides an intuitive user interface along with workflow and review automation for controlled document management. We’ll be holding a special giveaway related to ControlCenter; stop by the booth for more details and to also register for an iPad drawing!
Presentation Information | Room Banyan F
Fishbowl will be holding three presentations at Collaborate, all in room Banyan F at Mandalay Bay. Be sure to attend to hear firsthand how our WebCenter team is working with customers to solve business problems.
Tuesday, April 14, 3:15-4:15 PM: Engaging Employees Through an Enterprise Portal: HealthPartners Case Study
Presented by Neamul Haque of HealthPartners and Tim Gruidl and Jerry Aber of Fishbowl Solutions
- Issues HealthPartners had with previous portal sites
- Benefits of deploying a content-centric, portal-focused framework
- Improvement in end-user experience HealthPartners has seen with the new portal
Wednesday, April 15, 2:45-3:45 PM: The Doctors Company Creates Mobile-Ready Website Using Responsive and Adaptive Design
Presented by Paul Manno of The Doctors Company and Jimmy Haugen of Fishbowl Solutions
- Importance of The Doctors’ website for educating customers and prospects
- How responsive and adaptive design transformed user experience
- Technologies leveraged to create a mobile-optimized site
Thursday, April 16, 8:30-9:30 AM: Using Oracle WebCenter Content for Document Compliancy in Food and Manufacturing
Presented by Kim Negaard and George Sokol of Fishbowl Solutions
- Techniques for using revision control and automatic conversion
- How to provide additional security and auditability around document approvals
- How to increase efficiency and control over changes in documents
If you’d like to schedule a meeting with anyone on the Fishbowl team during Collaborate, feel free to contact us at email@example.com. See you in Las Vegas!
My favorite educational video growing up, by far, was a 1960 film embedded below. I love it because it pranks its viewers, starting right in the opening scene. (Start at the 0:50 mark to see what I mean.)
If you’re ever in the position of helping a kid or young adult understand physics, this video could be a great help. Frankly, it could help in political discussions as well …
First of all, you may be wondering: what is SPA and how can it improve Oracle WebCenter Portal?
It’s the future, trust me, especially when it comes to providing the best possible UX and a flexible UI to your users, and to enhancing the power of ADF Taskflows with a radical UI overhaul. ADF is great for creating rich applications fast, but to be honest it is limited in its ability to deliver a rich, flexible HTML5 UX through the ADF UI components, or to create the truly interactive design we all expect from today’s modern web-based apps and portals. Developers (and, more importantly, designers) are constrained by the out-of-the-box ADF components/taskflows and their lack of design flexibility, unless they want to create their own and extend the render kit capabilities (but extending the render kit is a topic for another post; today let’s cover SPA with Portal).
This is where SPA fits in perfectly: overhaul the ADF UI using today’s modern techniques and a design agency with the skills to build responsive components using the latest frameworks and libraries, such as Knockout, Backbone, RequireJS, and Socket.IO, with reactive templating via Mustache or Handlebars, to complement ADF and WebCenter’s services through its REST API.
But before taking the next step and thinking SPA is right for you, read these warnings!
- The taskflow interface is developed using the latest frameworks and libraries; not all partners and web design agencies are that forward-thinking, and some may not be able to build SPA components for WebCenter Portal.
- The cost can be greater, as the interface is usually designed from the ground up and requires browser and device testing. But if it’s done well, it will outmatch anything that the ADF UI layer can provide your users.
- Experience goes a long way, especially with matching your design agency with ADF-experienced developers to provide responsive web services and inline datasets.
So what is SPA? A single-page application (SPA) loads a single HTML page and dynamically updates it on the client as the user interacts, rather than reloading full pages from the server.
In our use of SPA within Portal, the SPA is made modular and wrapped within a taskflow container. That makes it possible to drop multiple components onto a page simply by adding them from the resource catalog within the WebCenter Composer view, while still providing rich interaction between components that are context aware.
Here is an example Photoshop design that we converted into a pixel-perfect, functioning SPA taskflow that enables logged-in users to manage their own list of Applications and Spaces. All server-side interaction is handled via the ADFm (model) layer, which keeps the ADF lifecycle aware of REST calls and prevents memory leaks and other possible session issues. The UI layer, meanwhile, is fully custom, with reactive dynamic templating that is far faster than ADF PPR calls, because all interaction and template updating is handled client-side. Another benefit is that this template can easily support modern responsive design techniques and be consumed by mobile devices, and it could also be deployed to other environments such as Liferay, SharePoint, and WebCenter Content, since the UI layer does not rely on ADF calls; service requests can be proxied or CORS-enabled as long as the calls are made via AJAX rather than WebSockets.
Here you can see another example of a SPA Taskflow in action displaying the JIVE Forums (above), compared against the out of the box ADF Forum Taskflow (below).
If you are looking to create functional applications fast, use the out-of-the-box taskflows, or develop your own components entirely in ADF with your development team and customize the ADF skin to improve the look.
However, if you want to take the experience to the next level and are willing to invest more in creating visually rich, interactive, modern experiences for your users, bring in a good UX/UI team that can help transform your interface layer, building components with the potential to be deployed across multiple platforms while maintaining the power of the ADF back end.
I’m currently co-authoring a white paper on modular SPA for ADF taskflows, providing examples and different development techniques that can be used to enrich the UI. I hope to get it out to you, and onto OTN, in the next six months, so keep a lookout for it.
I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.
Here are some thoughts as to why, and also as to challenges that need to be overcome.
There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:
- When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid.
- Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.
Trivial query routing or federation is … trivial.
- Databases have or can be given some kind of data catalog interface. Of course, this is easier for databases that are tabular, whether relational or MOLAP (Multidimensional OnLine Analytic Processing), but to some extent it can be done for anything.
- Combining the catalogs can be straightforward. So can routing queries through the system to the underlying data stores.
In fact, what I just described is Business Objects’ original innovation — the semantic layer — two decades ago.
Careless query routing or federation can be a performance nightmare. Do a full scan. Move all the data to some intermediate server that lacks capacity or optimization to process it quickly. Wait. Wait. Wait. Wait … hmmm, maybe this wasn’t the best data-architecture strategy.
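To make the “trivial” part concrete, and to show where it stops being trivial, here is a minimal Python sketch of routing queries through a combined catalog. The catalog structure and the connection objects’ execute() method are hypothetical stand-ins, not any particular product’s API:

```python
# Minimal sketch of naive query routing over a combined catalog.
# The connection objects and their .execute() method are hypothetical.

class FederatedCatalog:
    def __init__(self):
        self.table_to_store = {}  # table name -> name of the underlying data store

    def register(self, store_name, tables):
        for t in tables:
            self.table_to_store[t] = store_name

    def stores_for(self, tables):
        return {self.table_to_store[t] for t in tables}


def route_query(catalog, sql, tables, connections):
    """Send a query to the single store that owns every referenced table."""
    stores = catalog.stores_for(tables)
    if len(stores) == 1:
        return connections[stores.pop()].execute(sql)
    # Cross-store queries are where the nightmares start: the naive fallback
    # is to pull whole tables into a middle tier and join them there.
    raise NotImplementedError("cross-store query: needs real pushdown planning")
```

Even in this naive version, the metadata side is the easy part; the hard part is a planner that pushes work down to the underlying stores instead of shipping whole tables to a middle tier.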
Streaming goes well with federation. Some data just arrived, and you want to analyze it before it ever gets persisted. You want to analyze it in conjunction with data that’s been around longer. That’s a form of federation right there.
There are ways to navigate schema messes. Sometimes they work.
- Polishing one neat relational schema for all your data is exactly what people didn’t want to do when they decided to store a lot of the data non-relationally instead. Still, memorializing some schema after the fact may not be terribly painful.
- Even so, text search can help you navigate the data wilds. So can collaboration tools. Neither helps all the time, however.
Neither extreme view here — “It’s easy!” or “It will never work!” — seems right. Rather, I think there’s room for a lot of effort and differentiation in exposing cross-database schema information.
I’m leaving out one part of the story on purpose — how these data layers are going to be packaged, and specifically what other functionality they will be bundled with. Confidentiality would screw up that part of the discussion; so also would my doubts as to whether some of those plans are fully baked yet. That said, there’s an aspect of logical data layer to CDAP, and to Kiji as well. And of course it’s central to BI (Business Intelligence) and ETL (Extract/Transform/Load) alike.
One way or another, I don’t think the subject of logical data layers is going away any time soon.
- Implicit in this post is the belief that enterprises should and do use many different data stores (June, 2014)
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
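To illustrate the min/max data-skipping idea from the list above in the abstract, here is a small sketch of the general technique; it is not HBase’s actual file format or API:

```python
# Sketch of min/max data skipping: each file carries summary stats, and a scan
# consults the stats before deciding whether to read the file at all.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_key: str
    max_key: str
    min_ts: int
    max_ts: int

def files_to_scan(files, key_lo, key_hi, ts_lo, ts_hi):
    """Keep only the files whose min/max ranges overlap the query's ranges."""
    return [
        f for f in files
        if not (f.max_key < key_lo or f.min_key > key_hi)   # key range overlaps
        and not (f.max_ts < ts_lo or f.min_ts > ts_hi)      # time range overlaps
    ]
```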
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
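To make “bucketed by entity” concrete, here is a sketch of one common row-key pattern for that kind of data: an entity prefix plus a reversed timestamp, so that each entity’s rows sort together with the newest first. This is my illustration of the general pattern, not a design Cloudera specifically described:

```python
# Illustrative HBase-style row key for entity-bucketed time series data.
import struct

MAX_TS = 2**63 - 1  # large sentinel used to reverse millisecond timestamps

def row_key(entity_id: str, ts_millis: int) -> bytes:
    """Compose an entity-bucketed key: entity prefix, then reversed timestamp."""
    reversed_ts = MAX_TS - ts_millis
    return entity_id.encode("utf-8") + b"\x00" + struct.pack(">q", reversed_ts)

def scan_prefix(entity_id: str) -> bytes:
    """A prefix scan on this returns the entity's rows, newest first."""
    return entity_id.encode("utf-8") + b"\x00"
```

Keys like these reward HBase’s sorted-file layout: one prefix scan returns a single device’s or user’s recent history without touching anybody else’s data.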
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that:
- Pinterest has a large HBase database on SSDs (Solid-State Drives), a large fraction of which is actually in RAM.
- eBay has an HBase database largely on spinning disk, used to inform its search engine.
Cloudera also told me of a Very Famous Company that has many 100s of HBase nodes managing petabytes of mobile device data. That sounds like multiple terabytes per node even before considering a replication factor, so I presume it’s disk-based as well. The takeaway from those examples, other than general big-use-case impressiveness, is that storage choices for HBase can vary greatly by user and application.
4. HBase has master/master geographically remote replication. I gather that Yahoo replicates between a couple of 1000-node clusters, on behalf of its Flurry operation. HBase also has the technical capability to segment data across geographies — i.e., the geo-partitioning feature essential to data sovereignty compliance — but no actual implementations came to mind.
5. Besides the ones already mentioned, and famed HBase user Facebook, a few other users came up.
- It seems to be common for ad-tech companies to store in HBase the data that arrives from many different computers and mobile devices.
- An agency that Cloudera didn’t name, but which is obviously something like the SEC or CFTC, stores all trade data in HBase.
- Cerner — or perhaps its software — stores data in HBase on a patient-by-patient basis.
In general, Cloudera suggested that HBase was used in a fair number of OEM situations.
6. Finally, I have one number: As of January, 2014 there were 20,000 HBase nodes managed by Cloudera software. Obviously, that number is growing very quickly, and of course there are many HBase nodes that Cloudera has nothing to do with.
- A lot of this echoes what I hear from DataStax (December, 2013), notwithstanding the consensus that HBase and Cassandra rarely compete in the marketplace.
Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. Here’s the small amount that I have actually figured out.
As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.
- In a relational database, a record is a collection of <name, value> pairs with a particular and predictable — i.e. derived from the table definition — sequence of names. Further, a record usually has an identifying key (commonly one of the first values).
- Something similar can be said about structured-document stores — i.e. JSON or XML — except that the sequence of names may not be consistent from one document to the next. Further, there’s commonly a hierarchical relationship among the names.
- For these purposes, a “wide-column” NoSQL store like Cassandra or HBase can be viewed much as a structured-document store, albeit with different performance optimizations and characteristics and a different flavor of DML (Data Manipulation Language).
Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:
- The NoSQL database is likely to have more null values.
- The NoSQL database, in a naive translation toward relational, may have repeated values. So a less naive translation might require extra tables.
That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.
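As a toy example of the kind of script alluded to above, here is one way a single JSON document might be flattened: scalar fields become a row in a parent table (with nulls where the document omits a field), and repeated values become rows in a child table. The document shape and field names are invented for illustration:

```python
# Toy flattening of a JSON document into relational-style rows.

doc = {
    "_id": "u17",
    "name": "Alice",
    "signup": "2015-02-01",
    "devices": [
        {"type": "phone", "os": "iOS"},
        {"type": "tablet", "os": "Android"},
    ],
}

def flatten(doc, scalar_fields):
    # Missing fields become None -- the extra nulls mentioned above.
    parent = {f: doc.get(f) for f in scalar_fields}
    # Repeated values go to a separate child table -- the extra tables.
    children = [{"parent_id": doc["_id"], **d} for d in doc.get("devices", [])]
    return parent, children

parent_row, device_rows = flatten(doc, ["_id", "name", "signup", "plan"])
# parent_row["plan"] is None; device_rows would land in a "devices" table.
```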
Another set of issues arises on the performance side. Many NoSQL systems have indexes, and thus some kind of filtering capability. Some — e.g. MongoDB — have aggregation frameworks as well. So if you’re getting at the data with some combination of a BI tool, ETL tool or ODBC/JDBC drivers — are you leveraging the capabilities in place? Or are you doing the simplest and slowest thing, which is to suck data out en masse and operate on it somewhere else? Getting good answers to those questions is a work-in-progress at best.
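For instance, with MongoDB and the pymongo driver, the contrast between extracting en masse and leveraging the in-database aggregation framework might look roughly like this; the collection, field names, and filter are made up for illustration:

```python
# Two ways to count events per user in MongoDB.
from collections import Counter
from pymongo import MongoClient

events = MongoClient()["analytics"]["events"]

def counts_client_side():
    # Slow and simple: suck the documents out and aggregate client-side.
    return Counter(doc["user_id"] for doc in events.find({}, {"user_id": 1}))

def counts_pushed_down():
    # Better: push filtering and grouping into the aggregation framework.
    pipeline = [
        {"$match": {"type": "click"}},
        {"$group": {"_id": "$user_id", "n": {"$sum": 1}}},
    ]
    return {d["_id"]: d["n"] for d in events.aggregate(pipeline)}
```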
Having established that NoSQL data structures cause problems for BI, let’s turn that around. Is there any way that they actually help? I want to say “NoSQL data often comes in hierarchies, and hierarchies are good for roll-up/drill-down.” But the hierarchies that describe NoSQL data aren’t necessarily the same kinds of hierarchies that are useful for BI aggregation, and I’m indeed skeptical as to how often those two categories overlap.
Hierarchies aside, I do think there are use cases for fundamentally non-tabular BI. For example, consider the following scenario, typically implemented with the help of NoSQL today:
- You have more data — presumably machine-generated — than you can afford to keep.
- So you keep time-sliced aggregates.
- You also keep selective details, namely ones that you identified when they streamed in as being interesting in some way.
Visualizing that properly would be very hard in a traditional tabularly-oriented BI tool. So it could end up with NoSQL-oriented BI tools running over NoSQL data stores. Event series BI done right also seems to be quite non-tabular. That said, I don’t know for sure about the actual data structures used under the best event series BI today.
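For what it’s worth, the keep-aggregates-plus-selective-details pattern in the bullets above is simple to express in code, whatever the underlying store; the thresholds and field names below are invented for illustration:

```python
# As events stream in, roll them up into per-minute buckets and keep full
# detail only for the ones flagged as interesting.
from collections import defaultdict

minute_aggregates = defaultdict(lambda: {"count": 0, "total_latency": 0})
interesting_details = []

def ingest(event):
    bucket = event["ts"] // 60                      # time slice (one minute)
    agg = minute_aggregates[bucket]
    agg["count"] += 1
    agg["total_latency"] += event["latency_ms"]
    if event["latency_ms"] > 1000 or event.get("error"):
        interesting_details.append(event)           # keep this one verbatim
```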
And at that inconclusive point, I’ll stop for now. If you have something to add, please share it in the comments below, or hit me up as per my Contact link above.
Fishbowl team members competed in our third annual Hackathon over the weekend. Our consultants split up into four groups to create innovative solutions, and were given just over 24 hours to complete the task. The goal was to create usable software that could be applied to future client work; some of our most creative products and services originally derived from Hackathon events.
Here’s a summary of what the teams came up with this year:
Team 1: Integrated Fishbowl’s ControlCenter with Oracle WebCenter Portal
Team 2: Integrated the Google Search Appliance with Oracle’s Document Cloud Service
Team 3: Worked to connect WebCenter Portal with Oracle’s new Document Cloud Service
Team 4: Got Fishbowl’s ControlCenter to run in the cloud with Google Compute Engine
At the end of the Hackathon, all participants voted on the best solution. The winning team was #2 (GSA-Doc Cloud Service integration), made up of Kim Negaard, Andy Weaver, and Mike Hill. Although there could only be one winner, overall consensus was that each team came up with something extremely useful that could help solve common problems we’ve run into while working with WebCenter and the GSA. If you’re interested in learning more about any of the team’s programs, feel free to contact us at firstname.lastname@example.org.