After a July visit to DataStax, I wrote
The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.
- Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
- Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.
That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.
- First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
- Second — and more narrowly — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.
My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views:
- Are logical rather than materialized. (At least at this time.)
- Have their permissions and so on set by the DBA.
- Are the sole thing the programmer writes against.
Besides the obvious benefits in development ease and security, MongoDB says that performance can be better as well.* This is of course a new feature, without a lot of adoption at this time. Even so, it seems likely that NoSQL doesn’t obsolete any part of the traditional DBA role.
*I didn’t actually ask what a naive programmer can do to trash performance that views can forestall, but … well, I was once a naive programmer myself.
Two trends that I think could make DBA’s lives even more interesting and challenging in the future are:
- The integration of quick data management into complex analytic processes. Here by “quick data management” I mean, for example, what you do in connection with a complex Hadoop or Spark (set of) job(s). Leaving the data management to a combination of magic and Python scripts doesn’t seem to respect how central data operations are to analytic tasks.
- The integration of data management and streaming. I should probably write about this point separately, but in any case — it seems that streaming stacks will increasingly look like over-caffeinated DBMS.
Bottom line: Database administration skills will be needed for a long time to come.
“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:
- A query layer with multiple ways to query and analyze data.
- A separate data storage layer in which you have a choice of data storage engines …
- … each of which has the same logical (JSON-based) data structure.
When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.
To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further:
- In query, MongoDB mixes multiple paradigms for DML (Data Manipulation Language). The main one is of course JSON.
- When writing, the DML paradigm is unmixed — it’s just JSON.
Further, MongoDB query DML statements can be mixed with analytic functions rooted in Spark.
The main ways to query data in MongoDB, to my knowledge, are:
- Native/JSON. Duh.
- MongoDB has used MySQL as a guide to what SQL coverage they think the market is calling for.
- More to the point, they’re trying to provide enough SQL so that standard business intelligence tools work well (enough) against MongoDB.
- I neglected to ask why this changed from MongoDB’s adamantly non-SQL approach of 2 1/2 years ago.
- MongoDB has been adding text search features for a few releases.
- MongoDB’s newest search feature revolves around “facets”, in the Endeca sense of the term. MongoDB characterizes as a kind of text-oriented GroupBy.
- Graph. MongoDB just introduced a kind of recursive join capability, which is useful for detecting multi-hop relationships (e.g. ancestor/descendant rather than just parent/child). MongoDB declares that the “graph” box is thereby checked.
Three years ago, in an overview of layered and multi-DML architectures, I suggested:
- Layered DBMS and multimodel functionality fit well together.
- Both carried performance costs.
- In most cases, the costs could be affordable.
MongoDB seems to have bought strongly into that view on the query side — which is, of course, exactly the right way for them to have started.
DATE: THURSDAY, DECEMBER 8, 2016
TIME: 10:00 A.M. PST / 1:00 P.M. EST
Join Ryan Companies Vice President of Construction Operations, Mike Ernst, and Fishbowl Solutions Product Manager, Kim Negaard, to learn how Ryan Companies, a leading national construction firm, found knowledge management success with ControlCenter for Oracle WebCenter Content.
In this webinar, you’ll hear first-hand how ControlCenter has been implemented as part of Ryan’s Integrated Project Delivery Process helping them create a robust knowledge management system to promote consistent and effective operations across multiple regional offices. You’ll also learn how ControlCenter’s intuitive, modern user experience enabled Ryan to easily find documents across devices, implement reoccurring review cycles, and control both company-wide and project-specific documents throughout their lifecycle.
This post comes from Fishbowl Solutions’ Senior Solutions Architect, Seth Richter.
More and more organizations need to merge multiple Windchill instances into a single Windchill instance after either acquiring another company or maybe had separate Windchill implementations based on old divisional borders. Whatever the situation, these organizations want to merge into a single Windchill instance to gain efficiencies and/or other benefits.
The first task for a company in this situation is to assemble the right team and develop the right plan. The team will need to understand the budget and begin to document key requirements and its implications. Will they hire an experienced partner like Fishbowl Solutions? If so, we recommend involving the partner early on in the process so they can help navigate the key decisions, avoid pitfalls and develop the best approach for success.
Once you start evaluating the technical process and tools to merge the Windchill instances, the most likely options are:
1. Manual Method
Moving data from one Windchill system to another manually is always an option. This method might be viable if there are small pockets of data to move in an ad-hoc manner. However, this method is extremely time consuming so proceed with caution…if you get halfway through and then move to a following method then you might have hurt the process rather than help it.
2. Third Party Tools (Fishbowl Solutions LinkExtract & LinkLoader tools)
This process can be a cost effective alternative, but it is not as robust as the Windchill Bulk Migrator so your requirements might dictate if this is viable or not.
3. PTC Windchill Bulk Migrator (WBM) tool
This is a powerful, complex tool that works great if you have an experienced team running it. Fishbowl prefers the PTC Windchill Bulk Migrator in many situations because it can complete large merge projects over a weekend and historical versions are also included in the process.
A recent Fishbowl project involved a billion-dollar manufacturing company who had acquired another business and needed to consolidate CAD data from one Windchill system into their own. The project had an aggressive timeline because it needed to be completed before the company’s seasonal rush (and also be prepared for an ERP integration). During the three-month project window, we kicked off the project, executed all of the test migrations and validations, scheduled a ‘go live’ date, and then completed the final production migration over a weekend. Users at the acquired company checked their data into their “old” Windchill system on a Friday and were able check their data out of the main corporate instance on Monday with zero engineer downtime.
Fishbowl Solutions’ PTC/PLM team has completed many Windchill merge projects such as this one. The unique advantage of working with Fishbowl is that we are PTC Software Partners and Windchill programming experts. Often times, when other reseller/consulting partners get stuck waiting on PTC technical support, Fishbowl has been able to problem solve and keep projects on time and on budget.
If your organization is seeking to find an effective and efficient way to bulk load data from one Windchill system to another, our experts at Fishbowl Solutions are able to accomplish this on time and on budget. Urgency is a priority in these circumstances, and we want to ensure you’re able to make this transition process as hassle-free as possible with no downtime. Not sure which tool is the best fit for your Windchill migration project? Check out our website, click the “Contact Us” tab, or reach out to Rick Passolt in our business development department for more information or to request a demo.
Senior Account Executive
Seth Richter is a Senior Solutions Architect at Fishbowl Solutions. Fishbowl Solutions was founded in 1999. Their areas of expertise include Oracle WebCenter, PTC’s Product Development System (PDS), and enterprise search solutions using the Google Search Appliance. Check out our website to learn more about what we do.
The post Approaches to Consider for Your Organization’s Windchill Consolidation Project appeared first on Fishbowl Solutions' C4 Blog.
This post comes from Fishbowl Solutions’ Associate MCAD Consultant, Ben Sawyer.
CAD data migrations are most often seen as a huge burden. They can be lengthy, costly, messy, and a general road block to a successful project. Organizations planning on migrating SolidWorks data to PTC Windchill should consider their options when it comes to the process and tools they utilize to perform the bulk loading.
At Fishbowl Solutions, our belief is that the faster you can load all your data accurately into Windchill, the faster your company can implement critical PLM business processes and realize the results of such initiatives like a Faster NPI, Streamline Change & Configuration Management, Improved Quality, Etc.
There are two typical project scenarios we encounter with these kinds of data migration projects. SolidWorks data resides on a Network File System (NFS) or resides in either PDMWorks or EPDM.
The options for this process and the tools used will be dependent on other factors as well. The most common guiding factors to influence decisions are the quantity of data and the project completion date requirements. Here are typical project scenarios.
Scenario One: Files on a Network File System
There is always an option to manually migrate SolidWorks data into Windchill. However, if an organization has thousands of files from multiple products that need to be imported, this process can be extremely daunting. When loading manually, this process involves bringing files into the Windchill workspace, carefully resolving any missing dependents, errors, duplicates, setting destination folders, revisions, lifecycles and fixing bad metadata. (Those who have tried this approach with large data quantities in the past know the pain of which we are talking about!)
Years ago, Fishbowl developed its LinkLoader tool for SolidWorks as a viable solution to complete a Windchill bulk loading project with speed and accuracy.
Fishbowl’s LinkLoader solution follows a simple workflow to help identify data to be cleansed and mass loaded with accurate metadata. The steps are as follows:
In this initial stage, the user chooses the mass of SolidWorks data to be loaded into Windchill. Since Windchill doesn’t allow duplicate named CAD files in the system, the software quickly identifies these duplicate files. It is up to the user to resolve the duplicate files or remove them from the data loading set.
The validation stage will ensure files are retrievable, attributes/parameters are extracted (for use in later stages), and relationships with other SolidWorks files are examined. LinkLoader captures all actions. The end user will need to resolve any errors or remove the data from the loading set.
Moving toward the bulk loading stage, it is necessary to confirm and/or modify the attribute-mapping file as desired. The only required fields for mapping are lifecycle, revision/version, and the Windchill folder location. End users are able to leverage the attributes/parameter information from the validation as desired, or create their own ‘Instance Based Attribute’ list to map with the files.
4. Bulk Load
Once the mapping stage is completed, the loading process is ready. There is a progress indicator that displays the number of files completed and the percentage done. If there are errors with any files during the upload, it will document these in an ‘Error List Report’ and LinkLoader will simply move on to the next file.
Scenario Two: Files reside in PDMWorks or EPDM
There is also an option to do a manual data migration from one system to another if files reside in PDMWorks or EPDM. However, this process can also be tedious and drawn out as much, or perhaps even more than when the files are on a NFS.
Having files within PDMWorks or EPDM can make the migration process more straightforward and faster than the NFS projects. Fishbowl has created an automated solution tool that extracts the latest versions of each file from the legacy system and immediately prepares it for loading into Windchill. The steps are as follows:
1. Extraction (LinkExtract)
In this initial stage, Fishbowl uses its LinkExtract tool to pull the latest version of all SolidWorks files , determine references, and extract all the attributes for the files as defined in PDMWorks or EPDM.
Before loading the files, it is necessary to confirm and or modify the attribute mapping file as desired. Admins can fully leverage the attributes/parameter information from the Extraction step, or can start from scratch if they find it to be easier. Often the destination Windchill system will have different terminology or states and it is easy to remap those as needed in this step.
3. Bulk Load
Once the mapping stage is completed, the loading process is ready. There is a progress indicator that displays the number of files completed and the percentage done. If there are errors with any files during the upload, it will document these in the Error List Report and LinkLoader will move on to the next file.
Proven Successes with LinkLoader
Many of Fishbowl’s customers have purchased and successfully ran LinkLoader themselves with little to no assistance from Fishbowl. Other customers of ours have utilized our consulting services to complete the migration project on their behalf.
With Fishbowl’s methodology centered on “Customer First”, our focus and support continuously keeps our customers satisfied. This is the same commitment and expertise we will bring to any and every data migration project.
If your organization is looking to consolidate SolidWorks CAD data to Windchill in a timely and effective manner, regardless of the size and scale of the project, our experts at Fishbowl Solutions can get it done.
For example, Fishbowl partnered with a multi-billion dollar medical device company with a short time frame to migrate over 30,000 SolidWorks files from a legacy system into Windchill. Fishbowl’s expert team took initiative and planned the process to meet their tight industry regulations and finish on time and on budget. After the Fishbowl team executed test migrations, the actual production migration process only took a few hours, thus eliminating engineering downtime.
If your organization is seeking the right team and tools to complete a SolidWorks data migration to Windchill, reach out to us at Fishbowl Solutions.
If you’d like more information about Fishbowl’s LinkLoader tool or our other products and services for PTC Windchill and Creo, check out our website, click the “Contact Us” tab, or reach out to Rick Passolt in our business development department.
Senior Account Executive
Ben Sawyer is an Associate MCAD Consultant at Fishbowl Solutions. Fishbowl Solutions was founded in 1999. Their areas of expertise include Oracle WebCenter, PTC’s Product Development System (PDS), and enterprise search solutions using the Google Search Appliance. Check out our website to learn more about what we do.
The post Consider Your Options for SolidWorks to Windchill Data Migrations appeared first on Fishbowl Solutions' C4 Blog.
“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.
1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:
- General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
- Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
- Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
- Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
- Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.
2. In early 2011, I coined the phrase investigative analytics, about which I said three main things:
- It is meant to contrast with “operational analytics”.
- It is meant to conflate “several disciplines, namely”:
- Statistics, data mining, machine learning, and/or predictive analytics.
- The more research-oriented aspects of business intelligence tools.
- Analogous technologies as applied to non-tabular data types such as text or graph.
- A simple definition would be “Seeking (previously unknown) patterns in data.”
Generally, that has held up pretty well, although “exploratory” is the more widely used term. But the investigative/operational dichotomy obscures one key fact, which is the central point of this post: There’s a widespread need for very rapid data investigation.
- Network operations. This is my paradigmatic example.
- Data is zooming all over the place, in many formats and structures, among many kinds of devices. That’s log data, header data and payload data alike. Many kinds of problems can arise …
- … which operators want to diagnose and correct, in as few minutes as possible.
- Interfaces commonly include real-time business intelligence, some drilldown, and a lot of command-line options.
- I’ve written about various specifics, especially in connection with the vendors Splunk and Rocana.
- Security and anti-fraud. Infosec and cyberfraud, to a considerable extent, are just common problems in network operations. Much of the response is necessarily automated — but the bad guys are always trying to outwit your automation. If you think they may have succeeded, you want to figure that out very, very fast.
- Consumer promotion and engagement. Consumer marketers feel a great need for speed. Some of it is even genuine.
- If an online promotion is going badly (or particularly well), they can in theory react almost instantly. So they’d like to know almost instantly, perhaps via BI tools with great drilldown.
- The same is even truer in the case of social media eruptions and the like. Obviously, the tools here are heavily text-oriented.
- Call centers and even physical stores have some of the same aspects as internet consumer operations.
- Consumer internet backends, for e-commerce, publishing, gaming or whatever. These cases combine and in some cases integrate the previous three points. For example, if you get a really absurd-looking business result, that could be your first indication of network malfunctions or automated fraud.
- Industrial technology, such as factory operations, power/gas/water networks, vehicle fleets or oil rigs. Much as in IT networks, these contain a diversity of equipment — each now spewing its own logs — and have multiple possible modes of failure. More often than is the case in IT networks, you can recognize danger signs, then head off failure altogether via preventive maintenance. But when you can’t, it is crucial to identify the causes of failure fast.
- General IoT (Internet of Things) operation. This covers several of the examples above, as well as cases in which you sell a lot of devices, have them “phone home”, and labor to keep that whole multi-owner network working.
- National security. If I told you what I meant by this one, I’d have to … [redacted].
4. And then there’s the investment industry, which obviously needs very rapid analysis. When I was a stock analyst, I could be awakened by a phone call and told news that I would need to explain to 1000s of conference call listeners 20 minutes later. This was >30 years ago. The business moves yet faster today.
The investment industry has invested greatly in high-speed supporting technology for decades. That’s how Mike Bloomberg got so rich founding a vertical market tech business. But investment-oriented technology indeed remains a very vertical sector; little of it get more broadly applied.
I think the reason may be that investing is about guesswork, while other use cases call for more definitive answers. In particular:
- If you’re wrong 49.9% of the time in investing, you might still be a big winner.
- In high-frequency trading, speed is paramount; you have to be faster than your competitors. In speed/accuracy trade-offs, speed wins.
5. Of course, it’s possible to overstate these requirements. As in all real-time discussions, one needs to think hard about:
- How much speed is important in meeting users’ needs.
- How much additional speed, if any, is important in satisfying users’ desires.
But overall, I have little doubt that rapid analytics is a legitimate area for technology advancement and growth.
What I Have Learned as an Oracle WebCenter Consultant in My First Three Months at Fishbowl Solutions
This post comes from Fishbowl Solutions’ Associate Software Consultant, Jake Jehlicka.
Finishing college can be an intimidating experience for many. We leave what we know behind to open the gates to brand new experiences. Those of us fortunate enough to gain immediate employment often find ourselves leaving school and plunging headfirst into an entirely new culture a mere few weeks after turning in our last exam. It is exciting, yet frightening, and what can make-or-break the whole experience is the new environment in which you find yourself if. I consider myself one of the lucky ones.
I have been with Fishbowl Solutions for just over three months, and the experience is unlike any that I had encountered in my previous internships, work, or schooling in Duluth. I moved to the Twin Cities within a week of accepting the position. I was terrified, but my fears were very soon laid to rest. Fishbowl welcomed me with open arms, and I have learned an incredible amount in the short time that I have spent here. Here are just a few of the many aspects of Fishbowl and the skills I’ve gained since working here as an associate software consultant.
One of the things that really jumped out at me right away is how a company’s culture is a critical component to making work enjoyable and sustainable. Right from the outset, I was invited and even encouraged to take part in Fishbowl’s company activities like their summer softball team and happy hours celebrating new employees joining the team. I have seen first-hand how much these activities bring the workplace together in a way that not only makes employees happy, but makes them very approachable when it comes to questions or assistance. The culture here seems to bring everyone together in a way that is unique to Fishbowl, and the work itself sees benefits because of it.
Over the past three months, one thing that I have also learned is the importance of working together. I joined Fishbowl a few weeks after the other trainees in my group, and they were a bit ahead of me in the training program when I started. Not only were they ready and willing to answer any questions that I had, but they also shared their knowledge that they had acquired in such a way that I was able to catch up before our training had completed. Of course the other trainees weren’t the only ones willing to lend their assistance. The team leads have always been there whenever I needed a technical question answered, or even if I just wanted advice in regard to where my own career may be heading.
The team leads also taught me that not every skill is something that can be measured. Through my training, we were exposed to other elements outside of the expected technical skills. We were given guidance when it comes to oft-neglected soft skills such as public speaking and client interactions. These sorts of skills are utterly necessary to learn, regardless of which industry you are in. It is thanks to these that I have already had positive experiences working with our clients.
As a new software consultant at Fishbowl, I have gained a plethora of knowledge about various technologies and applications, especially with Oracle technologies. The training that I received has prepared me for working with technologies like Oracle WebCenter in such a way that I have been able to dive right into projects as soon as I finished. Working with actual systems was nearly a foreign concept after working with small individual projects in college, but I learned enough from my team members to be able to proceed with confidence. The training program at Fishbowl has a very well-defined structure, with an agenda laid out of what I should be working on in any given time period. A large portion of this was working directly with my own installation of the WebCenter content server. I was responsible for setting up, configuring, and creating a custom code for the servers both in a Windows and Linux environment. The training program was very well documented and I always had the tools, information, and assistance that was needed to complete every task.
Once the formal training ended, I was immediately assigned a customer project involving web development using Oracle’s Site Studio Designer. The training had actually covered this application and I was sufficiently prepared to tackle the new endeavor! With that said, every single day at Fishbowl is another day of education; no two projects are identical and there is always something to be learned. For example, I am currently learning Ext JS with Sencha Architect in preparation for a new project!
Although we may never know with absolute certainty what the future has in store for us, I can confidently say that the experiences, skills, knowledge that I have gained while working at Fishbowl Solutions will stay with me for the rest of my life.
Thank you to the entire Fishbowl team for everything they have done for me, and I look forward to growing alongside them!
Jake Jehlicka is an Associate Software Consultant at Fishbowl Solutions. Fishbowl Solutions was founded in 1999. Their areas of expertise include Oracle WebCenter, PTC’s Product Development System (PDS), and enterprise search solutions using the Google Search Appliance. Check out our website to learn more about what we do.
Then felt I like some watcher of the skies
When a new planet swims into his ken
— John Keats, “On First Looking Into Chapman’s Homer”
1. In June I wrote about why anomaly management is hard. Well, not only is it hard to do; it’s hard to talk about as well. One reason, I think, is that it’s hard to define what an anomaly is. And that’s a structural problem, not just a semantic one — if something is well enough understood to be easily described, then how much of an anomaly is it after all?
Artificial intelligence is famously hard to define for similar reasons.
“Anomaly management” and similar terms are not yet in the software marketing mainstream, and may never be. But naming aside, the actual subject matter is important.
2. Anomaly analysis is clearly at the heart of several sectors, including:
- IT operations
- Factory and other physical-plant operations
Each of those areas features one or both of the frameworks:
- Surprises are likely to be bad.
- Coincidences are likely to be suspicious.
So if you want to identify, understand, avert and/or remediate bad stuff, data anomalies are the first place to look.
3. The “insights” promised by many analytics vendors — especially those who sell to marketing departments — are also often heralded by anomalies. Already in the 1970s, Walmart observed that red clothing sold particularly well in Omaha, while orange flew off the shelves in Syracuse. And so, in large college towns, they stocked their stores to the gills with clothing in the colors of the local football team. They also noticed that fancy dresses for little girls sold especially well in Hispanic communities … specifically for girls at the age of First Communion.
4. The examples in the previous point may be characterized as noteworthy correlations that surely are reflecting actual causality. (The beer/diapers story would be another example, if only it were true.) Formally, the same is probably true of most actionable anomalies. So “anomalies” are fairly similar to — or at least overlap heavily with — “statistically surprising observations”.
And I do mean “statistically”. As per my Keats quote above, we have a classical model of sudden-shock discovery — an astronomer finding a new planet, a radar operator seeing a blip on a screen, etc. But Keats’ poem is 200 years old this month. In this century, there’s a lot more number-crunching involved.
Please note: It is certainly not the case that anomalies are necessarily found via statistical techniques. But however they’re actually found, they would at least in theory score as positives via various statistical tests.
5. There are quite a few steps to the anomaly-surfacing process, including but not limited to:
- Collecting the raw data in a timely manner.
- Identifying candidate signals (and differentiating them from noise).
- Communicating surprising signals to the most eager consumers (and letting them do their own analysis).
- Giving more tightly-curated information to a broader audience.
Hence many different kinds of vendor can have roles to play.
6. One vendor that has influenced my thinking about data anomalies is Nestlogic, an early-stage start-up with which I’m heavily involved. Here “heavily involved” includes:
- I own more stock in Nestlogic than I have in any other company of which I wasn’t the principal founder.
- I’m in close contact with founder/CEO David Gruzman.
- I’ve personally written much of Nestlogic’s website content.
Nestlogic’s claims include:
- For machine-generated data, anomalies are likely to be found in data segments, not individual records. (Here a “segment” might be all the data coming from a particular set of sources in a particular period of time.)
- The more general your approach to anomaly detection, the better, for at least three reasons:
- In adversarial use cases, the hacker/fraudster/terrorist/whatever might deliberately deviate from previous patterns, so as to evade detection by previously-established filters.
- When there are multiple things to discover, one anomaly can mask another, until it is detected and adjusted for.
- (This point isn’t specific to anomaly management) More general tools can mean that an enterprise has fewer different new tools to adopt.
- Anomalies boil down to surprising data profiles, so anomaly detection bears a slight resemblance to the data profiling approaches used in data quality, data integration and query optimization.
- Different anomaly management users need very different kinds of UI. Less technical ones may want clear, simple alerts, with a minimum of false positives. Others may use anomaly management as a jumping-off point for investigative analytics and/or human real-time operational control.
I find these claims persuasive enough to help Nestlogic with its marketing and fund-raising, and to cite them in my post here. Still, please understand that they are Nestlogic’s and David’s assertions, not my own.
1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:
- The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
- Software as a service, aka SaaS.
- Co-location in off-premises data centers, aka colo.
- On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.
Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.
This is a good example of Monash’s Laws of Commercial Semantics.
2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.
This fact now seems to be widely understood.
3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.
This is frequently a good reason for new – i.e. “greenfield” – apps to run in the cloud.
4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.
In connection to that point, it might be interesting to note:
- In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
- Other big users were hospitals and stockbrokers.
- The US intelligence agencies are building out their own shared, dedicated cloud.
5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.
In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.
6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.
7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:
- Does your application need to run close to your customers’ largest databases?
- Do your customers still avoid the public cloud?
If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.
8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:
- Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
- Departments don’t necessarily care about issues that give central IT pause.
- Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.
I discussed some of this in my recent post on vendor lock-in.
9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.
Those concerns seem to have been largely alleviated.
10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.
11. If you think about system requirements:
- There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
- Server-side computing, however, is more homogenous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.
I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.
So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.
Otherwise, clusters will probably be organized by database, in the most expansive sense of term. For example, there could be separate clusters for:
- Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
- Relational data warehousing, whether in an analytic RDBMS or otherwise.
- Log files, perhaps managed in Hadoop or Splunk.
- Your website and other internet back-ends, perhaps running over NoSQL data stores.
- Text documents managed by some kind of search engine.
- Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)
Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.
I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:
- Yes, interactive computer response is crucial.
- Into the 1980s, many apps were batch-only. Demand for such apps dried up.
- Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
- Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
- What matters in most cases is human users’ perceptions of speed.
- Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.
A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:
- It respects the obvious point that different use cases require different levels of data freshness.
- It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
- It covers cases of both human and automated decision-making.
Straightforward applications of this principle include:
- In “buying race” situations such as high-frequency trading, data needs to be as fresh as the other guy’s, and preferably even fresher.
- Supply-chain systems generally need data that’s fresh to within a few hours; in some cases, sub-hour freshness is needed.
- That’s a good standard for many desktop business intelligence scenarios as well.
- Equipment-monitoring systems’ need for data freshness depends on how quickly catastrophic or cascading failures can occur or be averted.
- Different specific cases call for wildly different levels of data freshness.
- When equipment is well-instrumented with sensors, freshness requirements can be easy to meet.
E-commerce and other internet interaction scenarios can be more complicated, but it seems safe to say:
- Recommenders/personalizers should take into account information from the current session.
- Try very hard to give customers correct information about merchandise availability or pricing.
In meeting freshness requirements, multiple technical challenges can come into play.
- Traditional batch aggregation is too slow for some analytic needs. That’s a core reason for having an analytic RDBMS.
- Traditional data integration/movement pipelines can also be too slow. That’s a basis for short-request-capable data stores to also capture some analytic workloads. E.g., this is central to MemSQL’s pitch, and to some NoSQL applications as well.
- Scoring models at interactive speeds is often easy. Retraining them quickly is much harder, and at this point only rarely done.
- OLTP (OnLine Transaction Processing) guarantees adequate data freshness …
- … except in scenarios where the transactions themselves are too slow. Questionably-consistent systems — commonly NoSQL — can usually meet performance requirements, but might have issues with the freshness of accurate
- Older generations of streaming technology disappointed. The current generation is still maturing.
Based on all that, what technology investments should you be making, in order to meet “real-time” needs? My answers start:
- Customer communications, online or telephonic as the case may be, should be based on accurate data. In particular:
- If your OLTP data is somehow siloed away from your phone support data, fix that immediately, if not sooner. (Fixing it 5-15 years ago would be ideal.)
- If your eventual consistency is so eventual that customers notice, fix it ASAP.
- If you invest in predictive analytics/machine learning to support your recommenders/personalizers, then your models should at least be scored on fresh data.
- If your models don’t support that, reformulate them.
- If your data pipeline doesn’t support that, rebuild it.
- Actual high-speed retraining of models isn’t an immediate need. But if you’re going to have to transition to that anyway, consider doing do early and getting it over with.
- Your BI should have great drilldown and exploration. Find the most active users of such functionality in your enterprise, even if — especially if! — they built some kind of departmental analytic system outside the enterprise mainstream. Ask them what, if anything, they need that they don’t have. Respond accordingly.
- Whatever expensive and complex equipment you have, slather it with sensors. Spend a bit of research effort on seeing whether the resulting sensor logs can be made useful.
- Please note that this applies both to vehicles and to fixed objects (e.g. buildings, pipelines) as well as traditional industrial machinery.
- It also applies to any products you make which draw electric power.
So yes — I think “real-time” has finally become pretty real.
I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:
- Many of the independent vendors were swooped up by acquisition.
- None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
- Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
- Impala and other Hadoop-based alternatives are technology options.
- Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.
Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.
My answer, in a nutshell, is:
Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud – are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.
To see why, let’s start by asking: “With what do you want to integrate your analytic SQL processing?”
- If you want to integrate with relational OLTP (OnLine Transaction Processing), your OLTP RDBMS vendor surely has a story worth listening to. Memory-centric offerings MemSQL and SAP HANA are also pitched that way.
- If you want to integrate with your SAP apps in particular, HANA is the obvious choice.
- If you want to integrate with other work you do in the Amazon cloud, Redshift is worth a look.
Beyond those cases, a big issue is integration with … well, with data integration. Analytic RDBMS got a lot of their workloads from ELT or ETLT, which stand for Extract/(Transform)/Load/Transform. I.e., you’d load data into an efficient analytic RDBMS and then do your transformations, vs. the “traditional” (for about 10-15 years of tradition) approach of doing your transformations in your ETL (Extract/Transform/Load) engine. But in bigger installations, Hadoop often snatches away that part of the workload, even if the rest of the processing remains on a dedicated analytic RDBMS platform such as Teradata’s.
And suppose you want to integrate with more advanced analytics — e.g. statistics, other predictive modeling/machine learning, or graph analytics? Well — and this both surprised and disappointed me — analytic platforms in the RDBMS sense didn’t work out very well. Early Hadoop had its own problems too. But Spark is doing just fine, and seems poised to win.
My technical observations around these trends include:
- Advanced analytics commonly require flexible, iterative processing.
- Spark is much better at such processing than earlier Hadoop …
- … which in turn is better than anything that’s been built into an analytic RDBMS.
- Open source/open standards and the associated skill sets come into play too. Highly vendor-proprietary DBMS-tied analytic stacks don’t have enough advantages over open ones.
- Notwithstanding the foregoing, RDBMS-based platforms can still win if a big part of the task lies in fancy SQL.
And finally, if a task is “partly relational”, then Hadoop or Spark often fit both parts.
- They don’t force you into using SQL or everything, nor into putting all your data into relational schemas, and that flexibility can be a huge relief.
- Even so, almost everybody who uses those uses some SQL, at least for initial data extraction. Those systems are also plenty good enough at SQL for joining data to reference tables, and all that other SQL stuff you’d never want to give up.
But suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the trade-offs:
- They generally still provide the best performance or performance/concurrency combination, for the cost, although YMMV (Your Mileage May Vary).
- One has to load the data in and immediately structure it relationally, which can be an annoying contrast to Hadoop alternatives (data base administration can be just-in-time) or to OLTP integration (less or no re-loading).
- Other integrations, as noted above, can also be weak.
Suppose all that is a good match for your situation. Then you should surely continue using an analytic RDBMS, if you already have one, and perhaps even acquire one if you don’t. But for many other use cases, analytic RDBMS are no longer the best way to go.
Finally, how does the cloud affect all this? Mainly, it brings one more analytic RDBMS competitor into the mix, namely Amazon Redshift. Redshift is a simple system for doing analytic SQL over data that was in or headed to the Amazon cloud anyway. It seems to be quite successful.
Bottom line: Analytic RDBMS are no longer in their youthful prime, but they are healthy contributors in middle age. Mainly, they’re still best-of-breed for supporting demanding BI.
data Artisans and Flink basics start:
- Flink is an Apache project sponsored by the Berlin-based company data Artisans.
- Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
- Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.
Like many open source projects, Flink seems to have been partly inspired by a Google paper.
To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example:
- The first line of Flink code dates back to 2010.
- data Artisans and the Flink open source project both started in 2014.
- When I met him in late June, Kostas told me that Data Artisans had raised $7 million and had 15 employees.
- Flink’s current SQL support is very minor.
Per Kostas, about half of Flink committers are at Data Artisans; others are at Cloudera, Hortonworks, Confluent, Intel, at least one production user, and some universities. Kostas provided about 5 examples of production Flink users, plus a couple of very big names that were sort-of-users (one was using a forked version of Flink, while another is becoming a user “soon”).
The technical story at data Artisans/Flink revolves around the assertion “We have the right architecture for streaming.” If I understood data Artisans co-founder Stephan Ewen correctly on a later call, the two key principles in support of that seem to be:
- The key is to keep data “transport” running smoothly without interruptions, delays or bottlenecks, where the relevant sense of “transport” is movement from one operator/operation to the next.
- In this case, the Flink folks feel that modularity supports efficiency.
- Anything that relates to consistency/recovery is kept almost entirely separate from basic processing, with minimal overhead and nothing that resembles a lock.
- Windowing and so on operate separately from basic “transport” as well.
- The core idea is that special markers — currently in the ~20 byte range in size — are injected into the streams. When the marker gets to an operator, the operator snapshots the then-current state of its part of the stream.
- Should recovery ever be needed, consistency is achieved by assembling all the snapshots corresponding to a single marker, and replaying any processing that happened after those snapshots were taken.
- Actually, this is oversimplified, in that it assumes there’s only a single input stream.
- Alot of Flink’s cleverness, I gather, is involved in assembling a consistent snapshot despite the realities of multiple input streams.
The upshot, Flink partisans believe, is to match the high throughput of Spark Streaming while also matching the low latency of Storm.
The Flink folks naturally have a rich set of opinions about streaming. Besides the points already noted, these include:
- “Exactly once” semantics are best in almost all use cases, as opposed to “at least once”, or to turning off fault tolerance altogether. (Exceptions might arise in extreme performance scenarios, or because of legacy systems’ expectations.)
- Repetitive, scheduled batch jobs are often “streaming processes in disguise”. Besides any latency benefits, reimplementing them using streaming technology might simplify certain issues that can occur around the boundaries of batch windows. (The phrase “continuous processing” could reasonably be used here.)
We discussed joins quite a bit, but this was before I realized that Flink didn’t have much SQL support. Let’s just say they sounded rather primitive even when I assumed they were done via SQL.
Our discussion of windowing was more upbeat. Flink supports windows based either on timestamps or data arrival time, and these can be combined as needed. Stephan thinks this flexibility is important.
As for Flink use cases, they’re about what you’d expect:
- Plenty of data transformation, because that’s how all these systems start out. Indeed, the earliest Flink adoption was for batch transformation.
- Plenty of stream processing.
But Flink doesn’t have all the capabilities one would want for the kinds of investigative analytics commonly done on Spark.
Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:
- What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
- Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
- Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.
Ali also walked me through customer use cases and adoption in wonderful detail. In general:
- A large majority of Databricks customers have machine learning use cases.
- Predicting and preventing user/customer churn is a huge issue across multiple market sectors.
The story on those sectors, per Ali, is:
- First, Databricks penetrated ad-tech, for use cases such as ad selection.
- Databricks’ second market was “mass media”.
- Disclosed examples include Viacom and NBC/Universal.
- There are “many” specific use cases. Personalization is a big one.
- Conviva-style video operations optimization is a use case for several customers, naturally including Conviva. (Reminder: Conviva was Ion Stoica’s previous company.)
- Health care came third.
- Use cases here seem to be concentrated on a variety of approaches to predict patient outcomes.
- Analytic techniques often combine machine learning with traditional statistics.
- Security is a major requirement in this sector; fortunately, Databricks believes it excels at that.
- Next came what he calls “industrial IT”. This group includes cool examples such as:
- Finding oil.
- Predictive maintenance of wind turbines.
- Predicting weather based on sensor data.
- Finally (for now), there’s financial services. Of course, “financial services” comprises a variety of quite different business segments. Example use cases include:
- Credit card marketing.
- Investment analysis (based on expensive third-party data sets that are already in the cloud).
At an unspecified place in the timeline is national security, for a use case very similar to anti-fraud — identifying communities of bad people. Graph analytics plays a big role here.
And finally, of course we discussed some technical stuff, in philosophy, futures and usage as the case may be. In particular, Ali stressed that Spark 2.0 is the first that “breaks”/changes the APIs; hence the release number. It is now the case that:
- There’s a single API for batch and streaming alike, and for machine learning “too”. This is DataFrames/DataSets. In this API …
- … everything is a table. That said:
- Tables can be nested.
- Tables can be infinitely large, in which case you’re doing streaming.
- Based on this, Ali thinks Spark 2.0 is now really a streaming engine.
Other tidbits included:
- Ali said that every Databricks customer uses SQL. No exceptions.
- Indeed, a “number” of customers are using business intelligence tools. Therefore …
- … Databricks is licensing connector technology from Simba.
- They’re working on model serving, with a REST API, rather than just model building. This was demoed at the recent Spark Summit, but is still in the “nascent” stage.
- Ali insists that every streaming system with good performance does some kind of micro-batching under the hood. But the Spark programmers no longer need to take that directly into account. (In earlier versions, programmatic window sizes needed to be integer multiples of the low-level system’s chosen interval.)
- In the future, when Databricks runs on more than just the Amazon cloud, Databricks customers will of course have cloud-to-cloud portability.
I visited DataStax on my recent trip. That was a tipping point leading to my recent discussions of NoSQL DBAs and misplaced fear of vendor lock-in. But of course I also learned some things about DataStax and Cassandra themselves.
On the customer side:
- DataStax customers still overwhelmingly use Cassandra for internet back-ends — web, mobile or otherwise as the case might be.
- This includes — and “includes” might be understating the point — traditional enterprises worried about competition from internet-only ventures.
Customers in large numbers want cloud capabilities, as a potential future if not a current need.
One customer example was a large retailer, who in the past was awful at providing accurate inventory information online, but now uses Cassandra for that. DataStax brags that its queries come back in 20 milliseconds, but that strikes me as a bit beside the point; what really matters is that data accuracy has gone from “batch” to some version of real-time. Also, Microsoft is a DataStax customer, using Cassandra (and Spark) for the Office 365 backend, or at least for the associated analytics.
Per Patrick McFadin, the four biggest things in DataStax Enterprise 5 are:
- Graph capabilities.
- Cassandra 3.0, which includes a complete storage engine rewrite.
- Tiered storage/ILM (Information Lifecycle Management).
- Policy-based replication.
Some of that terminology is mine, but perhaps my clients at DataStax will adopt it too.
We didn’t go into as much technical detail as I ordinarily might, but a few notes on that tiered storage/ILM bit are:
- It’s a way to have some storage that’s more expensive (e.g. flash) and some that’s cheaper (e.g. spinning disk). Duh.
- Since Cassandra has a strong time-series orientation, it’s easy to imagine how those policies might be specified.
- Technologically, this is tightly integrated with Cassandra’s compaction strategy.
DataStax Enterprise 5 also introduced policy-based replication features, not all of which are in open source Cassandra. Data sovereignty/geo-compliance is improved, which is of particular importance in financial services. There’s also hub/spoke replication now, which seems to be of particular value in intermittently-connected use cases. DataStax said the motivating use case in that area was oilfield operations, where presumably there are Cassandra-capable servers at all ends of the wide-area network.
During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:
- Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
- A lot of this is via a focus on simplified APIs, based on
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
- There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
- SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-S queries.
The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:
- Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
- Databricks has been working on security, and even on the associated certifications.
Two of the technical initiatives Reynold told me about seemed particularly cool. One, on the machine learning side, was a focus on training models online as new data streams in. In most cases this seems to require new algorithms for old model types, with a core idea being that the algorithm does a mini gradient descent for each new data point.
The other cool idea fits the trend of alternatives to the “lambda architecture”. Under the name “structured streaming”, which seems to be a replacement for “DStreaming”, the idea is to do set-based SQL processing even though membership of the set changes over time. Result sets are extracted on a snapshot basis; you can keep either all the results from each snapshot query or just the deltas.
Despite all this, there’s some non-trivial dissatisfaction with Spark, fair or otherwise.
- Some of the reason is that SparkSQL is too immature to be great.
- Some is annoyance that Databricks isn’t putting everything it has into open source.
- Some is that everything has its architectural trade-offs.
To the last point, I raised one of the biggest specifics with Reynold, namely Spark’s lack of a strong built-in data persistence capability. Reynold’s answer was that they’re always working to speed up reading and writing from other forms of persistent storage. E.g., he cited a figure of ~100 million rows/core/second decoded from Parquet.
I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:
- Spark is indeed the replacement for Hadoop MapReduce.
- Spark is becoming the default platform for machine learning.
- SparkSQL (nee’ Shark) is puttering along predictably.
- Databricks reports good success in its core business of cloud-based machine learning support.
- Spark Streaming has strong adoption, but its position is at risk.
- Databricks, the original authority on Spark, is not keeping a tight grip on that role.
I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.
Spark is the replacement for Hadoop MapReduce.
This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.
The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy:
- Data-transformation-only use cases are important, but they don’t dominate.
- Most other use cases typically have a data transformation element as well …
- … which has to be started before any other work can be done.
And so it seems likely that, at least for as long as Spark is growing rapidly, data transformation will appear to be the biggest Spark use case.
Spark is becoming the default platform for machine learning.
Largely, this is a corollary of:
- The previous point.
- The fact that Spark was originally designed with machine learning as its principal use case.
To do machine learning you need two things in your software:
- A collection of algorithms. Spark, I gather, is one of numerous good alternatives there.
- Support for machine learning workflows. That’s where Spark evidently stands alone.
And thus I have conversations like:
- “Are you doing anything with Spark?”
- “We’ve gotten more serious about machine learning, so yes.”
SparkSQL (nee’ Shark) is puttering along.
SparkSQL is pretty much following the Hive trajectory.
- Useful from Day One as an adjunct to other kinds of processing.
- A tease and occasionally useful as a SQL engine for its own sake, but really not very good, pending years to mature.
Databricks reports good success in its core business of cloud-based machine learning support.
Databricks, to an even greater extent than I previously realized, is focused on its cloud business, for which there are well over 200 paying customers. Notes on that include:
- As you might expect based on my comments above, the majority of usage is for data transformation, but a lot of that is in anticipation of doing machine learning/predictive modeling in the near future.
- Databricks customers typically already have their data in the Amazon cloud.
- Naturally, a lot of Databricks customers are internet companies — ad tech startups and the like. Databricks also reports “strong” traction in the segments:
- Financial services (especially but not only insurance)
- Health care/pharma
- The main languages Databricks customers use are R and Python. Ion said that Python was used more on the West Coast, while R was used more in the East.
Databricks’ core marketing concept seems to be “just-in-time data platform”. I don’t know why they picked that, as opposed to something that emphasizes Spark’s flexibility and functionality.
Spark Streaming’s long-term success is not assured.
To a first approximation, things look good for Spark Streaming.
- Spark Streaming is definitely the leading companion to Kafka, and perhaps also to cloud equivalents (e.g. Amazon Kinesis).
- The “traditional” alternatives of Storm and Samza are pretty much done.
- Newer alternatives from Twitter, Confluent and Flink aren’t yet established.
- Cloudera is a big fan of Spark Streaming.
- Even if Spark Streaming were to generally decline, it might keep substantial “good enough” usage, analogously to Hive and SparkSQL.
- Cool new Spark Streaming technology is coming out.
But I’m also hearing rumbles and grumbles about Spark Streaming. What’s more, we know that Spark Streaming wasn’t a core part of Spark’s design; the use case just happened to emerge. Demanding streaming use cases typically involve a lot of short-request inserts (or updates/upserts/whatever). And if you were designing a system to handle those … would it really be based on Spark?
Databricks is not keeping a tight grip on Spark leadership.
- Databricks’ main business, as noted above, is its cloud service. That seems to be going well.
- Databricks’ secondary business is licensing stuff to Spark distributors. That doesn’t seem to amount to much; it’s too easy to go straight to the Apache distribution and bypass Databricks. No worries; this never seemed it would be a big revenue opportunity for Databricks.
At the moment, Databricks is pretty clearly the general leader of Spark. Indeed:
- If you want the story on where Spark is going, you do what I did — you ask Databricks.
- Similarly, if you’re thinking of pushing the boundaries on Spark use, and you have access to the Databricks folks, that’s who you’ll probably talk to.
- Databricks employs ~1/3 of Spark committers.
- Databricks organizes the Spark Summit.
But overall, Databricks doesn’t seem to care much about keeping Spark leadership. Its marketing efforts in that respect are minimal. Word-of-mouth buzz paints a similar picture. My direct relationship with the company gives the same impression. Oh, I’m sure Databricks would like to remain the Spark leader. But it doesn’t seem to devote much energy toward keeping the role.
Starting with my introduction to Spark, previous overview posts include those in:
I learned some newish terms on my recent trip. They’re meant to solve the problem that “data scientists” used to be folks with remarkably broad skill sets, few of whom actually existed in ideal form. So instead now it is increasingly said that:
- “Data engineers” can code, run clusters, and so on, in support of what’s always been called “data science”. Their knowledge of the math of machine learning/predictive modeling and so on may, however, be limited.
- “Data scientists” can write and run scripts on single nodes; anything more on the engineering side might strain them. But they have no-apologies skills in the areas of modeling/machine learning.
- I raised concerns about the “data science” term 4 years ago.
In this video blog, Fishbowl Solutions’ Technical Project Manager, Justin Ames, and Marketing Team Lead, Jason Lamon, discuss Fishbowl’s Agile (like) approach to managing Oracle WebCenter portal projects. Justin shares an overview of what Agile and Scrum mean, how it is applied to portal development, and the customer benefits of applying Agile to an overall portal project.
“This is my first large project being managed with an Agile-like approach, and it has made a believer out of me. The Sprints and Scrum meetings led by the Fishbowl Solutions team enable us to focus on producing working portal features that can be quickly validated. And because it is an iterative build process, we can quickly make changes. This has lead to the desired functionality we are looking for within our new employee portal based on Oracle WebCenter.”
Staff VP, Compensation and HRIS
Large Health Insurance Provider
The post Fishbowl’s Agile (like) Approach to Oracle WebCenter Portal Projects appeared first on Fishbowl Solutions' C4 Blog.
Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.
1. The most basic form of lock-in is:
- You do application development for a target set of platform technologies.
- Your applications can’t run without those platforms underneath.
- Hence, you’re locked into those platforms.
2. Enterprise vendor standardization is closely associated with lock-in. The basic idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:
- That simplifies your environment, requiring less integration and interoperability.
- That simplifies your staffing; the same skill sets apply to multiple needs and projects.
- That simplifies your vendor support relationships; there’s “one throat to choke”.
- That simplifies your price negotiation.
3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:
- For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
- A few years later, the price is negotiated, based on current levels of usage.
Thus, doing an additional project using ELAed products may appear low-cost.
- Incremental license and maintenance fees may be zero in the short-term.
- Incremental personnel costs may be controlled because the needed skills are already in-house.
Often those appearances are substantially correct. That’s a big reason why incumbent software is difficult to supplant unless the upstart substitute is superior in fundamental and important ways.
4. Subscriptions are closely associated with lock-in.
- Most obviously, the traditional software industry gets its profits from high-margin support/maintenance services.
- Cloud lock-in has rapidly become a big deal.
- The open source vendors meeting lock-in resistance, noted above, have subscription business models.
Much of why customers care about lock-in is the subscription costs it’s likely to commit them to.
5. Also related to lock-in are thick single-vendor technology stacks. If you run Oracle applications, you’re going to run the Oracle DBMS too. And if you run that, you’re likely to run other Oracle software, and perhaps use Exadata hardware as well. The cloud ==> lock-in truism is an example of this point as well.
6. There’s a lot of truth to the generality that central IT cares about overall technology architecture, while line-of-business departments just want to get the job done. This causes departments to both:
- Oppose standardization.
- Like thick technology stacks.
Thus, departmental influence on IT both encourages and discourages lock-in.
7. IBM is all about lock-in. IBM’s support for Linux, Eclipse and so on don’t really contradict that. IBM’s business models is to squeeze serve its still-large number of strongly loyal customers as well as it can.
8. Microsoft’s business model over the decades has also greatly depended on lock-in.
- Indeed, it exploited Windows/Office lock-in so vigorously as to incur substantial anti-trust difficulties.
- Server-side Windows tends to be involved in thick stacks — DBMS, middleware, business intelligence, SharePoint and more. Many customers (smaller enterprises or in some cases departments) are firmly locked into these stacks.
- Microsoft is making a strong cloud push with Azure, which inherently involves lock-in.
Yet sometimes, Microsoft is more free and open.
- Office for Macintosh allowed the Mac to be a viable Windows competitor. (And Microsoft was well-paid for that, generating comparable revenue for per Mac to what it got for each Windows PC.)
- Visual Studio is useful for writing apps to run against multiple DBMS.
- Just recently, Microsoft SQL Server was ported to Linux.
9. SAP applications run over several different DBMS, including its own cheap MaxDB. That counteracts potential DBMS lock-in. But some of its newer apps are HANA-specific. That, of course, has the opposite effect.
10. And with that as background, we can finally get what led me to finally write this post. Multiple clients have complaints that may be paraphrased as:
- Customers are locked into expensive traditional DBMS such as Oracle.
- Yet they’re so afraid of lock-in now that they don’t want to pay for our vendor-supplied versions of open source database technologies; they prefer to roll their own.
- Further confusing matters, they also are happy to use cloud technologies, including the associated database technologies (e.g. . Redshift or other Amazon offerings), creating whole new stacks of lock-in.
So open source vendors of NoSQL data managers and similar technologies felt like they were the only kind of vendor suffering from fear of lock-in.
I agree with them that enterprises who feel this way are getting it wrong. Indeed:
- The management of even NoSQL DBMS is a big issue, and help in that area has high cash value for customers.
- Serious users need support.
- Support and management tools happen to be synergistic with each other.
This is the value proposition that propelled Cloudera. It’s also a strong reason to give money to whichever MongoDB, DataStax, Neo Technology et al. sponsors open source technology that you use.
General disclosure: My fingerprints have been on this industry strategy since before the term “NoSQL” was coined. It’s been an aspect of many different consulting relationships.
Some enterprises push back, logically or emotionally as the case may be, by observing that the best internet companies — e.g., Facebook — are allergic to paying for software, even open source. My refutations of that argument include:
- Facebook has more and better engineers than you do.
- Facebook has a lot more servers than you do, and would presumably face much higher prices than you would if you each chose to forgo the in-house alternative.
- Facebook pays for open source software in a different way than through subscription fees — it invents and enhances it. Multiple important projects have originated at Facebook, and it contributes to many others. Are you in a position to do the same thing?
And finally — most of Facebook’s users get its service for free. (Advertisers are the ones who pay cash; all others just pay in attention to the ads.) So if getting its software for free actually does screw up its SLAs (Service Level Agreements) — well, free generally comes with poorer SLAs than paid. But if you’re in the business of serving paying customers, then you might want to have paying-customer kinds of SLAs, even on the parts of your technology — e.g. websites urging people to do business with you — that you provide for free yourself.
- I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
- The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
- I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
- I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.
A running list of recent posts is:
- As a companion to this post, I’m publishing a very long one on vendor lock-in.
Subjects I’d like to add to that list include:
- Spark (it’s prospering).
- Databricks (ditto, appearances to the contrary notwithstanding).
- Flink (it’s interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need).
- DataStax, MemSQL, Zoomdata, and Neo Technology (also prospering).
- Cloudera (multiple topics, as usual).
- Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
- Enterprises’ inconsistent views about vendor lock-in.
- Microsoft’s reinvention (it feels real).
- Metadata (it’s ever more of a thing).
- Machine learning (it’s going to be a big portion of my research going forward).
- Transitions to the cloud — this subject affects almost everything else.
I’ll edit these lists as appropriate when further posts go up.
Let’s cover some other subjects right here.
1. While Kafka is widely agreed to be the universal delivery mechanism for streams, the landscape for companion technologies is confused.
- Back in January I wrote that the leaders were mainly Spark Streaming, followed by Storm.
- I overlooked the fact that Storm creator Twitter was replacing Storm with something called Heron.*
- If there’s any buzz about Confluent’s replacement for distant-third-place contender Samza, I missed it.
- Opinions about Spark Streaming are mixed. Some folks want to get away from it; others like it just fine.
And of course Flink is hoping to blow everybody else in the space away.
*But that kind of thing is not necessarily a death knell. Cassandra inventor Facebook soon replaced Cassandra with HBase, yet Cassandra is doing just fine.
As for the “lambda architecture” — that has always felt like a kludge, and various outfits are trying to obsolete it in various ways. As just one example, Cloudera described that to me during my visit as one of the main points of Kudu.
2. The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.
- Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
- Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.
I had a moment of clarity on this point while visiting my clients at DataStax, and discussing their goal — shared by numerous companies — of being properly appreciated for the management tools they provide. In the room with me were CEO Billy Bosworth and chief evangelist Patrick McFadin — both of whom are former DBAs themselves.
3. I visited ClearStory, and Sharmila Mulligan showed me her actual sales database, as well as telling me some things about funding. The details are all confidential, but ClearStory is clearly doing better than rumor might suggest.
4. Platfora insisted on meeting circumstances in which it was inconvenient for me to take notes. So I have no details to share. But they sounded happy.
5. Pneubotics — with a cool new video on its home page — has found its first excellent product/market fit. Traditional heavy metallic robots are great at painting and related tasks when they can remain stationary, or move on rigid metal rails. Neither of those options works well, however, for large curved or irregular surfaces as might be found in the aerospace industry. Customer success for the leading soft robot company has ensued.
This all seems pretty close to the inspection/maintenance/repair area that I previously suggested could be a good soft robotics fit.