Real-time Data Warehousing: 5 Challenges & Solutions

From: Justin Langseth <realtime_at_enormic.com>
Date: Sun, 10 Feb 2002 08:20:41 -0500
Message-ID: <8su98.3150$ps.336740_at_news.direcpc.com>


-> Attached is the first of 5 sections of a whitepaper
-> that we are writing to address the many issues
-> associated with Real-time Data Warehousing. Please send
-> any questions, comments, or suggestions to realtime_at_enormic.com.

Real-time Data Warehousing: Challenges & Solutions
Justin Langseth, CTO, Enormic LLC
(c) Copyright 2002 Enormic LLC

Today's Real-time Requirement

Traditionally, data warehouses do not contain today's data. They are usually loaded with data from operational systems at most weekly, or in some cases nightly, and are in any case a window on the past. The fast pace of business today is quickly making these historical systems less valuable for the issues facing managers and government officials in the real world. Morning sales on the east coast will affect how stores are stocked on the west coast. Airlines and government agencies need to be able to analyze the most current information when trying to detect suspicious groups of passengers or potentially illegal activity. Fast-paced changes in the financial markets may make the personalized suggestions on a stockbroker's website obsolete by the time they are viewed.

As today's decisions in the business world become more real-time, the systems that support those decisions need to keep up. It is only natural that Data Warehouse, Business Intelligence, Decision Support, and OLAP systems quickly begin to incorporate real-time data.

Data warehouses and business intelligence applications are designed to answer exactly the types of questions that users would like to pose against real-time data. They are able to analyze vast quantities of data over time, to determine what is the best offer to make to a customer, or to identify potentially fraudulent, illegal, or suspicious activity. Ad-hoc reporting is made easy using today's advanced OLAP tools. All that needs to be done is to make these existing systems and applications work off real-time data.

This white paper examines the challenges of adding real-time data to these systems, and presents several approaches to making real-time warehousing a reality today.

Challenge #1: Enabling Real-time ETL

One of the most difficult parts of building any data warehouse is the process of extracting, transforming, cleansing, and loading the data from the source system. Performing ETL of data in real-time introduces additional challenges. Almost all ETL tools and systems, whether based on off-the-shelf products or custom-coded, operate in a batch mode. They assume that the data becomes available as some sort of extract file on a certain schedule, usually nightly, weekly, or monthly. Then the system transforms and cleanses the data and loads it into the data warehouse.

This process typically involves downtime of the data warehouse, so no users are able to access it while the load takes place. Since these loads are usually performed late at night, this scheduled downtime typically does not inconvenience many users.

When loading data continuously in real-time, there can't be any system downtime. The heaviest periods of data warehouse usage may very well coincide with the peak periods of incoming data. The requirements for continuous updates with no warehouse downtime are generally inconsistent with traditional ETL tools and systems. Fortunately, there are new tools on the market that specialize in real-time ETL and data loading. There are also ways of modifying existing ETL systems to perform real-time or near real-time warehouse loading. Some of these tools and techniques are described below.

Solution 1a: "Near Real-time" ETL

The cheapest and easiest way to solve the real-time ETL problem is to not attempt it in the first place. Not every problem actually requires, or can justify the cost of, true real-time data warehousing. For these applications, simply increasing the frequency of the existing data load may be sufficient.

A data load that currently occurs weekly can perhaps be performed daily, or twice a day, instead. A daily data load could be converted to an hourly one. To get around the issue of system downtime, see section 1c below. This approach gives warehouse users access to fresher data than they are used to having, without requiring major modifications to existing load processes, data models, or reporting applications. While not real-time, near real-time may be a good, inexpensive first step.
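As a rough illustration of this approach, the sketch below simply schedules an otherwise unchanged batch load to run every hour rather than nightly. It is written in Java; the runBatchLoad() method is an assumed placeholder for whatever ETL routine the warehouse already uses, not part of any particular product.

    // A minimal sketch of "near real-time" ETL: the existing batch load is
    // simply run on a shorter cycle. runBatchLoad() is an assumed stand-in
    // for the warehouse's existing extract/transform/load routine.
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class NearRealTimeScheduler {
        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
            // Run the unchanged batch load once per hour instead of once per night.
            scheduler.scheduleAtFixedRate(NearRealTimeScheduler::runBatchLoad,
                                          0, 1, TimeUnit.HOURS);
        }

        private static void runBatchLoad() {
            // Invoke the existing extract, transform, and load process here.
            System.out.println("Running scheduled warehouse load...");
        }
    }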

Solution 1b: Direct trickle feed

Assuming that an application requires a true real-time data warehouse, the simplest approach is to continuously feed the data warehouse with new data from the source system. This can be done by either directly inserting or updating data in the warehouse fact tables, or by inserting data into separate fact tables in a real-time partition (see section 2b).

How to actually transport the data between the operational systems and the data warehouse is beyond the scope of this white paper, but several approaches are worth mentioning. There are a number of new real-time data loading packages that are specifically designed for this challenge, including those from DataMirror and MetaMatrix.

EAI vendors such as Tibco provide solutions for real-time data transport. For systems based on the latest Java technologies, the Java Messaging Service (JMS) can be used to transmit each new data element from the source system to a lightweight listener application that in turn inserts the new data into the warehouse tables. Data received over the Internet can be transmitted in XML via HTTP using the SOAP standard, and then loaded into the warehouse.
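As a rough sketch of such a JMS listener, the Java class below inserts each incoming event into a real-time fact table as it arrives. The message fields and the sales_fact_rt table and its columns are illustrative assumptions, not a prescribed schema.

    // A minimal sketch of a JMS listener that trickle-feeds a warehouse fact
    // table over JDBC. The message fields and the sales_fact_rt table are
    // assumptions for illustration only.
    import javax.jms.MapMessage;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    public class TrickleFeedListener implements MessageListener {
        private final Connection warehouse; // JDBC connection to the warehouse

        public TrickleFeedListener(Connection warehouse) {
            this.warehouse = warehouse;
        }

        // Called by the JMS provider for each new event from the source system.
        public void onMessage(Message message) {
            try {
                MapMessage m = (MapMessage) message;
                PreparedStatement insert = warehouse.prepareStatement(
                    "INSERT INTO sales_fact_rt (store_id, product_id, sale_ts, amount) "
                    + "VALUES (?, ?, ?, ?)");
                insert.setLong(1, m.getLong("storeId"));
                insert.setLong(2, m.getLong("productId"));
                insert.setTimestamp(3, new Timestamp(m.getLong("saleTs")));
                insert.setDouble(4, m.getDouble("amount"));
                insert.executeUpdate();
                insert.close();
            } catch (Exception e) {
                // In production the message would be retried or dead-lettered.
                throw new RuntimeException("Failed to load real-time record", e);
            }
        }
    }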

Once the data is near the warehouse, simply inserting the new data in real-time is not particularly challenging. The problem with this approach, which will probably not be readily apparent during development and initial deployment, is that it does not scale well. The same logic as to why data warehouses exist in the first place applies: complex analytical queries do not mix well with continuous inserts and updates. Constantly updating the same tables that are being queried by a reporting or OLAP tool can cause the data warehouse's query performance to degrade.

Under moderate to heavy usage, whether from concurrent warehouse user queries or from the incoming data stream, most relational database management systems will begin to temporarily block the incoming data transactions, and the data will become stale. This may also cause the data loading application to fail if the warehouse becomes unresponsive. As queries slow and the data becomes stale, warehouse users may become frustrated and stop using the system altogether.

Two approaches to real-time data loading that help avert this scalability problem are described in the next section. The issue of real-time data warehouse scalability, along with four additional approaches to building a scalable real-time warehouse system, is discussed in Section 4 of this paper.

Solution 1c: Trickle & Flip

The "Trickle & Flip" approach helps avert the scalability issues associated with querying tables that are being simultaneously updated. Instead of loading the data in real-time into the actual warehouse tables, the data is continuously fed into a staging tables that are in the exact same format as the target tables. Depending on the data modeling approach begin used (see section 2 below), the staging tables either contain a copy of just the data for the current day, or for smaller fact tables can contain a complete copy of all the historical data..

Then on a periodic basis the staging table is duplicated and the copy is swapped with the fact table, bringing the data warehouse instantly up-to-date. If the "integrated real-time partition through views" approach is being used, this operation may simply consist of changing the view definition to point to the updated table instead of the old table. Depending on how the particular RDBMS handles this swap, it might be advisable to temporarily pause the OLAP server while the flip takes place so that no new queries are initiated during the swap.
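The flip itself can be as simple as a pair of table renames issued through JDBC, as in the sketch below. The table names are assumptions, and the exact rename syntax and the transactional behavior of DDL vary by RDBMS, which is one reason pausing the OLAP server during the swap may be advisable.

    // A minimal sketch of the "flip" step: copy the staging table, then swap
    // the copy in for the live real-time fact table. Table names are assumed,
    // and rename/DDL behavior differs between database products.
    import java.sql.Connection;
    import java.sql.Statement;

    public class TrickleAndFlip {
        public static void flip(Connection warehouse) throws Exception {
            Statement stmt = warehouse.createStatement();
            try {
                // Duplicate the staging table so trickle loading can continue against it.
                stmt.executeUpdate(
                    "CREATE TABLE sales_fact_copy AS SELECT * FROM sales_fact_stage");
                // Swap the fresh copy in for the live real-time fact table.
                stmt.executeUpdate("ALTER TABLE sales_fact_rt RENAME TO sales_fact_old");
                stmt.executeUpdate("ALTER TABLE sales_fact_copy RENAME TO sales_fact_rt");
                stmt.executeUpdate("DROP TABLE sales_fact_old");
            } finally {
                stmt.close();
            }
        }
    }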

This approach can be used with cycle times ranging from hourly to every minute. Generally best performance is obtained with 5-10 minute cycles, but 1-2 minute cycles (or even faster) are also possible for smaller data sets or with sufficient database hardware. It is important to test this approach under full load before it is brought into production to find the cycle time that works best for the application.

Solution 1d: External Real-time Data Cache

All of the solutions discussed so far require the data warehouse's underlying database to take on a lot of additional load to handle the incoming real-time data and make it available to warehouse users. The best option in many cases is to store the real-time data in an external real-time data cache (RTDC) outside of the traditional data warehouse, completely avoiding any potential performance problems and leaving the existing warehouse largely as-is.

The RTDC can simply be another database server (or a separate instance of a large database system) dedicated to loading, storing, and processing the real-time data. Applications that either deal with large volumes of real-time data (hundreds or thousands of changes per second), or that require extremely fast query performance, might benefit from using an in-memory database (IMDB) for the real-time data cache. Such IMDBs are provided by companies such as Angara, Cacheflow, Kx, TimesTen, and InfoCruiser.

Regardless of the database that is used to hold the RTDC, its function remains the same. All the real-time data is loaded into the cache as it arrives from the source system. Depending on the approach taken and the analytical tools being used, either all queries that involve the real-time data are directed to the RTDC, or the real-time data required to answer any particular query is seamlessly imaged to the regular data warehouse on a temporary basis to process the query. These two options are described in detail in sections 4c and 4d below.
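A rough sketch of the first option, routing queries that need real-time data to the RTDC and everything else to the warehouse, might look like the following. The two JDBC connections and the caller-supplied routing flag are assumptions; in practice the routing decision is usually handled by the reporting or OLAP tool's own connection mapping.

    // A minimal sketch of query routing between the warehouse and an external
    // real-time data cache (RTDC). The connections and the caller-supplied
    // routing flag are illustrative assumptions.
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class RealTimeQueryRouter {
        private final Connection warehouse;     // historical data warehouse
        private final Connection realTimeCache; // RTDC holding today's data

        public RealTimeQueryRouter(Connection warehouse, Connection realTimeCache) {
            this.warehouse = warehouse;
            this.realTimeCache = realTimeCache;
        }

        // Queries against up-to-the-second data go to the RTDC; all other
        // queries run against the existing warehouse, which is left untouched.
        public ResultSet execute(String sql, boolean needsRealTimeData) throws Exception {
            Connection target = needsRealTimeData ? realTimeCache : warehouse;
            Statement stmt = target.createStatement();
            return stmt.executeQuery(sql);
        }
    }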

With an RTDC, there is no risk of introducing scalability or performance problems on the existing data warehouse, which is particularly important when adding real-time data to an existing production warehouse. Further, queries that access the real-time data will be extremely fast, as they execute in their own environment, separate from the existing warehouse.

This is important as users who need up-to-the-second data typically don't want to wait too long for their queries to return. Also by using just-in-time data merging from the RTDC into the warehouse, or reverse just-in-time merging from the warehouse into the RTDC (described in sections 4d and 4e), queries can access both real-time and historical data seamlessly.

The downside of an RTDC solution, with or without just-in-time data merging, is that it involves an additional database that needs to be installed and maintained. There is also additional work required to configure the applications that need to access the real-time data so that they point to the RTDC. The upside of the RTDC is increased performance, access to up-to-the-second data, and no scalability or performance risk to the existing warehouse. The cost of an RTDC solution is also typically low compared to the cost of adding sufficient hardware and memory to the existing database server to overcome the scalability and performance issues associated with the trickle feed approaches.

Coming Soon: Four More Challenges and their Solutions

Challenge #2: Modeling Real-time Fact Tables

Solution 2a: Modeling as Usual
Solution 2b: Separate Real-time Partition & Attributes
Solution 2c: Integrated Real-time through Views
Solution 2d: Modeling with an External Real-time Data Cache

Challenge #3: OLAP Queries vs. Changing Data

Solution 3a: Use a Near Real-time Approach
Solution 3b: Use an External Real-time Data Cache
Solution 3c: Risk Mitigation for True Real-time

Challenge #4: Scalability & Query Contention

Solution 4a: Simplify & Limit Real-time Reporting
Solution 4b: Apply More Database Horsepower
Solution 4c: Separate & Isolate in a Real-time Data Cache
Solution 4d: Just-in-time Data Merge from External Data Cache
Solution 4e: Reverse Just-in-time Data Merge

Challenge #5: Real-time Alerting

Solution 5a: n-Minute Cycle Batches
Solution 5b: True Real-time Data Monitoring & Triggering
Solution 5c: Real-time Alert Threshold Management

If you would like to receive the future installments of this white paper as they become available, please send an email to realtime_at_enormic.com.

About the Author

Justin Langseth is a founder and CTO of Enormic LLC. Prior to founding Enormic, Mr. Langseth was founder and CTO of Strategy.com, a real-time data analysis and alerting subsidiary of MicroStrategy. At Strategy.com, he designed and implemented systems that provided near real-time and real-time reports and alerts to over 350,000 users, via web, email, wireless, and voice technologies.

About Enormic LLC

Enormic LLC provides system architecture, project management, and implementation consulting services to companies and government agencies implementing real-time data warehouses and alerting systems. Enormic was founded by former executives from MicroStrategy and Strategy.com, with many years of collective experience implementing high-end real-time data warehousing systems. For more information, please visit www.enormic.com or email us at realtime_at_enormic.com.
