Re: Real-time Data Warehousing: 5 Challenges & Solutions

From: David Penney <anon_at_noaddress.com>
Date: Sat, 2 Mar 2002 13:15:54 -0000
Message-ID: <zBpg8.29534$jM1.289199_at_NewsReader>


Justin,

I enjoyed your paper & you invited comment ;) - you are missing another option. Take a look at http://www.metamatrix.com & download our Enterprise Information Integration (EII) & Model Driven Architecture for Data Integration papers from the white papers section.

The option you are missing is the ability to extend the data warehouse to include operational, transactional & messaging systems in a seamless way - this allows you, for example, to create a report that takes summary info from the d/whse & adds data from a mainframe, other processing systems/databases & even real-time market data feeds from, say, Reuters or MQ-Series.

We are leading the field in what Gartner refers to as EII, what MetaGroup refers to as Distributed Information Access & what Giga calls Internet Data Integration. Doug Laney & Wayne Eckerson have written some very good papers on the subject.

I believe data warehouses will continue to provide valuable services - but companies often build them without knowing their value, & with no other technical solution presented to them. EII systems provide a new way to approach the ROI question, justify expenditures & accelerate delivery of integrated information.

An EII platform should allow you to implement a virtual d/whse *very* quickly & then analyse a) whether the big project will be worthwhile & b) how to target the d/whse project only at the data that MUST reside in the d/whse. Overall, huge savings in time/money - or, perhaps more important, for the same cost you can be much more ambitious in your project. Working this way, the EII platform is also a natural way to perform some of the data movements into the warehouse - although it is not usually positioned as ETL.

Effectively, all these sources look as if they are contained in one unified source.

In this way you can use the warehouse for its best purpose & still access the operational systems for up-to-date info - in a single operation, as if they were already in a single database. This truly extends the data warehouse into the real-world information sources where data is updated in real time, without sacrificing the best features of either.

<ADV: Now for a small advertisement :)
Not only do we have these properties, & more; the MetaMatrix Information Integration Server is secure & hugely scalable, with the best references possible in Investment Banking & Fed. Govt. Our award-winning product implements standards from the OMG work on modeling, metadata (MOF, CWMI, XMI) & design (MDA), & our implementation is J2EE - entirely Java. Our customers create implementation-independent models of how they prefer data to look & then map that to the real implementations (regardless of data warehouse, database, API, messages, legacy access etc). This abstract data model (a Virtual View) is used to access all the data, immediately, & creates the effect of a Virtual Database. i.e. you get the structure & content as you expect it, but have the option to leave the information in its original source - that means you can add a new source to a Virtual Database in minutes (rather than engage in a very long d/whse extension project).
We refer to this as Model Driven Data Integration. /ADV> Good luck in your new venture - bear us in mind when you have an intractable data integration problem that needs a response in days, secure & scalable. By all means, leave a message for me at our website or here.

Regards,

             David Penney
             CTO - Europe
             MetaMatrix.

"Justin Langseth" <realtime_at_enormic.com> wrote in message news:8su98.3150$ps.336740_at_news.direcpc.com...
> -> Attached is the first of 5 sections of a whitepaper
> -> that we are writing to address the many issues
> -> associated with Real-time Data Warehousing. Please send
> -> any questions, comments, or suggestions to realtime_at_enormic.com.
>
> Real-time Data Warehousing: Challenges & Solutions
> Justin Langseth, CTO, Enormic LLC
> (c) Copyright 2002 Enormic LLC
>
> Today's Real-time Requirement
>
> Traditionally, data warehouses do not contain today's data. They are usually
> loaded with data from operational systems at most weekly or in some cases
> nightly, but are in any case a window on the past. The fast pace of business
> today is quickly making these historical systems less valuable for the issues
> facing managers and government officials in the real world. Morning sales on
> the east coast will affect how stores are stocked on the west coast.
> Airlines and government agencies need to be able to analyze the most current
> information when trying to detect suspicious groups of passengers or
> potentially illegal activity. Fast-paced changes in the financial markets
> may make the personalized suggestions on a stockbroker's website obsolete by
> the time they are viewed.
>
> As today's decisions in the business world become more real-time, the
> systems that support those decisions need to keep up. It is only natural
> that Data Warehouse, Business Intelligence, Decision Support, and OLAP
> systems quickly begin to incorporate real-time data.
>
> Data warehouses and business intelligence applications are designed to
> answer exactly the types of questions that users would like to pose against
> real-time data. They are able to analyze vast quantities of data over time,
> to determine the best offer to make to a customer, or to identify
> potentially fraudulent, illegal, or suspicious activity. Ad-hoc reporting is
> made easy using today's advanced OLAP tools. All that needs to be done is to
> make these existing systems and applications work off real-time data.
>
> This white paper examines the challenges of adding real-time data to these
> systems, and presents several approaches to making real-time warehousing a
> reality today.
>
> Challenge #1: Enabling Real-time ETL
>
> One of the most difficult parts of building any data warehouse is the
> process of extracting, transforming, cleansing, and loading the data from
> the source system. Performing ETL of data in real-time introduces additional
> challenges. Almost all ETL tools and systems, whether based on off-the-shelf
> products or custom-coded, operate in a batch mode. They assume that the data
> becomes available as some sort of extract file on a certain schedule,
> usually nightly, weekly, or monthly. Then the system transforms and cleanses
> the data and loads it into the data warehouse.
>
> This process typically involves downtime of the data warehouse, so no users
> are able to access it while the load takes place. Since these loads are
> usually performed late at night, this scheduled downtime typically does not
> inconvenience many users.
>
> When loading data continuously in real-time, there can't be any system
> downtime. The heaviest periods of data warehouse usage may very
> well coincide with the peak periods of incoming data. The requirements for
> continuous updates with no warehouse downtime are generally inconsistent
> with traditional ETL tools and systems. Fortunately, there are new tools on
> the market which specialize in real-time ETL and data loading. There are
> also ways of modifying existing ETL systems to perform real-time or near
> real-time warehouse loading. Some of these tools and techniques are
> described below.
>
> Solution 1a: "Near Real-time" ETL
>
> The cheapest and easiest way to solve the real-time ETL problem is not to
> attempt it in the first place. Not every problem actually requires, or
> can justify the costs of, true real-time data warehousing. For these
> applications, simply increasing the frequency of the existing data load may
> be sufficient.
>
> A data load that currently occurs weekly can perhaps be performed instead
> daily, or twice a day. A daily data load could be converted to an hourly
> data load. To get around the issue of system downtime, see section 1c below.
> This approach gives the warehouse users access to fresher data than they
> are used to having, without requiring major modifications to
> existing load processes, data models, or reporting applications. While not
> real-time, near real-time may be a good, inexpensive first step.
>
> Solution 1b: Direct trickle feed
>
> Assuming that an application requires a true real-time data warehouse, the
> simplest approach is to continuously feed the data warehouse with new data
> from the source system. This can be done by either directly inserting or
> updating data in the warehouse fact tables, or by inserting data into
> separate fact tables in a real-time partition (see section 2b).
>
> How to actually transport the data between the operational systems and the
> data warehouse is beyond the scope of this white paper, but several
> approaches are worth mentioning. There are a number of new real-time data
> loading packages that are specifically designed for this challenge,
> including those from DataMirror and MetaMatrix.
>
> EAI vendors such as Tibco provide solutions for real-time data transport.
> For systems based on the latest Java technologies, the Java Message Service
> (JMS) can be used to transmit each new data element from the source system
> to a lightweight listener application that in turn inserts the new data into
> the warehouse tables. For data that is received over the Internet, the data
> can be transmitted in XML via HTTP using the SOAP standard, and then loaded
> into the warehouse.
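The listener pattern described above can be sketched as follows. This is a minimal illustration only, not tied to any particular JMS provider: the message stream is simulated with a standard in-process queue, and the table and field names are hypothetical.

```python
import json
import queue
import sqlite3

# Minimal sketch of a lightweight listener that trickle-feeds incoming
# messages into a warehouse fact table. The JMS topic is simulated with
# a queue.Queue; table and field names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_fact (order_id INTEGER, amount REAL)")

incoming = queue.Queue()
for msg in ('{"order_id": 1, "amount": 99.50}',
            '{"order_id": 2, "amount": 12.00}'):
    incoming.put(msg)

def on_message(message, conn):
    """Parse one message and insert it into the fact table as it arrives."""
    record = json.loads(message)
    conn.execute("INSERT INTO orders_fact VALUES (:order_id, :amount)", record)
    conn.commit()

while not incoming.empty():
    on_message(incoming.get(), conn)
```

In a real deployment the loop would block on the message broker rather than drain a local queue, but the insert-per-message shape is the same.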
>
> Once the data is near the warehouse, simply inserting the new data in
> real-time is not particularly challenging. The problem with this approach,
> which will probably not be readily apparent during development and initial
> deployment, is that it does not scale well. The same logic that explains why
> data warehouses exist in the first place applies here: complex analytical
> queries do not mix well with continuous inserts and updates. Constantly
> updating the same tables that are being queried by a reporting or OLAP tool
> can cause the data warehouse's query performance to degrade.
>
> Under moderate to heavy usage, either from concurrent warehouse user queries
> or from the incoming data stream, most relational database management
> systems will begin to temporarily block the incoming data transactions,
> and the data will become stale. This may also cause the data
> loading application to fail if the warehouse becomes
> unresponsive. As queries slow and the data becomes stale,
> the warehouse users may become frustrated and stop using the system
> altogether.
>
> Two approaches to real-time data loading that help avert this scalability
> problem are described in the next section. Also, the issue of real-time data
> warehouse scalability, and four additional approaches to building a scalable
> real-time warehouse system, are discussed in Section 4 of this paper.
>
> Solution 1c: Trickle & Flip
>
> The "Trickle & Flip" approach helps avert the scalability issues
 associated
> with querying tables that are being simultaneously updated. Instead of
> loading the data in real-time into the actual warehouse tables, the data
 is
> continuously fed into a staging tables that are in the exact same format
 as
> the target tables. Depending on the data modeling approach begin used (see
> section 2 below), the staging tables either contain a copy of just the
 data
> for the current day, or for smaller fact tables can contain a complete
 copy
> of all the historical data..
>
> Then on a periodic basis the staging table is duplicated and the copy is
> swapped with the fact table, bringing the data warehouse instantly
> up-to-date. If the "integrated real-time partition through views" approach
> is being used, this operation may simply consist of changing the view
> definition to include the updated table instead of the old table. Depending
> on how this swap is handled by the particular RDBMS, it might be advisable
> to temporarily pause the OLAP server while this flip takes place, so that no
> new queries are initiated while the swap occurs.
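The duplicate-and-swap step might be sketched as follows, using SQLite purely for illustration and hypothetical table names; the exact swap mechanism (rename, partition exchange, view redefinition) varies by RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Fact table queried by users, and a staging table in the identical
# format that receives the continuous trickle feed. Names are hypothetical.
cur.execute("CREATE TABLE sales_fact (sale_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE sales_fact_staging (sale_id INTEGER, amount REAL)")

# Trickle: real-time inserts go only to the staging table.
cur.executemany("INSERT INTO sales_fact_staging VALUES (?, ?)",
                [(1, 10.0), (2, 25.5)])

def flip(cur):
    """Duplicate the staging table and swap the copy in for the fact
    table, bringing the warehouse instantly up-to-date."""
    cur.execute("CREATE TABLE sales_fact_new AS SELECT * FROM sales_fact_staging")
    cur.execute("DROP TABLE sales_fact")
    cur.execute("ALTER TABLE sales_fact_new RENAME TO sales_fact")

flip(cur)  # run once per cycle, e.g. every 5-10 minutes
```

Because the trickle feed only ever touches the staging table, user queries against the fact table never contend with the continuous inserts.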
>
> This approach can be used with cycle times ranging from hourly to every
> minute. Generally the best performance is obtained with 5-10 minute cycles,
> but 1-2 minute cycles (or even faster) are also possible for smaller data
> sets or with sufficient database hardware. It is important to test this
> approach under full load before it is brought into production, to find the
> cycle time that works best for the application.
>
> Solution 1d: External Real-time Data Cache
>
> All of the solutions discussed so far involve the data warehouse's
> underlying database taking on a lot of additional load to deal with the
> incoming real-time data and making it available to warehouse users. The
> best option in many cases is to store the real-time data in an external
> real-time data cache (RTDC) outside of the traditional data warehouse,
> completely avoiding any potential performance problems and leaving the
> existing warehouse largely as-is.
>
> The RTDC can simply be another database server (or a separate
> instance of a large database system) dedicated to loading, storing, and
> processing the real-time data. Applications that either deal with large
> volumes of real-time data (hundreds or thousands of changes per second), or
> that require extremely fast query performance, might benefit from
> using an in-memory database (IMDB) for the real-time data cache. Such IMDBs
> are provided by companies such as Angara, Cacheflow, Kx, TimesTen, and
> InfoCruiser.
>
> Regardless of the database that is used to hold the RTDC, its function
> remains the same. All the real-time data is loaded into the cache as it
> arrives from the source system. Depending on the approach taken and the
> analytical tools being used, either all queries that involve the real-time
> data are directed to the RTDC, or the real-time data required to answer any
> particular query is seamlessly imaged to the regular data warehouse on a
> temporary basis to process the query. These two options are described in
> detail in sections 4c and 4d below.
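The first option, directing queries that touch real-time data to the RTDC, amounts to a routing rule. A minimal sketch, with hypothetical table names, might look like this:

```python
# Sketch of query routing: queries that reference any real-time table go
# to the real-time data cache, everything else to the warehouse.
# Table names are hypothetical.
REALTIME_TABLES = {"orders_today", "trades_live"}

def route_query(tables_referenced):
    """Return which system should execute a query, based on the set of
    tables it references."""
    if REALTIME_TABLES & set(tables_referenced):
        return "rtdc"
    return "warehouse"

print(route_query({"orders_today", "customers"}))  # -> rtdc
print(route_query({"orders_history"}))             # -> warehouse
```

In practice this decision is usually made by the BI tool's semantic layer or a query broker, but the underlying rule is the same.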
>
> Using an RTDC, there's no risk of introducing scalability or performance
> problems on the existing data warehouse, which is particularly important
> when adding real-time data to an existing production warehouse. Further,
> queries that access the real-time data will be extremely fast, as they
> execute in their own environment separate from the existing warehouse.
>
> This is important, as users who need up-to-the-second data typically don't
> want to wait long for their queries to return. Also, by using
> just-in-time data merging from the RTDC into the warehouse, or reverse
> just-in-time merging from the warehouse into the RTDC (described in sections
> 4d and 4e), queries can access both real-time and historical data
> seamlessly.
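The just-in-time merge idea can be illustrated very simply: at query time, fresh rows from the RTDC are combined with historical rows from the warehouse, so the user sees one seamless answer. The data values below are invented for illustration.

```python
# Sketch of a just-in-time merge across historical and real-time data.
# Hypothetical (date, amount) rows:
warehouse_rows = [("2002-03-01", 120.0), ("2002-03-02", 80.0)]  # historical
rtdc_rows = [("2002-03-02", 15.5)]  # arrived since the last warehouse load

def total_sales(warehouse, cache):
    """Answer an aggregate query seamlessly across both stores."""
    return (sum(amount for _, amount in warehouse) +
            sum(amount for _, amount in cache))

print(total_sales(warehouse_rows, rtdc_rows))  # -> 215.5
```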
>
> The downside of using an RTDC solution, with or without just-in-time data
> merging, is that it involves an additional database that needs to be
> installed and maintained. There is also additional work required to
> configure the applications that need to access the real-time data so that
> they point to the RTDC. The upside of the RTDC is increased performance,
> access to up-to-the-second data, and no scalability or performance risk on
> the existing warehouse. Also, the cost of an RTDC solution is typically low
> compared to the cost of adding sufficient hardware and memory to the
> existing database server to overcome the scalability and performance issues
> associated with the trickle-feed approaches.
>
> Coming Soon: Four More Challenges and their Solutions
>
> Challenge #2: Modeling Real-time Fact Tables
> Solution 2a: Modeling as Usual
> Solution 2b: Separate Real-time Partition & Attributes
> Solution 2c: Integrated Real-time through Views
> Solution 2d: Modeling with an External Real-time Data Cache
>
> Challenge #3: OLAP Queries vs. Changing Data
> Solution 3a: Use a Near Real-time Approach
> Solution 3b: Use an External Real-time Data Cache
> Solution 3c: Risk Mitigation for True Real-time
>
> Challenge #4: Scalability & Query Contention
> Solution 4a: Simplify, & Limit Real-time Reporting
> Solution 4b: Apply More Database Horsepower
> Solution 4c: Separate & Isolate in a Real-time Data Cache
> Solution 4d: Just-in-time Data Merge from External Data Cache
> Solution 4e: Reverse Just-in-time Data Merge
>
> Challenge #5: Real-time Alerting
>
> Solution 5a: n-Minute Cycle Batches
> Solution 5b: True real-time data monitoring & triggering
> Solution 5c: Real-time Alert Threshold Management
>
> If you would like to receive the future installments of this white paper as
> they become available, please send an email to realtime_at_enormic.com.
>
> About the Author
>
> Justin Langseth is a founder and CTO of Enormic LLC. Prior to founding
> Enormic, Mr. Langseth was founder and CTO of Strategy.com, a real-time data
> analysis and alerting subsidiary of MicroStrategy. At Strategy.com, he
> designed and implemented systems that provided near real-time and real-time
> reports and alerts to over 350,000 users, via web, email, wireless, and
> voice technologies.
>
> About Enormic LLC
>
> Enormic LLC provides system architecture, project management, and
> implementation consulting services to companies and government agencies
> implementing real-time data warehouses and alerting systems. Enormic was
> founded by former executives from MicroStrategy and Strategy.com, with many
> years of collective experience implementing high-end real-time data
> warehousing systems. For more information, please visit www.enormic.com or
> email us at realtime_at_enormic.com.
>
Received on Sat Mar 02 2002 - 14:15:54 CET
