The Challenges of Using Data Warehouses
As today's decisions in the business world become more real-time, the systems that support those decisions need to keep up. It is only natural that Data Warehouse, Business Intelligence, Decision Support, and OLAP systems quickly begin to incorporate real-time data. Data warehouses and business intelligence applications are designed to answer exactly the types of questions that users would like to pose against real-time data. They are able to analyze vast quantities of data over time, to determine what is the best offer to make to a customer, or to identify potentially fraudulent, illegal, or suspicious activity. Ad-hoc reporting is made easy using today's advanced OLAP tools. All that needs to be done is to make these existing systems and applications work off real-time data. The following are a few of the challenges of adding real-time data to these systems:
Enabling Real-time ETL
One of the most difficult parts of building any data warehouse is the process of extracting, transforming, cleansing, and loading the data from the source system. Performing ETL of data in real-time introduces additional challenges. Almost all ETL tools and systems, whether based on off-the-shelf products or custom-coded, operate in a batch mode. They assume that the data becomes available as some sort of extract file on a certain schedule, usually nightly, weekly, or monthly. Then the system transforms and cleanses the data and loads it into the data warehouse.
This process typically involves downtime of the data warehouse, so no users are able to access it while the load takes place. Since these loads are usually performed late at night, this scheduled downtime typically does not inconvenience many users.
When loading data continuously in real-time, there can't be any system downtime. The heaviest periods in terms of data warehouse usage may very well coincide with the peak periods of incoming data. The requirements for continuous updates with no warehouse downtime are generally inconsistent with traditional ETL tools and systems. Fortunately, there are new tools on the market that specialize in real-time ETL and data loading. There are also ways of modifying existing ETL systems to perform real-time or near real-time warehouse loading.
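One way to adapt a batch-style loader for near real-time use is to load small "micro-batches" in short transactions, so the warehouse is never taken offline for a long load window. The sketch below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the warehouse, and the table name sales_fact, the row layout, the clean() rule, and the batch size are all hypothetical.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

# Hypothetical row layout: (sale_time, store, amount).
Row = Tuple[str, str, float]


def micro_batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Group an incoming stream of rows into small batches."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def clean(row: Row) -> Row:
    """Illustrative cleansing step: normalize the store code."""
    sale_time, store, amount = row
    return (sale_time, store.strip().upper(), amount)


def load_continuously(conn: sqlite3.Connection, rows: Iterable[Row],
                      batch_size: int = 2) -> int:
    """Load rows in short transactions; returns the number of batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(sale_time TEXT, store TEXT, amount REAL)"
    )
    batches = 0
    for batch in micro_batches(rows, batch_size):
        with conn:  # commits after each micro-batch, rolls back on error
            conn.executemany(
                "INSERT INTO sales_fact VALUES (?, ?, ?)",
                [clean(r) for r in batch],
            )
        batches += 1
    return batches
```

Keeping each transaction small limits how long any one load can hold locks, at the cost of more commits. The right batch size depends on the database and the workload.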
Modeling Real-time Fact Tables
The introduction of real-time data into an existing data warehouse, or the modeling of real-time data for a new data warehouse, brings up some interesting data modeling issues. For instance, a warehouse that has all of its data aggregated at various levels based on a time dimension needs to consider the possibility that the aggregated information may be out of sync with the real-time data. Also, some metrics such as month-to-date and week-to-date may behave strangely with a partial day of data that continuously changes. The main modeling issue, however, revolves around where the real-time data is stored and how best to link it into the rest of the data model.
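One possible answer to the "where is the real-time data stored" question is to keep the current day's rows in a separate table and link it to the historical fact table through a view. The sketch below illustrates that idea only; it uses sqlite3 as a stand-in database, and the table names, the view name, and the is_partial_day flag are assumptions rather than anything prescribed by a particular product.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create a toy warehouse with separate historical and real-time tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales_fact_history (sale_date TEXT, store TEXT, amount REAL);
        CREATE TABLE sales_fact_today   (sale_date TEXT, store TEXT, amount REAL);

        -- The view links both tables and marks rows from the partial day.
        CREATE VIEW sales_fact_all AS
            SELECT sale_date, store, amount, 0 AS is_partial_day
            FROM sales_fact_history
            UNION ALL
            SELECT sale_date, store, amount, 1 AS is_partial_day
            FROM sales_fact_today;
    """)
    conn.executemany(
        "INSERT INTO sales_fact_history VALUES (?, ?, ?)",
        [("2024-03-01", "east", 100.0), ("2024-03-02", "east", 50.0)],
    )
    # Rows for the current, still-changing day.
    conn.execute(
        "INSERT INTO sales_fact_today VALUES (?, ?, ?)",
        ("2024-03-03", "east", 25.0),
    )
    conn.commit()
    return conn


def month_to_date(conn: sqlite3.Connection, month: str,
                  include_partial_day: bool) -> float:
    """Month-to-date sales, with or without the partial current day."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales_fact_all "
        "WHERE sale_date LIKE ? AND (is_partial_day = 0 OR ? = 1)",
        (month + "%", int(include_partial_day)),
    ).fetchone()
    return row[0]
```

With this layout, a report can state explicitly whether a month-to-date figure includes the partial day, instead of silently mixing stable and changing data.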
OLAP Queries vs. Changing Data
OLAP and Query tools were designed to operate on top of unchanging, static historical data. Since they assume that the underlying data is not changing, they don't take any precautions to ensure that the results they produce are not negatively influenced by data changes concurrent to query execution. In some cases, this can lead to inconsistent and confusing query results.
Relational OLAP tools are particularly sensitive to this problem because they perform all but the simplest data analysis operations by issuing multi-pass SQL. A multi-pass SQL statement is made up of many smaller SQL statements that sequentially operate on a set of temporary tables.
This presents two problems. The first problem is that the results of a query that takes even one minute to run are arguably not exactly real-time anymore. While this data latency may be acceptable to a retail division manager, it might not be tolerable for an application that is looking for atmospheric trends that indicate the presence of a tornado, or for an application detecting real-time credit card or telecommunications fraud.
The second problem is that given the multiple passes of SQL required to perform almost any relational OLAP reporting or analytical operation, any real-time warehouse is likely to suffer from the result set internal inconsistency issue discussed above. There's nothing like the numbers not adding up properly to make a user skeptical of a report. For more complex product affinity or trend detection analytics, the results may be so confusing as to be completely useless.
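The inconsistency can be reproduced with a toy two-pass report. In the sketch below, pass 1 writes per-store totals to a temporary table, a new row arrives, and pass 2 computes the grand total from the live fact table. The optional high-water mark (restricting every pass to rows that existed when the report started) is one possible mitigation shown purely as an assumption; the schema and the function names are hypothetical, and sqlite3 stands in for the warehouse.

```python
import sqlite3
from typing import Tuple


def setup() -> sqlite3.Connection:
    """Create a toy fact table with two rows already loaded."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_fact "
        "(id INTEGER PRIMARY KEY, store TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sales_fact (store, amount) VALUES (?, ?)",
        [("east", 100), ("west", 50)],
    )
    conn.commit()
    return conn


def report(conn: sqlite3.Connection, arriving_row: Tuple[str, int],
           use_watermark: bool) -> Tuple[int, int]:
    """Run a two-pass report; returns (sum of store totals, grand total)."""
    # Remember the highest row id present when the report starts.
    watermark = conn.execute("SELECT MAX(id) FROM sales_fact").fetchone()[0]
    limit = watermark if use_watermark else 2**62  # 2**62 = effectively no limit

    # Pass 1: per-store totals go into a temporary table.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS store_totals (store TEXT, total INTEGER)"
    )
    conn.execute("DELETE FROM store_totals")
    conn.execute(
        "INSERT INTO store_totals "
        "SELECT store, SUM(amount) FROM sales_fact WHERE id <= ? GROUP BY store",
        (limit,),
    )

    # A real-time insert lands between the two passes.
    conn.execute(
        "INSERT INTO sales_fact (store, amount) VALUES (?, ?)", arriving_row
    )

    # Pass 2: the grand total is computed from the (now changed) fact table.
    grand_total = conn.execute(
        "SELECT SUM(amount) FROM sales_fact WHERE id <= ?", (limit,)
    ).fetchone()[0]
    sum_of_parts = conn.execute(
        "SELECT SUM(total) FROM store_totals"
    ).fetchone()[0]
    return sum_of_parts, grand_total
```

Without the high-water mark, the store totals add up to 150 while the grand total reads 175, which is exactly the kind of report where the numbers do not add up. With it, both passes see the same rows and agree on 150, at the price of ignoring the newest data for that report.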
Scalability & Query Contention
The issue of query contention and scalability is the most difficult issue facing organizations deploying real-time data warehouse solutions. Data warehouses were separated from transactional systems in the first place because the types of complex analytical queries run against warehouses don't "play well" with lots of simultaneous inserts, updates, or deletes.
Usually the scalability of data warehouse and OLAP solutions is a direct function of the amount of data being queried and the number of users simultaneously running queries. Given a fixed amount of data, query response time is proportional to the number of users on the system. Heavy concurrent usage causes reports to take longer to execute.
While this is still true in a real-time system, the additional burden of continuously loading and updating data further strains system resources. Unfortunately, because of the contention between data inserts and typical OLAP select statements, the additional burden of a continuous data load is not just equivalent to one or two additional simultaneously querying users. While it depends on the database, the contention between complex selects and continuous inserts tends to severely limit scalability. Surprisingly quickly, the continuous data loading process may become blocked, or what used to be fast queries may begin to take intolerably long to return.
Real-time Alerting
Most alerting applications associated with data warehouses to date have been mainly used to distribute email versions of reports after the nightly data warehouse load. The availability of real-time data in a data warehouse makes alerting applications much more appealing, as users can be alerted to real-time conditions as they occur in the warehouse, not just on a nightly basis.
The availability of real-time data makes products such as MicroStrategy's NarrowCaster and similar products from Cognos and Business Objects very valuable. But real-time alerting using these products brings its own set of challenges, as surprisingly these products, like many query tools from the same vendors, were not designed to operate on or tested against real-time data feeds.
These products operate on a schedule or event basis, so they can either trigger an alert every few minutes or hours, or need to be triggered by an external system. Solutions to address these challenges are available in the market. There is also the issue of threshold management. When alerts are triggered frequently, there needs to be a mechanism in place to make sure that once an alert is sent due to a condition in the warehouse, the same alert is not sent over and over again during each alerting cycle.
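The threshold management mechanism described here can be sketched as a small piece of state that remembers which conditions are already in alert. The example below is a minimal illustration, not how any of the named products work; the class name ThresholdAlerter, its fields, and the rule of re-arming an alert once the value drops back under the threshold are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ThresholdAlerter:
    """Sends an alert only when a condition first becomes true."""

    threshold: float
    # Tracks, per monitored key, whether the condition held last cycle.
    _active: Dict[str, bool] = field(default_factory=dict)

    def check(self, key: str, value: float) -> bool:
        """Return True if an alert should be sent in this alerting cycle."""
        breached = value > self.threshold
        was_active = self._active.get(key, False)
        self._active[key] = breached
        # Alert on the transition into the breached state only, so the
        # same condition is not re-sent on every cycle while it persists.
        return breached and not was_active
```

A usage example: with a threshold of 100, successive readings of 90, 120, 130, 80, and 110 for the same key produce alerts only on the second and fifth cycles. The reading of 130 is suppressed because the condition was already reported, and the drop to 80 re-arms the alert.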
As we have seen in this article, real-time data warehousing and OLAP are possible using today's technology, but challenges lurk at seemingly every step of the way. For the determined team armed with the right knowledge and experience, it is possible to make real-time reporting, analysis, and alerting systems work. The challenge is to make the right tradeoffs along the way, to make sure the systems meet the needs of the user base while ensuring that they don't collapse under their own weight or cause existing production warehouses to malfunction.
It is likely that a lot of the challenges discussed above will become less challenging over time, as database, ETL, OLAP, reporting, and alerting tool vendors begin to add features to their systems to make them work better with real-time data streams. In the meantime, it is important to make sure real-time warehousing systems are well planned and designed, and thoroughly tested under realistic data and user load conditions before they are deployed.
The benefits of data warehousing in real-time are becoming clearer every day. With the right tools, designs, advice, approaches, and in some cases tricks, real-time data warehousing is possible using today's technologies, and will only become easier in the future.