Protecting the Data Warehouse
During the late 1980s and early 1990s, data warehouse systems were introduced to provide "big picture" and historical views of the business. Before the introduction of integrated decision support systems and data warehouses, business users did not have the best information available to make informed decisions. When the value of data warehouse information became clear, data warehouses were used not only to measure the performance of the business, but to change that performance by directing day-to-day tactical decisions. Once data warehouses emerge from the IT sandbox and are used for day-to-day decisions, they become mission critical and must be protected.
Because warehouse data is mission critical, the data must be copied in a way that enables it to be restored in hours rather than days. Most mission critical data warehouses are multiple terabytes in size, which generally eliminates tape backup as a recovery strategy. Even if restore time were not an issue, the availability and performance of high capacity disks have shifted the total cost of ownership (TCO) in favor of disk technology.
The data warehouse architect must decide whether to use local disk replication, remote disk replication, or both. Each option has advantages and disadvantages, and the best protection and highest service levels are achieved by using both. The primary advantage of local replication is speed, since it occurs within the host computer or on a high-speed local network. Its disadvantage is that it does not protect data from the loss of the data center itself. Remote replication has slower response time because of the increased distance and the speed of light limitation, but it does protect the data against loss of the primary data center.
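To put a rough number on the speed of light limitation, the sketch below estimates the minimum round-trip delay over a fiber link. It assumes a signal speed of about 200,000 km/s in optical fiber (roughly two-thirds of the speed of light in a vacuum) and ignores switching, protocol, and storage array overhead, so real response times will be higher. The function name and the sample distances are illustrative only.

```python
# Approximate signal speed in optical fiber: about two-thirds of c in a vacuum.
FIBER_KM_PER_MS = 200.0  # ~200,000 km/s, expressed as km per millisecond


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring equipment delay."""
    return 2 * distance_km / FIBER_KM_PER_MS


# Illustrative distances: same campus, same region, cross-country.
for km in (1, 100, 1000):
    print(f"{km:>5} km: at least {min_round_trip_ms(km):.2f} ms per round trip")
```

If every write must be acknowledged by the remote site before it completes, this delay is added to each write, which is why distance matters far more for remote replication than for local replication.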
If local data replication is one of the protection methods that will be used, there are many ways to accomplish it. Operating system replication, third party software, and storage based replication can all accomplish the task. Every large, mission critical data warehouse that the author has been involved with in the last five years has used some form of storage-based replication to protect data. In addition, storage-based local replication is used to decrease ETL process time while minimizing the impact on the host system. Storage-based local data replication also protects the data warehouse and enables fast recovery from database corruption, operator error, procedural error, and erroneous load data. TimeFinder from EMC is the leading storage-based replication product. The power of TimeFinder is its ability to capture the changed data tracks between the source and replicated data. When the time comes to synchronize the source and replicated data, only the changed tracks are moved, saving considerable replication and synchronization time.
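The changed-track idea can be illustrated with a toy model. The sketch below is not TimeFinder's implementation or interface; it is a minimal illustration, with hypothetical class and function names, of why tracking changed tracks makes resynchronization cheap: only tracks written since the last sync are copied.

```python
class TrackedVolume:
    """Toy source volume that records which tracks changed since the last sync."""

    def __init__(self, num_tracks: int):
        self.tracks = [b""] * num_tracks
        self.changed = set()  # indices of tracks written since the last sync

    def write(self, index: int, data: bytes) -> None:
        self.tracks[index] = data
        self.changed.add(index)


def resync(source: TrackedVolume, replica: list) -> int:
    """Copy only the changed tracks to the replica; return how many were moved."""
    moved = 0
    for index in sorted(source.changed):
        replica[index] = source.tracks[index]
        moved += 1
    source.changed.clear()  # replica is now current, so start tracking afresh
    return moved


source = TrackedVolume(num_tracks=1000)
replica = [b""] * 1000

source.write(3, b"orders")
source.write(42, b"customers")
source.write(3, b"orders-v2")  # same track written again: still one changed track

moved = resync(source, replica)
print(f"moved {moved} of {len(replica)} tracks")
```

In this model the cost of a resync depends on how many tracks changed, not on the size of the volume, which is the property the paragraph above describes.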
Symmetrix Remote Data Facility (SRDF) is to remote data replication what TimeFinder is to local data replication. Because SRDF replicates data over a long distance, SRDF protects warehouse data against loss due to a data center disaster. Most data warehouse sites using SRDF use the remote site as an active site for both processing and disaster recovery. The combination of TimeFinder and SRDF offers the ability to rapidly recover from local data loss or the loss of a data center, and is the preferred data protection method for mission critical data warehouses.
Meta data is often discussed in connection with data warehouses. While it is understood that knowledge workers need meta data to perform their functions, other individuals who work with the data warehouse can also reap great rewards from it. This article is about how data warehouse developers can make use of meta data to "protect" the data warehouse.
Knowledge Workers need to know:
- the business definitions of the warehouse data
- where the data was extracted from
- the logic that was used to transform the data from operational to warehouse format
- the valid values of the data
- where the data is located and how they can get access to it, and so on
As a matter of common practice, the meta data to support the data warehouse is stored in some sort of meta data repository. The repository can exist in several warehouse building products or it can be stored in a centralized repository that brings together disparate meta data (much like a warehouse project of its own). This topic has been written about and presented many times.
However, one item that is almost never mentioned is that warehouse developers also need support in the form of meta data.
A centralized meta data repository (or a coordinated set of product repositories) can be used to avert potential disasters by giving the developers an early "heads up" when operational system changes may affect the loading process.
In a perfect world, warehouse developers would be notified when changes are being made to operational systems and data that may impact the warehouse building process. In the real world, this type of communication does not always exist. That is why "warehouse protection" using the meta data repository is important.
The Set Up
To "protect" the data warehouse, there are five primary types of meta data that must be stored and kept current in the repository:
- file definitions for operational data sources that feed the warehouse; this includes database and table names, copybook names, or any name that represents the data definition
- the application programs that reference these operational data sources
- jobs that execute the application programs that reference these operational data sources
- the relationship between file definitions and the programs that reference them
- the relationship between the programs and the jobs that reference them
By capturing these types of meta data, warehouse developers can effectively monitor the operational systems for changes that may impact data that feeds the warehouse.
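As a rough illustration of how these five types of meta data fit together, the sketch below models the two relationships (file definition to program, program to job) as in-memory lookups and answers the question "if this file definition changes, which programs and jobs are affected?" All file, program, and job names are hypothetical, and a real repository would hold this in a database or a repository product rather than in Python dictionaries.

```python
from collections import defaultdict

# Relationship meta data. All names below are hypothetical examples.
programs_by_file = defaultdict(set)  # file definition -> programs that reference it
jobs_by_program = defaultdict(set)   # program -> jobs that execute it


def register(file_def: str, program: str, job: str) -> None:
    """Record that a job runs a program that references a file definition."""
    programs_by_file[file_def].add(program)
    jobs_by_program[program].add(job)


def impact_of_change(file_def: str):
    """Return (programs, jobs) touched by a change to a file definition."""
    programs = programs_by_file.get(file_def, set())
    jobs = set()
    for program in programs:
        jobs |= jobs_by_program.get(program, set())
    return sorted(programs), sorted(jobs)


register("CUSTOMER_COPYBOOK", "CUSTEXTR", "NIGHTLY_LOAD")
register("CUSTOMER_COPYBOOK", "CUSTRPT", "WEEKLY_REPORT")
register("ORDER_TABLE", "ORDEXTR", "NIGHTLY_LOAD")

programs, jobs = impact_of_change("CUSTOMER_COPYBOOK")
print("programs affected:", programs)
print("jobs affected:", jobs)
```

A developer who learns that the customer copybook is changing could run a lookup like this to see which extract programs and load jobs need review before the next warehouse load.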
The sandbox days for data warehouses are over. Client demand for higher data warehouse availability requires that data warehouses be quickly recoverable. Investigate advanced technologies for eliminating or minimizing downtime.