When we contrast the data warehouse with a transaction-oriented online system, the differences become apparent. These differences are summarized in Table 10.1.
Table 10.1 Differences between OLTP and data warehouse.
| Characteristic | OLTP | Data Warehouse |
| --- | --- | --- |
| Normalization | High (3NF) | Low (1NF) |
| Table sizes | Small | Large |
| Number of rows/table | Small | Large |
| Size/duration of transactions | Small | Large |
| Number of online users | High (1,000s) | Low (< 100) |
| Updates | Frequent | Nightly |
| Full-table scans | Rarely | Frequently |
| Historical data | < 90 days | Years |
Aside from the different uses for data warehouses, many developers use relational databases to build their warehouses, applying design techniques that simulate multiple dimensions. This push toward STAR schema design has been somewhat successful, especially since designers do not have to buy a multidimensional database or invest in an expensive front-end tool. In general, using a relational database for OLAP is achieved by some combination of the following techniques:
- Prejoining tables together: This is an obtuse way of saying that a denormalized table is created from the normalized online database. A large prejoin of several tables is sometimes called a fact table in a STAR schema.
- Presummarization: This prepares the data for any drill-down requests that may come from the end user. Essentially, the different levels of aggregation are identified, and the aggregate tables are computed and populated when the data is loaded. (A sketch of prejoining and presummarization follows the tool list below.)
- Massive denormalization: The side effect of very inexpensive disks has been a rethinking of the merits of Third Normal Form. Today, redundancy is widely accepted, as seen in the popularity of replication tools, snapshot utilities, and non-First Normal Form databases. If you can precreate every possible result table at load time, your end users will enjoy excellent response time when making queries. The STAR schema is an example of massive denormalization.
- Controlled periodic batch updating: New detail data is rolled into the aggregate tables on a periodic basis while the online system is down, with all summarizations recalculated as the new data is introduced into the database.

While data loading is important, it is only one part of building the warehouse. Several categories of tools are used to populate it, including:
- Data extraction tools: Extract data from the different hardware platforms and databases
- Metadata repository: Holds common definitions
- Data cleaning tools: Ensure uniform data quality
- Data sequencing tools: Enforce referential integrity (RI) rules for the warehouse
- Warehouse loading tools: Populate the data warehouse
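To make prejoining and presummarization concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer, product, and orders tables and every column name are invented for illustration; the point is only to show a denormalized fact table being built from normalized sources and an aggregate table being computed at load time.

```python
import sqlite3

# Hypothetical normalized OLTP tables: customer, product, and orders.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE product  (prod_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, cust_id INTEGER,
                           prod_id INTEGER, order_date TEXT, amount REAL);
""")

# Prejoining: build one denormalized fact table from the normalized sources.
con.execute("""
    CREATE TABLE fact_sales AS
    SELECT o.order_date, c.region, p.category, o.amount
    FROM   orders o
    JOIN   customer c ON c.cust_id = o.cust_id
    JOIN   product  p ON p.prod_id = o.prod_id
""")

# Presummarization: compute the aggregate table at load time so that
# rollup queries never have to scan the detail rows.
con.execute("""
    CREATE TABLE agg_sales_by_region_category AS
    SELECT region, category,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_cnt
    FROM   fact_sales
    GROUP  BY region, category
""")
con.commit()
```

Because the aggregate is built during the load window, end-user rollup queries by region and category can be answered without touching the detail rows.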
As we know, most data warehouses are loaded in batch mode after the online system has been shut down. In this sense, a data warehouse is bi-modal, with a highly intensive load window at night and an intensive read-only window during the day. Because many data warehouses collect data from nonrelational databases such as IMS or CA-IDMS, there is no standard method for extracting data to load into a warehouse. However, we do have a few common techniques for extracting and loading data:
- Log sniffing: Apply the archived redo logs from the OLTP system to the data warehouse.
- Update, insert, and delete triggers: Fire off a distributed update to the data warehouse.
- Snapshot logs: Update the replicated tables and change log files to populate the data warehouse.
- Nightly extract/load programs: Retrieve the operational data and load it into the warehouse (a minimal sketch follows this list).
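A nightly extract/load program of the kind just listed might look roughly like the sketch below. It assumes the hypothetical warehouse tables from the earlier sketch already exist, that the operational data sits in an assumed table called sales_detail, and that order_date is a usable high-water mark; none of these names come from a real system.

```python
import sqlite3

# Hypothetical connections: one to the operational source, one to the warehouse.
oltp = sqlite3.connect("oltp.db")
dw   = sqlite3.connect("warehouse.db")

# Extract: pull only the rows created since the previous load window.
last_load = dw.execute(
    "SELECT COALESCE(MAX(order_date), '1900-01-01') FROM fact_sales"
).fetchone()[0]
new_rows = oltp.execute(
    "SELECT order_date, region, category, amount"
    "  FROM sales_detail WHERE order_date > ?",
    (last_load,),
).fetchall()

# Load: append the new detail to the fact table.
dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", new_rows)

# Recalculate the summary table from scratch while the online system is down.
dw.execute("DELETE FROM agg_sales_by_region_category")
dw.execute("""
    INSERT INTO agg_sales_by_region_category
    SELECT region, category, SUM(amount), COUNT(*)
    FROM   fact_sales
    GROUP  BY region, category
""")
dw.commit()
```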
Several methods can be used to aggregate data within OLAP servers, as shown in Figure 10.4. One method extracts the detail data from the relational engine at query time and summarizes it for display. Another popular method preaggregates the data and keeps the summarized data ready for retrieval; a short sketch contrasting the two approaches follows the figure.
Figure 10.4 Aggregation and OLAP servers.
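The two approaches can be contrasted with a short sketch against the same hypothetical warehouse: the first query asks the relational engine to summarize the detail rows at request time, while the second simply reads the summary that was precomputed at load time.

```python
import sqlite3

dw = sqlite3.connect("warehouse.db")  # the hypothetical warehouse from the earlier sketches

# Method 1: aggregate at query time -- the server summarizes the detail
# rows in the relational engine every time a rollup is requested.
on_the_fly = dw.execute(
    "SELECT region, SUM(amount) FROM fact_sales GROUP BY region"
).fetchall()

# Method 2: preaggregation -- the summary was computed at load time, so the
# same answer is a cheap read of the much smaller aggregate table.
precomputed = dw.execute(
    "SELECT region, SUM(total_amount)"
    "  FROM agg_sales_by_region_category GROUP BY region"
).fetchall()
```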
Two major changes have occurred over the past several years that have driven the movement toward data warehousing:
- Disk space became inexpensive: One gigabyte of disk carried a price tag of $100,000 in 1988. Today, one gigabyte is less than $1,000. To support large data warehouses, it is not uncommon to require terabytes of disk storage.
- The movement into open systems: The migration away from centralized processors has led to data residing on a plethora of different computer and database architectures.
Since data is collected from a variety of sources, many warehouse installations find it necessary to create a metadata repository. What, then, is the role of a metadata repository? When data is consolidated from a variety of diverse systems, many cross-system problems can arise. These issues include the following (a toy repository sketch appears after the list):
- Homonyms: Different columns with the same name
- Synonyms: The same column with different names
- Unit incompatibilities: Inches versus centimeters, dollars versus yen
- Enterprise referential integrity: Business rules that span operational systems
- The warehouse design rules: Determining how the tables will be built, for example:
  - Horizontal denormalization (fat tables)
  - Vertical denormalization (chunking tables based on time periods)
- Using multidimensional front ends
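To illustrate what a metadata repository actually records, the toy sketch below maps source-system column names (synonyms) onto canonical warehouse columns and converts incompatible units at load time. The column names, units, and conversion factors are invented for the example.

```python
# Toy metadata repository: each canonical warehouse column records the
# source-system names that map onto it (synonyms) and the unit it is stored
# in, so unit incompatibilities can be converted during the load.
REPOSITORY = {
    "order_amount": {"synonyms": {"amt", "order_total"}, "unit": "USD"},
    "part_length":  {"synonyms": {"length", "part_len"}, "unit": "cm"},
}

# Conversion factors into the warehouse's canonical units (illustrative values only).
TO_CANONICAL = {("inches", "cm"): 2.54, ("yen", "USD"): 1 / 150.0}


def resolve(source_name: str, value: float, source_unit: str) -> tuple[str, float]:
    """Map a source column onto its canonical name and convert its unit."""
    for canonical, meta in REPOSITORY.items():
        if source_name == canonical or source_name in meta["synonyms"]:
            factor = TO_CANONICAL.get((source_unit, meta["unit"]), 1.0)
            return canonical, value * factor
    raise KeyError(f"{source_name!r} is not defined in the metadata repository")


print(resolve("part_len", 10.0, "inches"))   # ('part_length', 25.4)
```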
There are alternatives to using a pure multidimensional database (MDDB). One common approach is to insert a metadata server between the OLTP relational database and the query tool, as shown in Figure 10.5; a minimal sketch of the idea follows the figure.
Figure 10.5 Using metadata repositories for multidimensional databases.
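The sketch below illustrates the general idea of such a metadata layer; it is not a description of how any particular product works. A small mapping of dimension and measure names onto the star-schema columns is used to generate the SQL that the query tool would otherwise have to construct itself.

```python
# Toy "metadata server": dimension and measure names are mapped onto the
# warehouse's star-schema columns, and SQL is generated for the query tool.
DIMENSIONS = {"region": "c.region", "category": "p.category"}
MEASURES   = {"sales": "SUM(o.amount)"}


def dimensional_query(dims: list[str], measure: str) -> str:
    """Translate a (dimensions, measure) request into SQL over the star schema."""
    cols = ", ".join(DIMENSIONS[d] for d in dims)
    return (
        f"SELECT {cols}, {MEASURES[measure]} AS {measure} "
        "FROM orders o "
        "JOIN customer c ON c.cust_id = o.cust_id "
        "JOIN product  p ON p.prod_id = o.prod_id "
        f"GROUP BY {cols}"
    )


print(dimensional_query(["region", "category"], "sales"))
```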
Examples of this approach include:
- DSS Agent by MicroStrategy
- MetaCube by Stanford Technology Group
- HOLOS by Holistic Systems