

When we contrast the data warehouse with a transaction-oriented, online system, the differences become apparent. These differences are shown in Table 10.1.

Table 10.1 Differences between OLTP and data warehouse.
                                 OLTP            Data Warehouse
  Normalization                  High (3NF)      Low (1NF)
  Table sizes                    Small           Large
  Number of rows/table           Small           Large
  Size/duration of transactions  Small           Large
  Number of online users         High (1,000s)   Low (< 100)
  Updates                        Frequent        Nightly
  Full-table scans               Rarely          Frequently
  Historical data                < 90 days       Years

Despite these differences, many developers are using relational databases to build their data warehouses, relying on design techniques that simulate multiple dimensions. This push toward STAR schema design has been fairly successful, especially since designers do not have to buy a multidimensional database or invest in an expensive front-end tool. In general, using a relational database for OLAP is achieved by some combination of the following techniques:

  Prejoining tables together—This is an obtuse way of saying that a denormalized table is created from the normalized online database. A large prejoin of several tables is sometimes called a fact table in a STAR schema. (A short sketch of prejoining and presummarization follows the tool list below.)
  Presummarization—This prepares the data for any drill-down requests that may come from the end user. Essentially, the different levels of aggregation are identified, and the aggregate tables are computed and populated when the data is loaded.
  Massive denormalization—The side effect of very inexpensive disks has been the rethinking of the merits of Third Normal Form. Today, redundancy is widely accepted, as seen by the popularity of replication tools, snapshot utilities, and non–First Normal Form databases. If you can precreate every possible result table at load time, your end user will enjoy excellent response time when making queries. The STAR schema is an example of massive denormalization.
  Controlled periodic batch updating—New detail data is rolled into the aggregate tables on a periodic basis while the online system is down, with all summarizations recalculated as the new data is introduced into the database.

While data loading is important, it is only one part of populating a warehouse. Several categories of tools are used to populate the warehouse, including:
  Data extraction tools—Pull data from different hardware platforms and databases
  Metadata repository—Holds the common definitions for the consolidated data
  Data cleaning tools—Ensure uniform data quality before loading
  Data sequencing tools—Enforce referential integrity (RI) rules for the warehouse
  Warehouse loading tools—Populate the data warehouse
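
To make the prejoining and presummarization techniques concrete, here is a minimal sketch. It assumes normalized CUSTOMER and ORDERS source tables and uses an in-memory SQLite database to stand in for whatever relational engine the warehouse actually runs on; the table and column names (sales_fact, sales_by_region, and so on) are invented for illustration.

    import sqlite3

    # An in-memory SQLite database stands in for the warehouse engine (illustration only).
    db = sqlite3.connect(":memory:")

    db.executescript("""
        -- Normalized source tables, as they might look in the OLTP system.
        CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT, region TEXT);
        CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, cust_id INTEGER,
                               order_date TEXT, amount REAL);
        INSERT INTO customer VALUES (1, 'Acme', 'EAST'), (2, 'Zenith', 'WEST');
        INSERT INTO orders   VALUES (10, 1, '1997-01-05', 100.0),
                                    (11, 2, '1997-01-06', 250.0);

        -- Prejoin: a denormalized fact table built from the normalized tables.
        CREATE TABLE sales_fact AS
            SELECT o.order_id, o.order_date, o.amount, c.cust_id, c.cust_name, c.region
            FROM orders o JOIN customer c ON c.cust_id = o.cust_id;

        -- Presummarization: an aggregate table computed at load time so that
        -- summary queries never have to touch the detail rows.
        CREATE TABLE sales_by_region AS
            SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_sales
            FROM sales_fact
            GROUP BY region;
    """)

    print(db.execute("SELECT * FROM sales_by_region").fetchall())

Because sales_by_region is computed when the data is loaded, an end-user summary query against it never touches the detail rows.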

As we know, most data warehouses are loaded in batch mode after the online system has been shut down. In this sense, a data warehouse is bimodal, with an intensive load window at night and an intensive read-only window during the day. Because many data warehouses collect data from nonrelational databases such as IMS or CA-IDMS, there is no standard method for extracting data for loading into the warehouse. However, we do have a few common techniques for extracting and loading data:

  Log “sniffing”—Apply the archived redo logs from the OLTP system to the data warehouse.
  Using update, insert, and delete triggers—Fire off a distributed update to the data warehouse.
  Use snapshot logs to populate the data warehouse—Update the replicated tables and change log files.
  Run nightly extract/load programs—Retrieve the operational data and load it into the warehouse (a sketch of this approach follows the list).
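
As one illustration of the nightly extract/load approach, the sketch below copies only the operational rows added since the previous load. The orders and sales_fact table names and the use of order_date as a load bookmark are assumptions for the example; a production load would typically use the vendor's bulk loader rather than row-at-a-time inserts.

    import sqlite3

    def nightly_load(oltp: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
        """Copy operational rows created since the last load into the warehouse.

        Assumes an OLTP table named orders and a warehouse table named sales_fact;
        both names are hypothetical.
        """
        # Bookmark: the newest order_date already present in the warehouse.
        (last_load,) = warehouse.execute(
            "SELECT COALESCE(MAX(order_date), '1900-01-01') FROM sales_fact"
        ).fetchone()

        # Extract: read only the rows the OLTP system added since the bookmark.
        rows = oltp.execute(
            "SELECT order_id, order_date, amount FROM orders WHERE order_date > ?",
            (last_load,),
        ).fetchall()

        # Load: append the new detail rows; aggregate tables such as
        # sales_by_region would be recalculated after this step.
        warehouse.executemany(
            "INSERT INTO sales_fact (order_id, order_date, amount) VALUES (?, ?, ?)",
            rows,
        )
        warehouse.commit()
        return len(rows)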

There are several methods for aggregating data within OLAP servers, as shown in Figure 10.4. One method extracts data from the relational engine and summarizes it on demand for display. Another popular method preaggregates the data at load time and keeps the summarized data ready for retrieval.


Figure 10.4  Aggregation and OLAP servers.
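
A rough way to picture the two approaches in Figure 10.4 is a query routine that either reads a summary table populated at load time or asks the relational engine to summarize the detail rows on demand. The table names carry over from the earlier sketch and remain hypothetical.

    import sqlite3

    def region_totals(conn: sqlite3.Connection, use_preaggregate: bool = True):
        """Return (region, total_sales) pairs using either aggregation style."""
        if use_preaggregate:
            # Preaggregated: the summary was computed when the data was loaded.
            sql = "SELECT region, total_sales FROM sales_by_region"
        else:
            # On demand: the relational engine summarizes the detail rows at query time.
            sql = "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"
        return conn.execute(sql).fetchall()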

Two major changes have occurred over the past several years that have driven the movement toward data warehousing:

  Disk space became inexpensive—One gigabyte of disk carried a price tag of $100,000 in 1988. Today, one gigabyte is less than $1,000. To support large data warehouses, it is not uncommon to require terabytes of disk storage.
  The movement into open systems—The migration away from centralized processors has led to data residing on a plethora of different computer and database architectures.

Since data is collected from a variety of sources, many warehouse installations find it necessary to create a metadata repository. But what is the role of a metadata repository? When data is consolidated from many diverse systems, a number of cross-system problems can arise. These issues include:

  Homonyms—Different columns with the same name
  Synonyms—The same column with different names
  Unit incompatibilities—Inches versus centimeters, dollars versus yen (a small mapping sketch follows this list)
  Enterprise referential integrity—Business rules that span operational systems
  The warehouse design rules—Determining how the tables will be built
  Horizontal denormalization (fat tables)
  Vertical denormalization (chunking tables based on time periods)
  Using multidimensional front ends
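
As a small illustration of how a metadata repository can resolve synonyms and unit incompatibilities, the sketch below maps each source system's column names and units onto a single warehouse definition. The system names, column names, and conversion factor are invented for the example.

    # Hypothetical repository entries: (source system, source column) maps onto
    # one warehouse column plus a unit conversion.
    METADATA = {
        ("order_entry", "cust_no"):   ("customer_id", lambda v: v),         # synonym
        ("billing",     "customer"):  ("customer_id", lambda v: v),         # synonym
        ("order_entry", "length_in"): ("length_cm",   lambda v: v * 2.54),  # unit fix
    }

    def standardize(system: str, row: dict) -> dict:
        """Rename columns and convert units according to the repository."""
        clean = {}
        for column, value in row.items():
            target, convert = METADATA.get((system, column), (column, lambda v: v))
            clean[target] = convert(value)
        return clean

    # The same customer key arrives under two different names and is unified.
    print(standardize("order_entry", {"cust_no": 42, "length_in": 10}))
    print(standardize("billing", {"customer": 42}))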

There are alternatives to using a “pure” multidimensional database (MDDB). One common approach is to insert a metadata server between the OLTP relational database and the query tool, as shown in Figure 10.5.


Figure 10.5  Using metadata repositories for multidimensional databases.

Examples of this approach include:

  DSS Agent by MicroStrategy
  MetaCube by Stanford Technology Group
  HOLOS by Holistic Systems

