Previous Table of Contents Next


Note that the variance analysis is done directly in the SQL statement. The above case displays California users whose usage has dropped by more than 5 percent (comparing January 1991 to January 1992).

But what if the user wants to compare one full year with another year? The table is structured for simple comparison of two specific month dates, but the SQL query could be modified slightly to aggregate the data, offering comparison of two ranges of dates.

The query shown in Listing 10.2 will aggregate all sales for an entire year and compare 1991 with 1992. Here we meet the request “show me all customers in California whose sales have dropped by more than 5 percent between 1991 and 1992.”

Listing 10.2 Aggregating sales for an entire year.

SELECT INTEGER
(((E.NUMBER_OF_SALES - S.NUMBER_OF_SALES) / S.NUMBER_OF_SALES) * 100) ,
E.CUSTOMER_NAME , E.CITY_NAME , E.ZIP , S.NUMBER_OF_SALES , E.NUMBER_OF_SALES

FROM DETAIL S , DETAIL E

WHERE
S.CUSTOMER_NAME = E.CUSTOMER_NAME
 AND
E.STATE_ABBR = 'CA'
 AND
substr(E.DATE_YYMM,1,2) = '91'
 AND
substr(S.DATE_YYMM,1,2) = '92'
 AND
E.NUMBER_OF_SALES < S.NUMBER_OF_SALES - (.05 * S.NUMBER_OF_SALES)

ORDER BY E.ZIP ASC , E.CITY_NAME ASC , 1 ;

On the surface, it appears that SQL can be used against two-dimensional tables to handle three-dimensional time-series problems. It also appears that SQL can be used to roll up aggregations at runtime, alleviating the need to do a roll up at load time, as with a traditional database. While this implementation does not require any special multidimensional databases, two important issues remain to be resolved:

  Performance—Joining a table against itself—especially when comparing ranges of dates—may create many levels of nesting in the SQL optimization and poor response time.
  Ability—No end user would be capable of formulating this type of sophisticated SQL query.

If one strips away all of the marketing hype and industry jargon, we see that data warehouse and a multidimensional database can be easily simulated by precreating many redundant tables, each with precalculated roll-up information. In fact, the base issue is clear. Complex aggregation with either needs to be computed at runtime—or when the data is loaded.

There are many different types of OLAP and MDDB products on the market today, as shown in Table 10.6. Each has its own relative advantages and disadvantages, and they are all fighting to achieve recognition for their strengths.

Table 10.6 OLAP/MDDB product information.
Vendor Tool Description
Oracle Express Excel spreadsheet extension, true OO
Oracle Oracle 7.3 STAR query hints, parallel query, bitmap indexes
MicroStrategy DSS Agent MDDB queries against RDBMS
D&B Pilot Lightship OLAP w/custom and spreadsheet GUI
IBI Focus Fusion MDDB engine
VMark uniVerse NT-based MDDB engine
Kenan Acumate ES MDDB with PC-OLAP GUI
Arbor OLAP Builder Extracts data from DW for Essbase
Arbor Essbase MDDB engine w/Excel spreadsheet GUI
Think Systems FYI Planner PC-GUI with MDDB and OLAP server

Data Mining And OLAP

The recent interest in data warehousing has created many new techniques and tools for getting useful information out of these behemoth databases. Data mining is one area that tends to hold a great deal of promise for users of data warehouses.

In traditional decision support systems, the user was charged with formulating the queries against the database and deciphering any trends that were present in the data. Unfortunately, this approach is only as good as the user of the system, and many statistically valid associations between data items can be missed. This is especially true in data warehouse systems where unobtrusive trends may be present. For example, the Psychology Department at the University of Minnesota—developers of the hugely popular Minnesota Multiphasic Personality Inventory (MMPI)—have discovered some startling patterns correlating a psychological diagnosis to seemingly unrelated, ordinary questions. The results provide unobtrusive measures for human personality. For example, they found that people with low self-esteem tend to prefer baths to showers. While no “reason” for this preference is obvious, the statistically valid correlation between self-concept and cleaning preferences remains.

These types of unobtrusive trends also plague the business world, and it is the goal of data mining software to identify these trends for the users of the warehouse. In addition to simply identifying trends, some data mining software goes one step further in an attempt to analyze other data and determine the underlying reasons for the trend.

While basic statistical tools are adequate for doing correlations between a small number of related variables, large databases with hundreds of data items are quickly bogged down in a mire of multivariate “chi-square” techniques that are hard to follow for even the most experienced statistician. As such, the new data mining tools are meant to accept only general hints from the users, and then go forth into the data probing for trends.

In other cases, data mining techniques are used to prove a hypothesis based on existing data. For example, a marketing person may speculate that those with an income between $50,000 and $80,000 are likely to buy his or her product. A quick verification of this hypothesis can be run, thereby either confirming or disproving the hypothesis.


Previous Table of Contents Next