

Yet another class of tools uses a relatively straightforward exception-detection mechanism to cruise the database looking for unexpected trends or unusual patterns. Many data mining tools use techniques borrowed from Artificial Intelligence (AI), including fuzzy logic, neural networks, fractals, and sundry other statistical techniques. Because these tools perform a huge amount of internal processing, many of them read selected information from the relational database into a proprietary, internal data representation for analysis. No widely used data mining tools run directly against the relational database, although there are several promising start-up companies, as shown in Table 10.7.

Table 10.7 Data mining product information.
Vendor               Tool                  Description
Thinking Machines    Darwin                Neural nets
MIT GmbH             DataEngine            Fuzzy logic
Reduct Systems       DataLogic             Fuzzy sets
IBM                  Data Mining Toolkit   Fuzzy logic
Epsilon              Epsilon               Rule-based
Cross/Z              F-DBMS                Fractals
Info. Discovery      IDIS                  Rule-based
Info. Harvester      InfoHarvester         Rule-based
Angoss Software      KnowledgeSEEKER       Rule-based
Software AG          NETMAP                Neural networks
NeuralWare           NeuralWorks           Neural nets
Nestor               PRISM                 Neural nets
Cognitive Systems    ReMind                Inductive logic

While there is still a great deal of interest in data mining applications, no single vendor has stepped up to claim market leadership. It will probably be many years before most data warehouse owners have tools that can fully exploit their data resources.

An Illustration Of STAR Schema Design

Let’s assume that Moncato State Savings & Loan uses the E/R model shown in Figure 10.10 for its database. Using this E/R model, perform the following tasks. State any assumptions that you may have to make.


Figure 10.10  A non-STAR schema for Moncato State Savings & Loan.

Part 1

1.  Write an SQL query to display the cust_name for all customers who have had a transaction of more than $5,000 at the Boston branch.
2.  Starting with this E/R model, denormalize the structure into a STAR schema.
3.  How many “dimensions” are present in this system?
4.  Rewrite the query from Question 1 against your STAR schema.

Part 2

1.  Using the size estimates, determine the space required for the original schema.
2.  How much additional storage is required for your STAR schema?
3.  Now, assuming that each day’s transactions are “rolled-up” into the account_balance field, design a new STAR schema and give sizing estimates for a database that summarizes each day’s transactions.
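As a reminder before working the exercises, a STAR schema consists of a central fact table joined to several dimension tables. The following sketch shows the general shape in Oracle SQL; the table and column names here are hypothetical illustrations, not the solution to the exercises:

-- Illustrative STAR schema skeleton (hypothetical names, not the
-- exercise solution): one central fact table, keyed to dimensions.
CREATE TABLE transaction_fact (
    cust_key      NUMBER,         -- foreign key to customer_dim
    branch_key    NUMBER,         -- foreign key to branch_dim
    date_key      NUMBER,         -- foreign key to date_dim
    trans_amount  NUMBER(12,2)
);

CREATE TABLE customer_dim (
    cust_key      NUMBER PRIMARY KEY,
    cust_name     VARCHAR2(80)
);

Each dimension table follows the same pattern as customer_dim: a single-column key referenced by the fact table, plus the descriptive attributes for that dimension.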

Using Oracle 7.3’s Features With Data Warehouses

Please note that the new features of Oracle 7.2 and Oracle 7.3 will not be activated unless the following init.ora parameter has been set:

COMPATIBLE=7.3.0.0.0

With Oracle version 7.3, several new features can dramatically improve performance of Oracle data warehouse and decision support systems.

Parallel Query For Data Warehouses

It is a common misconception that parallel processors (SMP or MPP) are required to benefit from parallel processing. Even on a single processor, multiple processes can be used to speed up queries. Data warehouses generally employ parallel technology for both warehouse loading and query functions. These include:

  Parallel backup/recovery—Some parallel tools are capable of rates in excess of 40 GB/hour.
  Parallel query (SMP and MPP)—Multiple processes are used to retrieve table data.
  Parallel loading—Multiple processes are used to simultaneously load many tables.
  Parallel indexing—Multiple processes are used to create indexes.
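As a small illustration of parallel query on a single table, Oracle's PARALLEL hint can request multiple query server processes for a full-table scan. A minimal sketch, assuming a hypothetical sales table (the table, alias, column names, and degree of four are all illustrative):

-- Hypothetical example: request a parallel full-table scan of the
-- sales table using four query server processes.
SELECT /*+ FULL(s) PARALLEL(s, 4) */
       customer_name, sales_amount
FROM   sales s
WHERE  sales_amount > 5000;

The FULL hint forces a full-table scan (parallel query cannot parallelize an index range scan in this release), and the second argument to PARALLEL sets the degree of parallelism.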

For parallel query, the most powerful approach involves the SQL UNION verb in very large databases (VLDBs). In most very large Oracle data warehouses, it is common to logically partition a single table into many smaller tables in order to improve query throughput. For example, a sales table ordered by date_of_sale might be partitioned into SALES_1997, SALES_1998, and SALES_1999 tables. (Oracle table names must begin with a letter, so a name such as 1997_SALES would be invalid.) This approach is often used in data warehouse applications where a single logical table might have millions of rows. While this splitting of a table according to a key value violates normalization, it can dramatically improve performance for individual queries. For large queries that span many logical tables, the isolated tables can easily be reassembled using Oracle’s parallel query facility, as shown here:

CREATE VIEW all_sales AS
    SELECT * FROM sales_1997
    UNION ALL
    SELECT * FROM sales_1998
    UNION ALL
    SELECT * FROM sales_1999;

We can now query the all_sales view as if it were a single database table. Oracle parallel query will automatically recognize the UNION ALL operator, firing off simultaneous queries against each of the three base tables. For example, the following query will assemble the requested data from the three tables in parallel, with each subquery separately optimized. The result set from each subquery is then merged by the query manager:

SELECT customer_name
FROM   all_sales
WHERE  sales_amount > 5000;
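The base tables behind such a view can also be given a default degree of parallelism at the table level, so that every query against them runs in parallel without a hint. A minimal sketch, assuming base tables named sales_1997 through sales_1999 (Oracle identifiers must begin with a letter) and an illustrative degree of four:

-- Hypothetical example: assign a default degree of parallelism to each
-- base table, so scans against the view's branches run in parallel.
ALTER TABLE sales_1997 PARALLEL (DEGREE 4);
ALTER TABLE sales_1998 PARALLEL (DEGREE 4);
ALTER TABLE sales_1999 PARALLEL (DEGREE 4);

The appropriate degree depends on the number of CPUs and disk devices available; setting it higher than the hardware can service will simply queue the query server processes.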

For more details on using Oracle’s parallel query facility, refer to Chapter 2, Physical Performance Design For Oracle Databases.

