Previous | Table of Contents | Next |
In a hierarchical, or CODASYL (Network), database, it is possible to define and implement a database design that contains absolutely no redundant information (such as pure Third Normal Form or 3NF). Hierarchical and network databases can be truly free of redundant information because all data relationships are represented through pointers and not through duplicated foreign keys. Because object-oriented systems use pointers to establish data relationships, many object-oriented systems can also be designed totally free of redundant data. Since the elimination of redundancy requires embedded pointers to establish the data relationships, no relational database can ever be totally free of redundant data.
An Oracle database with either one-to-many or many-to-many relationships has redundant foreign keys embedded in the tables to establish the logical relationships. Redundant duplication of foreign keys in the subordinate tables creates the data relationships, making it possible to join tables together and relate the contents of the data items in the tables.
As the size of the database increases, redundancy can become a major problem. Today, many users create very large databases, many of which contain trillions of bytes. For databases of this size, a single table can contain more than a billion rows, and the introduction of a single new column to a table can represent thousands of dollars in additional disk expense. Data redundancy is detrimental for two reasons. First and foremost, duplicating the redundant material consumes disk storage space. The second and most ominous reason is that updating redundant data requires extra processing. Redundant duplication of very large and highly volatile data items can cause huge processing bottlenecks.
However, this does not imply that redundancy is always undesirable. Performance is still an overriding factor in most systems. Proper control of redundant information implies that redundant information may be introduced into any structure so long as the performance improvements outweigh the additional disk costs and update problems.
Since the first publication of Dr. E. F. Codds 1993 research paper Providing OLAP (Online Analytical Processing) to User-Analysts: An IT Mandate, database designers have attempted to find an optimum way of structuring tables for low data redundancy. Codds rules of normalization guide the designer to create a logically correct table structure with no redundancy, but performance rules often dictate the introduction of duplicated data to improve performance.
This is especially true for distributed Oracle databases. Any node in a distributed database may want to browse a list of customers at other nodes without establishing a connection to that node. The technological problems inherent in the two-phase commit necessitate widespread replication of entire tables or selected columns from tables. However, the distributed database designer does not have free reign to introduce redundancy anywhere in the enterprise. Redundancy always has a price tag, whether it is the cost of the disk storage or the cost of maintaining a parallel update scheme. Figure 1.8 shows a strategy for analyzing the consequences of data redundancy.
Figure 1.8 A comparison of size versus volatility for redundant data.
In Figure 1.8, a boundary line lies within a range between the size of a redundant data item and the frequency of update of the data item. The size of the data item relates to the disk costs associated with storing the item and the frequency of update is associated with the cost of keeping the redundant data current, whether by replication techniques or by two-phase commit updates. Because the relative costs are different for each hardware configuration and for each application, this boundary may be quite different depending on the type of application. The rapid decrease in the disk storage costs designates that the size boundary is only important for very large-scale redundancy. A large, frequently changing item (for example, street_address) is not a good candidate for redundancy. But large static items (for example, service_history) or small, frequently changing items (for example, product_price) are acceptable for redundancy. Small static items (for example, gender) represent ideal candidates for redundant duplication.
Denormalizing One-To-Many Data Relationships
One-to-many relationships exist in many real-world situations. Many entities that possess one-to-many relationships can be removed from the data model, eliminating some join operations. The basic principle here is simple: Redundant information avoids expensive SQL joins and yields faster processing. But remember, we must deal with the issue of additional disk storage and the problems associated with updating the redundant data. For example, consider the entity/relation (E/R) model shown in Figure 1.9, where we see that the structure is in pure Third Normal Form. Note that the CITY and STATE tables exist because each state has many cities and each city has many customers. This model works for most transactions on an online transaction processing (OLTP) system. However, this high degree of normalization would require the joining of the CITY and STATE tables each time that address information is requested, forcing some SQL requests to perform very slowly.
Figure 1.9 A fully normalized E/R model sales database.
Consider a query to display the state_bird for all orders that have been placed for BIRDSEED. This is a cumbersome query that requires the joining of six tables:
SELECT state_bird FROM STATE, CITY, CUSTOMER, ORDER, QUANTITY, ITEM WHERE item_name = 'BIRDSEED' AND ITEM.item.nbr = QUANTITY.item.nbr AND QUANTITY.order_nbr = ORDER.order_nbr AND ORDER.cust_nbr = CUSTOMER.cust_nbr AND CUSTOMER.cust_city = CITY.cust_city AND CITY.state_name = STATE.state_name;
With Oracle and the rule-based optimizer, this type of complex join guarantees that at least one table is read front to back using a full-table scan. This is a shortcoming of Oracles rule-based optimizer, since an SQL optimizer should always avoid a full-table scan whenever indexes are presentand full-table scans are very expensive. This situation might be avoided by using Oracle hints with the cost-based optimizer to determine the optimal path to this data. A hint is an extension of Oracles SQL that directs the SQL optimizer to change its normal access path. For more detailed information on optimizing full-table scans, see Chapter 6, Oracle DBA Performance And Tuning. For more information on hints, refer to Chapter 4, Tuning Oracle SQL.
Previous | Table of Contents | Next |