Oracle7 Server Distributed Systems Volume II: Replicated Data
Survivability
Survivability is the capability to continue running applications despite system or site failures. It allows applications to run on a fail-over system, accessing the same, or very nearly the same, data that they were accessing on the primary system when it failed. As shown in Figure 8 - 1, the Oracle Server provides two different technologies for accomplishing survivability: the Oracle Parallel Server and the symmetric replication facility.
Figure 8 - 1. Survivability Methods: Symmetric Replication vs. Parallel Server
Oracle Parallel Server versus Symmetric Replication
The Oracle Parallel Server supports fail-over to surviving systems when a system supporting an instance of the Oracle Server fails. The Oracle Parallel Server requires a cluster or massively parallel hardware platform, and thus is applicable for protection against processor system failures in the local environment where the cluster or massively parallel system is running.
In these environments, the Oracle Parallel Server is the ideal solution for survivability -- supporting high transaction volumes with no lost transactions or data inconsistencies in the event of an instance failure. If an instance fails, a surviving instance of the Oracle Parallel Server automatically recovers any incomplete transactions. Applications running on the failed system can execute on the fail-over system, accessing all of the data in the database.
The Oracle Parallel Server does not, however, provide survivability for site failures (such as flood, fire, or sabotage) that render an entire site, and thus the entire cluster or massively parallel system, inoperable. To provide survivability for site failures, you can use the symmetric replication facility to maintain a replicate of a database at a geographically remote location.
Should the local system fail, the application can continue to execute at the remote site. Symmetric replication, however, cannot guarantee that no transactions will be lost. Also, special care must be taken to prevent data inconsistencies when the primary site is recovered.
Designing for Survivability
If you choose to use the symmetric replication facility for survivability, you should consider the following issues:
- The symmetric replication facility must be able to keep up with the transaction volume of the primary system. The volume that replication can sustain is application specific, but it is generally much lower than the throughput supported by the Oracle Parallel Server.
- If a failure occurs at the primary site, recently committed transactions at the primary site may not yet have been propagated to the fail-over site, because propagation is asynchronous. These transactions will appear to be lost.
- These "lost" transactions must be dealt with when the primary site is recovered.
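The second and third issues above can be illustrated with a small simulation. This is a minimal sketch in Python, not Oracle code: the `Site` and `PrimarySite` classes, the `commit` and `push` methods, and the transaction labels are all hypothetical names chosen for illustration. It assumes only that committed changes wait in a queue at the primary site until they are pushed to the fail-over site.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Site:
    """A replicated site (hypothetical model, not an Oracle structure)."""
    name: str
    applied: list = field(default_factory=list)  # transactions visible here


@dataclass
class PrimarySite(Site):
    # Transactions committed locally but not yet pushed to the fail-over site.
    deferred: deque = field(default_factory=deque)

    def commit(self, txn: str) -> None:
        """Commit locally and queue the change for later propagation."""
        self.applied.append(txn)
        self.deferred.append(txn)

    def push(self, target: Site, limit: Optional[int] = None) -> int:
        """Propagate up to `limit` queued transactions to `target`."""
        pushed = 0
        while self.deferred and (limit is None or pushed < limit):
            target.applied.append(self.deferred.popleft())
            pushed += 1
        return pushed


primary = PrimarySite("primary")
failover = Site("failover")

# Two transactions are committed and propagated normally.
primary.commit("order-100")
primary.commit("order-101")
primary.push(failover)

# Two more are committed, but the primary site fails before the next push.
primary.commit("new-order-102")
primary.commit("cancel-order-100")

# Whatever is still queued at the primary is invisible at the fail-over site.
lost = list(primary.deferred)
print("visible at fail-over:", failover.applied)
print("appear lost:", lost)
```

The queued transactions are not destroyed by the failure in this model. They remain at the primary site and resurface when it is recovered, which is why they must be dealt with at that point.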
Suppose, for example, you are running an order-entry system that uses replication to maintain a remote fail-over order-entry system, and the primary system fails.
At the time of the failure, there were two transactions recently executed at the primary site that did not have their changes propagated and applied at the fail-over site. The first of these was a transaction that entered a new order, and the second was a transaction that canceled an existing order.
In the first case, someone may notice the absence of the new order when processing continues on the fail-over system, and re-enter it. In the second case, the cancellation of the order may not be noticed, and processing of the order may proceed; that is, the canceled item may be shipped and the customer billed.
What happens now, when you restore the primary site? If you simply push all of the changes executed on the fail-over system back to the primary system, you will encounter conflicts.
Specifically, there will be duplicate orders for the item originally ordered at the primary system just before it failed. Additionally, there will be data changes resulting from the transactions to ship and bill the order that was originally canceled on the primary system.
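The order-entry scenario can be traced with a short Python sketch. The order numbers, customer and item names, and the `push_back_naively` function are assumptions made up for this illustration; the sketch models rows as dictionaries and does not represent how the replication facility stores or applies changes. It shows only what the primary site's data looks like if every fail-over change is applied without any conflict checking.

```python
import copy


def push_back_naively(primary, failover_changes):
    """Apply every fail-over change to a copy of the primary, with no checks."""
    restored = copy.deepcopy(primary)
    for order_id, fields in failover_changes:
        restored.setdefault(order_id, {}).update(fields)
    return restored


# State of the primary site when it failed (hypothetical data).
# Order 100 was canceled and order 102 was entered just before the failure;
# neither change reached the fail-over site.
primary_at_failure = {
    100: {"customer": "Acme", "item": "widget", "status": "canceled"},
    102: {"customer": "Bolt", "item": "gear", "status": "open"},
}

# Work done at the fail-over site while the primary was down.
failover_changes = [
    # Someone noticed the missing order and re-entered it under a new number.
    (103, {"customer": "Bolt", "item": "gear", "status": "open"}),
    # The cancellation was never seen, so order 100 was shipped and billed.
    (100, {"status": "shipped"}),
    (100, {"status": "billed"}),
]

restored = push_back_naively(primary_at_failure, failover_changes)

duplicates = sorted(
    oid for oid, o in restored.items()
    if (o["customer"], o["item"]) == ("Bolt", "gear")
)
print("orders for the same item:", duplicates)
print("status of the canceled order:", restored[100]["status"])
```

In this model the re-entered order carries a different order number from the original, so a comparison of key values alone would not flag it. The duplicate is visible only to logic that understands what an order means to the business.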
You must carefully design your system, as described in the next section, to deal with these situations.
Implementing a Survivable System
Oracle's symmetric replication facility can be used to provide survivability against site failures by using multiple replicated master sites. You must configure your system using one of the following methods. These methods are listed in order of increasing implementation difficulty.
- The fail-over site is used for read access only. That is, no updates are allowed at the fail-over site, even when the primary site fails.
- After a failure, the primary site is restored from the fail-over site using export/import or a full backup.
- Full conflict resolution is employed for all data and transactions. This requires careful design and implementation. You must ensure proper resolution of conflicts that can occur when the primary site is restored, such as duplicate transactions.
- You provide your own application-level routines or procedures to deal with the inconsistencies that occur when the primary site is restored and the queued transactions from the active fail-over system are propagated and applied to the primary site.
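As an illustration of the last method, the following Python sketch shows one possible shape for an application-level reconciliation routine. It is a sketch under stated assumptions, not a prescribed implementation: the `Reconciliation` class, the `reconcile` function, the record layout, and the two business rules (treat an insert for the same customer and item as a duplicate, and send any update to a canceled order to manual review) are all hypothetical. A real system would need rules that match its own data.

```python
from dataclasses import dataclass, field


@dataclass
class Reconciliation:
    """Outcome of examining the queued fail-over changes (hypothetical)."""
    apply: list = field(default_factory=list)    # safe to apply at the primary
    discard: list = field(default_factory=list)  # duplicates of primary orders
    review: list = field(default_factory=list)   # need a human decision


def reconcile(primary_orders, failover_changes):
    """Sort each queued fail-over change into apply, discard, or review."""
    result = Reconciliation()
    for change in failover_changes:
        if change["kind"] == "insert":
            key = (change["customer"], change["item"])
            is_duplicate = any(
                (o["customer"], o["item"]) == key
                for o in primary_orders.values()
            )
            (result.discard if is_duplicate else result.apply).append(change)
        elif change["kind"] == "update":
            current = primary_orders.get(change["order_id"])
            if current is not None and current["status"] == "canceled":
                # The order was canceled at the primary but processed at the
                # fail-over site; no automatic rule can undo a shipment.
                result.review.append(change)
            else:
                result.apply.append(change)
    return result


primary_orders = {
    100: {"customer": "Acme", "item": "widget", "status": "canceled"},
    101: {"customer": "Cobb", "item": "bolt", "status": "open"},
    102: {"customer": "Bolt", "item": "gear", "status": "open"},
}
failover_changes = [
    {"kind": "insert", "order_id": 103, "customer": "Bolt", "item": "gear"},
    {"kind": "update", "order_id": 100, "status": "shipped"},
    {"kind": "update", "order_id": 101, "status": "shipped"},
]

outcome = reconcile(primary_orders, failover_changes)
print("apply:", [c["order_id"] for c in outcome.apply])
print("discard:", [c["order_id"] for c in outcome.discard])
print("review:", [c["order_id"] for c in outcome.review])
```

Routing the shipped-but-canceled order to a review list rather than resolving it automatically reflects the fact that the shipment and billing have already happened in the real world; the database can record a decision about them but cannot reverse them.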