Oracle and resilvering on AIX
Oracle and resilvering data on Mirrored Logical Volumes on AIX
How to set up your storage with regard to Mirror Write Consistency
Suppose you run, or plan to implement, an Oracle RAC database on AIX using raw logical volumes and logical volume mirroring (this may sound rather exotic, but there are businesses out there that run exactly this). An architecture like this raises a number of points to consider, Mirror Write Consistency (MWC) being one of them.
This article will explain how to set up your storage architecture taking into account the considerations and recommendations given by both Oracle and IBM about mirroring and MWC.
Mirroring in AIX
In AIX, when running RAC (whether on 9i or on 10g) with raw logical volumes, you need HACMP installed and configured in order to have concurrent access to your storage. To increase availability on the storage side, you may want to introduce hardware mirroring (at the storage system level, e.g. RAID 1), but sometimes this is not enough. You may then need Logical Volume Mirroring across multiple storage systems, so that you have copies of your datafiles in multiple locations. To do this, you create a Volume Group (VG) whose disks (physical volumes) are spread across multiple (e.g. 2) storage systems. You then create Logical Volumes (LVs) in this VG and define each LV to span multiple Physical Volumes, mirrored across the storage locations.
To keep your mirror in sync after a crash, the definition of a mirrored LV has an option to enable or disable Mirror Write Consistency.
Why would you want to enable or disable MWC? To answer this question, you first have to understand what MWC is.
Mirror Write Consistency I
When you have a mirrored LV, AIX provides a caching mechanism that keeps track of the last 128 writes done to that LV, so that you can recover as quickly as possible after a crash that might have corrupted your mirror. After a crash, the system checks whether the mirror is in sync. If it is not, the system uses the MWC cache to redo the last writes and thereby brings the mirror back in sync. If you do not have MWC enabled, you have to sync the mirror manually by issuing the syncvg command. Under normal circumstances, all mirrors in the VG must be synced before you can vary on (i.e. enable) the VG. When you have many LVs defined this can become time-consuming, because a complete synchronization cycle is needed: everything is synchronized, not only the out-of-sync data.
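To make the idea concrete, here is a toy Python model of a two-copy mirror with a write-tracking cache. It is a sketch under stated assumptions and not how AIX LVM is implemented: the class name `MirroredLV`, the block-level granularity and the `crash_after_first_copy` flag are all invented for illustration. The cache size of 128 is the figure given above.

```python
from collections import OrderedDict


class MirroredLV:
    """Toy model of a two-copy mirrored LV with an MWC-style write cache."""

    def __init__(self, nblocks, cache_size=128):
        self.copies = [[0] * nblocks, [0] * nblocks]
        self.cache_size = cache_size
        self.mwc = OrderedDict()  # block numbers of the most recent writes

    def write(self, block, value, crash_after_first_copy=False):
        # Record the block in the cache before touching either copy.
        self.mwc.pop(block, None)
        self.mwc[block] = True
        if len(self.mwc) > self.cache_size:
            self.mwc.popitem(last=False)  # forget the oldest entry
        self.copies[0][block] = value
        if crash_after_first_copy:
            return  # simulated crash: the second copy was never written
        self.copies[1][block] = value

    def recover_with_mwc(self):
        """Resync only the blocks named in the cache; return how many were visited."""
        for block in self.mwc:
            self.copies[1][block] = self.copies[0][block]
        return len(self.mwc)

    def in_sync(self):
        return self.copies[0] == self.copies[1]


lv = MirroredLV(nblocks=10_000)
for b in range(500):
    lv.write(b, b + 1)
lv.write(42, 999, crash_after_first_copy=True)  # mirror is now out of sync

print(lv.in_sync())           # False
print(lv.recover_with_mwc())  # 128
print(lv.in_sync())           # True
```

In this model, recovery visits 128 blocks instead of all 10,000, which is the whole point of the cache: without it, the only safe option is the full synchronization pass described above.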
IBM tells you to disable MWC when multiple nodes have concurrent access to a Volume Group. More on that later.
Oracle’s way of (crash) recovery: Resilvering
When running Oracle RAC with mirrored raw logical volumes on AIX, MWC must be disabled for all of your datafiles (specified by IBM).
When running on raw Logical Volumes, Oracle has control over the raw logical volumes (the LVs are the oracle datafiles!). Oracle has a mechanism to overcome inconsistencies in a mirror after a crash.
This mechanism is called "resilvering".
What is resilvering? The following excerpt from the Oracle Administrator's Reference Guide explains quite well what it does.
(It is taken from the 9i Administrator's Reference, but I am fairly sure it still holds for 10g.)
If you disable mirror write consistency (MWC) for an Oracle datafile allocated on a raw logical volume (LV), the Oracle9i crash recovery process uses resilvering to recover after a system crash.
This resilvering process prevents database inconsistencies or corruption. During crash recovery, if a datafile is allocated on a logical volume with more than one copy, the resilvering process performs a checksum on the data blocks of all of the copies. It then performs one of the following:
1. If the data blocks in a copy have valid checksums, the resilvering process uses that copy to update the copies that have invalid checksums.
2. If all copies have blocks with invalid checksums, the resilvering process rebuilds the blocks using information from the redo log file. It then writes the datafile to the logical volume and updates all of the copies.
On AIX, the resilvering process works only for datafiles allocated on raw logical volumes for which MWC is disabled. Resilvering is not required for datafiles on mirrored logical volumes with MWC enabled, because MWC ensures that all copies are synchronized. If the system crashes while you are upgrading a previous release of Oracle9i that used datafiles on logical volumes for which MWC was disabled, enter the syncvg command to synchronize the mirrored LV before starting the Oracle server. If you do not synchronize the mirrored LV before starting the server, Oracle might read incorrect data from an LV copy.
Note: If a disk drive fails, resilvering does not occur. You must enter the syncvg command before you can reactivate the LV.
Caution: Oracle Corporation supports resilvering for data files only. Do not disable MWC for redo log files.
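The two cases in the excerpt can be sketched in a few lines of Python. This is only an illustration of the decision logic, not Oracle's implementation. It assumes CRC32 (via `zlib`) as a stand-in for Oracle's block checksum, represents each block as a `(payload, stored_checksum)` tuple, and uses a hypothetical `rebuild_from_redo` callback in place of real redo log recovery.

```python
import zlib


def checksum(payload: bytes) -> int:
    # Stand-in checksum; Oracle's real block checksum is different.
    return zlib.crc32(payload)


def is_valid(block):
    payload, stored = block
    return checksum(payload) == stored


def resilver_block(copies, rebuild_from_redo):
    """Return the block that every mirror copy should hold after recovery.

    copies: list of (payload, stored_checksum) tuples, one per mirror copy.
    rebuild_from_redo: hypothetical callback returning a payload rebuilt
    from the redo log.
    """
    for block in copies:
        if is_valid(block):
            good = block  # case 1: a copy with a valid checksum exists
            break
    else:
        # Case 2: no copy is valid, so rebuild the block from redo.
        payload = rebuild_from_redo()
        good = (payload, checksum(payload))
    return [good for _ in copies]


good = (b"row data v2", checksum(b"row data v2"))
torn = (b"row data v2", checksum(b"row data v2") ^ 1)  # deliberately wrong checksum

print(resilver_block([torn, good], lambda: b"from redo") == [good, good])  # True
```

Note that the sketch works block by block and needs either one valid copy or usable redo. That matches the caution above: the redo logs themselves have nothing comparable to fall back on.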
Now, what do we do with the online redo logs (considering the caution above)?
IBM says the online redo logs are on shared concurrent storage, so according to IBM’s specs, MWC should be disabled.
Oracle tells us (in the Oracle Administrators Reference Guide for 9i) not to disable MWC on the redo logs, because resilvering is not performed on the redo logs.
Some people advised me to disable MWC on the redo logs as well, though this was based on gut feeling. However, a failure in the SAN could then leave your redo log logical volumes inconsistent, for example when the (SAN) link between two sites that each hold a copy of your data fails. Both nodes will keep working with their own copy of the data (assuming you turned off quorum checking in order to survive a site failure). Oracle cannot tell which of the two copies is authoritative, so it detects a corruption and will not start.
In order to solve this issue, we must first understand how MWC works.
Mirror Write Consistency II
All disks in a Volume Group have a specific sector/track on the outer edge of the disk assigned to the MWC cache (which, by the way, could imply that placing your mirrored LVs on the outer edge of the disk gives better performance on those LVs).
The last 128 writes are stored on this track. When a crash leaves your mirror inconsistent, the MWC cache comes into the picture.
When you restart your machine, the MWC cache is used to synchronize your mirror copies. This could still leave your volume group with corrupted data, but the mirror copies will at least be consistent with each other.
However, all nodes that have access to the logical volumes in this volume group use the same tracks on the disks for Mirror Write Consistency. This track has no locking mechanism, so there is no way to determine whether a write to the track accidentally overwrote information from another node, let alone which write came from which machine or in which order the information should be read.
This is why you must disable Mirror Write Consistency on Logical Volumes in a VG that is accessed concurrently by multiple nodes in the cluster. Any other configuration is not supported.
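The lost-update problem on the shared track can be shown with a deliberately simplified Python model. Everything here is an assumption made for illustration: the class names, the idea that each node flushes its own in-memory view over the whole track, and the block numbers. The real on-disk format is not modelled.

```python
class SharedMWCTrack:
    """Toy on-disk MWC track shared by every node: no lock, no owner field."""

    def __init__(self):
        self.entries = []


class Node:
    def __init__(self, name, track):
        self.name = name
        self.track = track
        self.recent = []  # this node's in-memory view of its own recent writes

    def write(self, block):
        self.recent.append(block)
        # Flush the in-memory view over the shared track, unaware of other nodes.
        self.track.entries = list(self.recent)


track = SharedMWCTrack()
node_a, node_b = Node("A", track), Node("B", track)
node_a.write(100)
node_b.write(200)

print(track.entries)  # [200]
```

After node B's write, the track no longer records that node A had block 100 in flight. If node A crashed at that moment, recovery based on the track would skip block 100, and nothing in the track says which node wrote the surviving entry.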
Solution for Redo Log Files
Taking all of the above together, the conclusion is that when you have mirrored logical volumes, you must put each redo log thread in its own volume group, dedicated to that thread and nothing else. These VGs still need to be concurrent capable and shared between all nodes in the cluster. However, an Oracle instance does not write to a thread that is not assigned to it (as specified in the parameter file), so no two nodes will be accessing the same Volume Group concurrently. You can therefore safely enable Mirror Write Consistency for these Logical Volumes, and the configuration is still supported by IBM.
Example storage architecture
|Element||Location||Mirror Write Consistency|
|UNDO, Temp and Control-/spfiles||VG2||Disabled|
|Redo Log Files Thread 1||VG3||Enabled|
|Redo Log Files Thread 2||VG4||Enabled|
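The rule behind this table can be written as a small consistency check. The sketch below is illustrative only. The rows are copied from the table, while the instance names `inst1` and `inst2`, and the mapping of which instance writes to which VG, are assumptions for a two-node cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VGLayout:
    element: str
    vg: str
    mwc_enabled: bool
    writers: frozenset  # instances that write to this VG (assumed names)


LAYOUT = [
    VGLayout("UNDO, Temp and Control-/spfiles", "VG2", False,
             frozenset({"inst1", "inst2"})),
    VGLayout("Redo Log Files Thread 1", "VG3", True, frozenset({"inst1"})),
    VGLayout("Redo Log Files Thread 2", "VG4", True, frozenset({"inst2"})),
]


def violations(layout):
    """MWC may be enabled only where exactly one instance writes to the VG."""
    return [row.vg for row in layout
            if row.mwc_enabled and len(row.writers) != 1]


print(violations(LAYOUT))  # []
```

If a redo thread's VG were ever shared by two writing instances, `violations` would list it. That is the situation in which MWC would have to be disabled again, leaving the redo logs without resilvering and without MWC.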