RAC vs. Data Guard or "Political High Availability"

On Thursday 14 December 2006 we had an AMIS Query here at AMIS called "High Availability: RAC vs. Data Guard". The Query was led by Oak Table Network members Carel Jan Engel and Jeroen Evers. The session was very informative and compared Data Guard (Carel Jan Engel) with RAC (Jeroen Evers), looking at high availability in terms of outage time and at data safety in the face of user errors.

 

Looking at the comparison made during the session, Data Guard has some very nice advantages over RAC, because Data Guard involves not one database but two (or more). The Data Guard solution uses a standby database, which makes it possible to configure a delay between the databases and thereby provide data safety against user errors. For example, a truncate of a table in the production database can be stopped before it is applied on the standby. Furthermore, the Data Guard solution works on an active/passive principle, and switching the database roles is easy and fast to accomplish (a few minutes is possible at the database level). So in principle production can run on site A and, in case of a failure of site A, the passive (standby) database on site B is brought up as production. Once site A is back up again, it can simply take over the role of passive (standby) database. So in the end it does not matter on which site production is running.
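As an illustration, such a delayed apply might be configured roughly as follows (a sketch in 10g-era syntax; the TNS alias standby_db and the 240-minute delay are just examples, not taken from the session):

    -- Sketch: ship redo to the standby immediately, but apply it only after
    -- a 4-hour delay, leaving a window to stop a user error (e.g. a truncate)
    -- from reaching the standby. 'standby_db' is an illustrative TNS alias.
    ALTER SYSTEM SET log_archive_dest_2 =
      'SERVICE=standby_db DELAY=240 VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)'
      SCOPE=BOTH;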

 

In practice the above is overshadowed by the fact that the passive site usually does not have the same amount of resources (CPU and memory) available. In that case the standby site is much slower than normal production, or resources have to be added before production can run on the standby site. On the other hand, if the active and passive sites are equally sized, the license fees for the Oracle software and the operating system are double those of the production capacity alone. In that case one possibility is to open the standby database for reporting purposes (management reporting), so that the standby hardware capacity is actually used. Another option is to run development and test systems on the standby hardware, and in case of a failover run only production there.
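To give an idea, opening a physical standby for reporting could look roughly like this (a minimal sketch; on releases of that era, redo apply is paused while the database is open read-only and resumed afterwards):

    -- Sketch: pause redo apply on the standby and open it read-only,
    -- so the standby hardware can serve (management) reporting queries
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
    ALTER DATABASE OPEN READ ONLY;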

 

When looking at RAC, the most important difference is that in the end it is just one database. The advantage is that only one dataset is involved: separate memory structures (instances) running on different machines use the same database (the same dataset/storage), so the capacity of all involved servers can be used for production. The big disadvantage is that there is no protection against user errors such as a truncate (with Data Guard there is, depending on the configuration and the configured delay).
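This "one database, several instances" picture is easy to see from SQL; a query like the sketch below simply lists one row per running instance, all of them serving the same set of datafiles:

    -- Sketch: on a RAC database GV$INSTANCE returns one row per running instance
    SELECT inst_id, instance_name, host_name, status
    FROM   gv$instance;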

 

In the end it is really important to define "High Availability". In the Query of 14 December 2006 it was positioned as the amount of unplanned downtime. I would personally like to add the term and definition of "Political High Availability". Although Data Guard needs very little time to fail over between the active and the passive site, it still needs a few minutes of real downtime (nobody can work). With RAC, when one server dies, end users are still able to use the system. In general the need to reconnect after a node/instance failure, when RAC is in place, is not considered to be downtime.
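That reconnect can even be made transparent to the application by configuring Transparent Application Failover on the database service. A rough sketch (assuming the 10gR2-style DBMS_SERVICE parameters, with 'OLTP' as an illustrative service name) could look like this:

    -- Sketch: server-side TAF settings on a service; sessions of a failed
    -- instance are reconnected to a surviving instance and in-flight SELECTs
    -- are resumed. Parameter names assume the 10gR2 DBMS_SERVICE API.
    BEGIN
      DBMS_SERVICE.MODIFY_SERVICE(
        service_name     => 'OLTP',
        failover_method  => 'BASIC',
        failover_type    => 'SELECT',
        failover_retries => 180,
        failover_delay   => 5);
    END;
    /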

 

For performance reasons it is good practice in RAC to separate the functions of the different nodes running an instance (the database servers), in order to prevent contention on the interconnect between the nodes and so reduce the overhead of RAC. This principle is easy to understand once you realise that using the same datasets on different RAC nodes means that database blocks are transferred ("block pinging") over the interconnect, because different rows used on different nodes are physically stored in the same database block. The bigger the database block size, the bigger the impact will be.
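The "several rows per block" point is easy to check for any given table; the sketch below (with 'orders' as an illustrative table name) counts how many rows share each database block:

    -- Sketch: rows per database block; rows that are used on different RAC
    -- nodes but live in the same block cause that block to be shipped over
    -- the interconnect
    SELECT DBMS_ROWID.ROWID_RELATIVE_FNO(rowid) AS file_no,
           DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) AS block_no,
           COUNT(*)                             AS rows_in_block
    FROM   orders
    GROUP  BY DBMS_ROWID.ROWID_RELATIVE_FNO(rowid),
              DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid)
    ORDER  BY rows_in_block DESC;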

 

Based on this performance principle, functional separation is very attractive in RAC, but it has a flip side. Think of a company operating in two countries, say the Netherlands and the UK; based on the performance principle we could choose to give each country its own server (a two-node RAC). In case of a failure of one of the two nodes, a complete country will have an outage. From an SLA (Service Level Agreement) point of view this is not a nice situation. For this I introduce the term "Political High Availability": it is better to mix the two countries over the nodes and in that way lose half of both countries rather than one whole country. From a management point of view, in the second situation we do not lose a country, and then it will not be called an outage but a disruption.

 

Another important point is that with an increasing number of RAC nodes, a lower percentage of users is affected when one of the servers dies. On the other hand, the more nodes in the RAC cluster, the more RAC overhead we get. This is more severe when the datasets used on the different servers are more similar. In Oracle Applications, for example, a table like FND_CONCURRENT_REQUESTS can cause a lot of overhead.
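Such overhead can be made visible per instance; as a sketch (assuming the segment-level global cache statistics available in the 10g views), the following shows how often blocks of this hot table are shipped between instances:

    -- Sketch: global cache block transfers for one hot table, per instance
    SELECT inst_id, statistic_name, value
    FROM   gv$segment_statistics
    WHERE  object_name    = 'FND_CONCURRENT_REQUESTS'
    AND    statistic_name IN ('gc cr blocks received', 'gc current blocks received')
    ORDER  BY inst_id, statistic_name;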

 

The complexity of the solutions also matters. The more complex the configuration, the bigger the chance that something goes wrong due to a maintenance error. A more complex solution also needs more skilled personnel and more expensive resources, and is therefore a more expensive solution.

 

For high availability over large distances it is recommended to use Data Guard, but for smaller distances of several kilometers it is possible to create an extended RAC cluster. When considering such a solution it is important to think of a three-site setup rather than a twin data center setup, because of data consistency: the third site typically hosts the tie-breaking quorum (voting disk), so that when one data center is lost the surviving site can safely continue.

 

From a definition point of view, I think the reasons to use RAC instead of Data Guard, or in combination with Data Guard (Oracle Maximum Availability Architecture), should be:

  • The “Political High Availability” is really important for the business.
  • There is not enough capacity in one server to handle the load.
  • Scalability is a requirement. 
