One of my clients had an issue with their Dataguard setup, after having to move tables and rebuild indexes the transport to their standby databases failed. The standby databases complained about not being able to fetch archivelogs from the primary database. In this short blog I will explain what happened and how I diagnosed the issue and fixed it.
Below you can see a diagram of the setup: a primary site with both a primary database and a standby database. At the remote site there are two standby databases both get their redo stream from the primary database.
This setup was working well for the company, but having two redo streams going to the remote site with limited bandwith can give issues when doing massive data manipulation. When the need arrived for doing massive table movements and rebuilding of indexes the generation of redo was too much for the WAN link and also to the local standby database. After trying to fix the standby databases for several days my help was requested because the standby databases were not able to fix the gaps in the redo stream.
While analyzing the issues I found that the standby databases failed to fetch archived logs from the primary database, usually you can fix this by using RMAN to supply the primary database with the archived logs needed for the standby, because in most cases the issue is that het archived logs have been deleted on the primary database. The client’s own DBA already supplied the required archived logs so the message was kind of misleading, the archived logs are there, but the primary doesn’t seem to be able to supply them.
When checking the alert log for the primary database there was no obvious sign that there was anything going on or going wrong. While searching for more information I discovered the default setting for the parameter log_archive_max_processes is 4. This setting controls the amount of processes available for archiving, redo transport and FAL servers. Now take a quick look at the drawing above and start counting with me: at least one for local archiving, and three for the redo transport to the three standby databases. So when one of the standby databases wants to fetch archived logs to fill in a gap, it may not be able to request this from the primary database. So time to fix it:
ALTER SYSTEM SET log_archive_max_processes=30 scope=both;
Now the fetching start working better, but I discovered some strange behaviour, the standby database closest to the primary database was still not able to fetch archive logs from the primary. The two remote standby databases were actually fetching some archived logs, so thats an improvement… but still, the alert log for the primary database was quiet silent… fortunately Oracle provides us with more server parameters: log_archive_trace. This setting enables extra logging for certain subprocesses. add the values in the linked documentation to see the desired logging: in this case 2048 and 128 for getting FAL server logging and redo transport logging.
ALTER SYSTEM SET log_archive_trace=2176 scope=both;
With this setting I was able to see that all 26 other archiver processes were busy with supplying one of the standby databases with archived logs. It seems to me that the database thats furthest behind will get the first go at the primary database…. Anyway, my first instinct was to have the local standby database fixed first so this one is available for failover, so by stopping the remote standby databases the local standby database was now able to fetch archived logs from the primary database. The next step is to start the other standby databases, to speed up things I started the first one and only after this database has completed its archive log gap I started the second database.
In conclusion, it’s important that you tune your settings for your environment: set log_archive_max_processes as appropriate and set your log level so you see what’s going on.
Please mind that both of these settings are also managed by the Dataguard Broker. To prevent warnings from Dataguard Broken make sure you set these parameters via dgmgrl:
edit database <<primary>> set property LogArchiveTrace=2176; edit database <<primary>> set property LogArchiveMaxProcesses=30;