The Tale of the Ardent SpringCleaning Maniac (ASM)

or

If you want to get rid of stuff, you can always do a good spring-cleaning. Or you can do what I do. Move.

Ellen DeGeneres

In order to prepare for a network reorganization, we needed to move our current Oracle Enterprise Manager machine into another network segment. So we grabbed the chance: instead of just moving the existing OEM 12c Rel. 3 machine, we decided to install a new Oracle Enterprise Manager 13c (13.2.2.0.0) onto two Red Hat Linux 6.9 machines and get rid of the SPOF (single point of failure) as well, in one elegant move.

Little did we know that we would almost cripple our RAC-clusters by just preparing for the new agents.

Here comes the story of this little adventure.

When we were finally ready to move the monitored targets into the new OMS, I started to install second agents on all the hosts, so we could still quickly switch back to EM12c in case we hit a major roadblock with 13c. I started with our single-instance development and test machines and then tentatively began to install on some acceptance machines. Most of our acceptance and production databases are 2-node RAC 11.2.0.4 with (Active) Data Guard, meaning each database consists of four instances on four different machines. Except for some connectivity problems in the network, everything went smoothly so far.

In order to install the new agents, I had to check each machine upfront for the required 1 GB of free space in /tmp. Often, I had to clean out a lot of old stuff from years ago: long-forgotten Oracle install logs, spool files from the developers' ad-hoc SQL queries, RPMs downloaded by the Linux administrators, never-checked Puppet logs and so on. I was intrigued that in all cases node 2 was more cluttered than node 1.

BTW: 1 GB was not even enough; I would advise going for at least 1.3 GB of free space in /tmp for the 13c agents.

At first, I removed only the old files owned by user oracle before I even considered removing root files or stuff other users had left behind years ago. And I did not touch Oracle files and directories younger than 4 weeks. Often that was sufficient to clear enough space to install the agent.
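Just to illustrate the kind of pre-check involved, here is a sketch of my own (the 28-day age limit and the oracle ownership filter simply mirror the rules described above; adapt to taste):

# how much room is left in /tmp? the 13c agent wants at least ~1.3 GB
df -h /tmp

# list cleanup candidates: files in /tmp owned by oracle, untouched for 4+ weeks, biggest first
find /tmp -xdev -user oracle -type f -mtime +28 -exec du -h {} + | sort -rh | head -20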

But there were machines which were particularly cluttered, and I was happily cleaning away until I had freed enough space in /tmp to install the agent via the GUI. The first node of the first cluster was ready at about 2018-01-29 15:10h, followed by node 2, after the necessary cleanup in /tmp, at about 15:45h. The newly installed agents on the cluster worked fine and produced some entries about “Compliance …”, which were quickly suppressed for the target. Everything seemed fine and I went home quite contented.
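A quick way to double-check such a freshly installed agent from the monitored host is to ask it how it is doing (the agent home path below is only an example, yours will differ):

# run as the agent owner on the monitored host
/u01/app/oracle/agent13c/agent_inst/bin/emctl status agent
/u01/app/oracle/agent13c/agent_inst/bin/emctl upload agent    # push the latest metric data to the OMS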

The next morning we noticed that one of the databases on this cluster behaved strangely and complained about not being able to complete its backup due to unrecognized files in the FRA. What?! Some hours later, other databases on this node likewise began to issue events in OMS 13c, and slowly hell broke loose, which reminded us of ASM issues we had had some weeks earlier after some updates. But the only change this machine had undergone lately was the new EM agent. No relation to ASM, forget about that idea…

We took a look into the alert log of the first complaining database (primary) and it looked like this (DB and instance names changed):

2018-01-29 22:00:00.031000 +01:00
Setting Resource Manager plan SCHEDULER[0x32D9]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Starting background process VKRM
VKRM started with pid=43, OS id=13348
Errors in file /u01/oracle/diag/rdbms/db1_prim/DB1_PRIM2/trace/DB1_PRIM2_m000_13400.trc:
ORA-01114: IO error writing block to file (block # )
2018-01-29 22:18:11.242000 +01:00
Errors in file /u01/oracle/diag/rdbms/db1_prim/DB1_PRIM2/trace/DB1_PRIM2_m001_35726.trc:
ORA-01114: IO error writing block to file (block # )
2018-01-29 22:28:11.814000 +01:00
Errors in file /u01/oracle/diag/rdbms/db1_prim/DB1_PRIM2/trace/DB1_PRIM2_m001_53927.trc:
ORA-01114: IO error writing block to file (block # )
2018-01-29 22:38:12.346000 +01:00
Errors in file /u01/oracle/diag/rdbms/db1_prim/DB1_PRIM2/trace/DB1_PRIM2_m001_65572.trc:
ORA-01114: IO error writing block to file (block # )
2018-01-29 22:48:12.872000 +01:00
Errors in file /u01/oracle/diag/rdbms/db1_prim/DB1_PRIM2/trace/DB1_PRIM2_m001_75911.trc:
ORA-01114: IO error writing block to file (block # )
2018-01-29 22:58:13.371000 +01:00
Errors in file /u01/oracle/diag/rdbms/db1_prim/DB1_PRIM2/trace/DB1_PRIM2_m001_86086.trc:
ORA-01114: IO error writing block to file (block # )
2018-01-29 23:05:10.655000 +01:00
Thread 2 advanced to log sequence 917 (LGWR switch)
  Current log# 3 seq# 917 mem# 0: +DATA01/db1_prim/onlinelog/group_3.294.918479161
  Current log# 3 seq# 917 mem# 1: +FRA01/db1_prim/onlinelog/group_3.306.918479161
2018-01-29 23:08:02.093000 +01:00
WARNING: ASM communication error: op 11 state 0x50 (3113)
ERROR: slave communication error with ASM
Unable to create archive log file '+FRA01'
Errors in file /u01/oracle/diag/rdbms/db1_prim/DB1_PRIM2/trace/DB1_PRIM2_arc0_18470.trc:
ORA-19816: WARNING: Files may exist in db_recovery_file_dest that are not known to database.
ORA-17502: ksfdcre:4 Failed to create file +FRA01
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 708 Serial number: 1895
*************************************************************
WARNING: A file of type ARCHIVED LOG may exist in
db_recovery_file_dest that is not known to the database.
Use the RMAN command CATALOG RECOVERY AREA to re-catalog
any such files. If files cannot be cataloged, then manually
delete them using OS command. This is most likely the
result of a crash during file creation.
*************************************************************
ARC0: Error 19504 Creating archive log file to '+FRA01'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance DB1_PRIM2 - Archival Error
ORA-16038: log 4 sequence# 916 cannot be archived
ORA-19504: failed to create file ""
ORA-00312: online log 4 thread 2: '+DATA01/db1_prim/onlinelog/group_4.295.918479161'
ORA-00312: online log 4 thread 2: '+FRA01/db1_prim/onlinelog/group_4.307.918479163'

Remember, the installation of the agents was in the afternoon at about 4 o'clock, the trouble seemed to have started at 22:00h, and not all databases on the cluster seemed to have a problem with ASM or their FRAs. The connection between FRA trouble and the agent was not obvious, I would say.

At that moment we did not 'see' the “WARNING: ASM communication error: op 11 state 0x50 (3113)” but reacted to the archival error instead. But of course, that did not solve the problem.
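The post does not spell out that reaction in detail; the usual first aid for ORA-16038/ORA-19816, as the alert log itself suggests, looks roughly like this (a sketch, not a transcript of what we typed):

# re-catalog unknown files in the FRA and clean up stale archive log records
rman target / <<'EOF'
catalog recovery area noprompt;
crosscheck archivelog all;
delete expired archivelog all;
EOF

With the ASM instance on that node effectively gone, none of this could get the archiver writing to +FRA01 again.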

When we had a look at the ASM alert log on node 2 of this cluster, it revealed that its ASM instance had crashed with ORA-29701… while node 1 still functioned normally!

A quick search on MOS for “ORA-29701” turned up nothing that seemed to fit our problem, until I widened the search to the net and almost accidentally found a website about Exadata, describing an ORA-29701 error of ASM and giving advice about “restoring a lost socket file” by restarting the complete CRS stack on the node, and, if that didn't do the trick, rebooting the server. The latter was a little tricky in this environment and takes some time to get approval, so I was prepared to do whatever was necessary to solve this problem, as long as it did not involve a reboot of the server.

But a lost socket file?!? They used to be in /var/tmp on Linux, if I was not mistaken, but I only touched /tmp…

I decided to go along with this advice and, as grid owner, stopped the CRS stack on the unhealthy node with crsctl stop crs, and with it all database instances on that node, which had been limping along as well or as badly as they could on one node.
Then I checked whether all processes were really down, which they weren't. So I wielded the kill -9 sword until everything that had anything to do with CRS or Oracle was gone from the process list (do I hear some deep groans about this?), and started anew with crsctl start crs; a rough sketch of the whole sequence follows below.
Luckily, the result was a node with CRS, ASM, listeners and all database instances started, as if the machine had just come back from a successful and smooth reboot. And even a new /tmp/.oracle directory… one of the old directories I had cleaned up in /tmp to make room for the agent installs!
Phew, blessed are the innocent! It all worked again.
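For completeness, the restart sequence boiled down to something like this; the Grid Infrastructure home path and the grep pattern are purely illustrative, and run the crsctl commands with the appropriate privileges for your setup:

# stop the clusterware stack on the unhealthy node (and with it the local database instances)
/u01/app/11.2.0/grid/bin/crsctl stop crs

# check what is still hanging around and kill the stragglers (yes, with -9)
ps -ef | egrep 'crsd|ocssd|evmd|ohasd|gipcd|oraagent|asm|ora_'
kill -9 <pid> ...                              # repeat until the list is empty

# bring everything back up and watch the resources return
/u01/app/11.2.0/grid/bin/crsctl start crs
/u01/app/11.2.0/grid/bin/crsctl stat res -t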

But what had caused all this trouble? According to this article, the socket files on Linux can be found in a hidden directory called .oracle. Depending on the *NIX dialect that could be under /tmp/.oracle, /usr/tmp/.oracle or /var/tmp/.oracle. Under Linux x86_64 this directory officially lives under /var/tmp, but Oracle also keeps a copy under /tmp/.oracle and stores socket files of the Oracle Net Listener and the Cluster Health Monitor in there (see DOC ID 1322234.1).
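If you want to see what actually lives in there, a read-only peek is harmless (these are the directories mentioned above; just don't clean them up):

# the clusterware and listener IPC endpoints are plain Unix domain sockets and lock files
ls -la /tmp/.oracle /var/tmp/.oracle
find /tmp/.oracle -maxdepth 1 -type s          # show only the socket files themselves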

The article also cited DOC ID 391790.1, which is not among the first 10 documents presented when you query MOS for “ORA-29701”, but which would have helped us find the culprit far quicker!

What an irony: later that day, Jan 30th 2018, Oracle released DOC ID 370605.1… which tries to remind us all again:
Never delete any hidden .oracle directory you come across, or fear for your cluster!

Have fun!