10gR2 RAC service failover and ORA-12545

A few years ago I was lucky enough to start working for a client that runs Oracle RAC clusters. We encountered lots of interesting issues there, and one of them was that several client applications could not connect to the RAC cluster whenever a service had failed over to an instance other than its default one.

The network environment was a bit different from what I was used to. There are three domains in use:
– domainA for the RAC and other servers,
– domainB for the desktop PCs and laptops,
– domainC for the Citrix farm and the applications hosted there.

[Figure: network overview]

The tnsnames.ora for domainB and domainC contained an entry like:

SERVICE_ONE=
(DESCRIPTION=
(ADDRESS_LIST=
(ADDRESS= (PROTOCOL=TCP) (HOST=IP_ONE) (PORT=1521))
(ADDRESS= (PROTOCOL=TCP) (HOST=IP_TWO) (PORT=1521))
(ADDRESS= (PROTOCOL=TCP) (HOST=IP_THREE) (PORT=1521))
)
(CONNECT_DATA=
(SERVICE_NAME=SERVICE_ONE)
)
)

Whenever SERVICE_ONE failed over to server2 (IP_TWO), the client applications could no longer reach the service and got an ORA-12545 / TNS-12545 error.

To investigate, I enabled SQL*Net client tracing at level 16 (SUPPORT).
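
For reference, client tracing of this kind is switched on in the sqlnet.ora of the client. The fragment below is only a sketch: the trace directory and file name are made-up placeholders, and the remaining values are one reasonable choice, not necessarily the exact settings used here.

```
# sqlnet.ora on the client
# SUPPORT is the same as trace level 16
TRACE_LEVEL_CLIENT = SUPPORT
# hypothetical directory and file name, pick your own
TRACE_DIRECTORY_CLIENT = C:\temp\sqlnet_trace
TRACE_FILE_CLIENT = client
# write a separate trace file per client process
TRACE_UNIQUE_CLIENT = ON
# prefix every trace line with a timestamp
TRACE_TIMESTAMP_CLIENT = ON
```

SUPPORT-level traces grow quickly, so it is worth removing or commenting out these lines again once the problem has been reproduced.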

The resulting client trace files show:
[26-AUG-2009 12:34:26:106] nscall: connecting…
[26-AUG-2009 12:34:26:106] nsc2addr: entry
[26-AUG-2009 12:34:26:106] nsc2addr: (ADDRESS=(PROTOCOL=TCP)(HOST=server2)(PORT=1521))
[26-AUG-2009 12:34:26:106] nttbnd2addr: entry
[26-AUG-2009 12:34:26:106] snlinGetAddrInfo: entry
[26-AUG-2009 12:34:26:106] snlinGetAddrInfo: Invalid IP address string server2
[26-AUG-2009 12:34:26:106] snlinFreeAddrInfo: entry
[26-AUG-2009 12:34:26:106] snlinFreeAddrInfo: exit
[26-AUG-2009 12:34:26:106] snlinGetAddrInfo: exit
[26-AUG-2009 12:34:26:106] nttbnd2addr: looking up IP addr for host: server2
[26-AUG-2009 12:34:26:106] snlinGetAddrInfo: entry
[26-AUG-2009 12:34:28:372] snlinGetAddrInfo: Name resolution failed for server2
[26-AUG-2009 12:34:28:372] snlinFreeAddrInfo: entry
[26-AUG-2009 12:34:28:372] snlinFreeAddrInfo: exit
[26-AUG-2009 12:34:28:372] snlinGetAddrInfo: exit
[26-AUG-2009 12:34:28:372] nttbnd2addr: *** hostname lookup failure! ***
[26-AUG-2009 12:34:28:372] nttbnd2addr: exit

It seems that only the short hostname is returned and not the Fully Qualified Domain Name (FQDN). Since the clients are located in different domains than the servers, they cannot resolve that hostname: they append their own default domain to the hostname they get from the listener.

If you search for “nsprecv: packet dump” in the client trace file, you can see the server name that the listener returned for that service_name, which was “server2” in the example above.

See also DocID 460982.1 on MOS for a failover case when you already had a working connection and that instance crashed.

DocID 333159.1 gave me the idea to set the local_listener parameter and use the IP of the server in the tnsnames.ora on the servers.
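
A minimal sketch of that kind of change, not a copy of our actual configuration: the alias LISTENER_SERVER2 and the instance name INST2 are made-up names for illustration, IP_TWO is the same placeholder as in the client entry above, and SCOPE=BOTH assumes the instance runs from an spfile.

In the tnsnames.ora on server2:

```
LISTENER_SERVER2=
(ADDRESS= (PROTOCOL=TCP) (HOST=IP_TWO) (PORT=1521))
```

Then, in SQL*Plus as SYSDBA:

```
-- point this instance at the alias that resolves to an IP address
ALTER SYSTEM SET local_listener='LISTENER_SERVER2' SCOPE=BOTH SID='INST2';
-- ask the instance to register its services with the listener again
ALTER SYSTEM REGISTER;
```

The same would be repeated on every node with its own IP, so that the address handed back to the clients is an IP rather than a short hostname.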

Now the SQL*Plus session did connect, and the client trace showed that the listener returned the IP address to use for SERVICE_ONE.

Problem solved.

After that we suddenly got a lot of “ORA-01013: user requested cancel of current operation” errors. It seems that the cancel button in one of the applications had never worked before and now started to work as well.

If anyone has an idea why only the short hostname is returned instead of the FQDN, I would like to hear it.

Some details about the environment:
Oracle RAC 10.2.0.3
Clusterware 10.2.0.4
Solaris x86-64
SunOS 5.10 Generic_137138-09 i86pc i386 i86pc