Recently I came across a situation where I knew for a fact that my database sessions were dead, because I had pulled the power plug out of my application server for a failover test. Yet the sessions stayed visible in the database and kept their locks, so the failover failed.
Now how is that possible?
First let me sketch the layout of the setup.
Suppose you have a vendor-supplied application that runs on the vendor's middleware. That middleware uses two application servers in a cluster, with only one of the application servers communicating with the database. Let's name this one AppServerOne.
The second application server (AppServerTwo) contacts AppServerOne when its clients need to talk to the database. But when AppServerOne becomes unresponsive, AppServerTwo will try to set up a connection to the database and check whether AppServerOne is indeed no longer handling the database requests.
In this case the middleware checks if AppServerOne has released a lock in the database that AppServerTwo can then take.
When I kill AppServerOne by killing an important process on that server, AppServerTwo indeed takes over and the clients do not experience a problem. But I want to make this a bit more realistic and decide to pull a power plug.
AppServerTwo still notices the problem with AppServerOne and tries to obtain the lock, but it fails to do so. AppServerTwo gives up trying after about 15 minutes.
High Availability down the drain! Now what?
I repeat the test and look a bit closer at what is happening in the database.
In the first test the sessions from AppServerOne are disappearing from the database, but in the second test they are still there after 18 minutes. In fact they disappear somewhere between 18 and 21 minutes.
This is not acceptable: clients should not notice the failover, and they do.
Of course I had created a sqlnet.ora on the database server with sqlnet.expire_time specified to enable Dead Connection Detection:
$ cat sqlnet.ora
sqlnet.expire_time = 2
But that is apparently not working as expected.
The vendor had specified that the following TCP/IP parameters had to be set at all the servers involved.
net.ipv4.tcp_keepalive_time = 180
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_intvl = 6
With these settings a stale socket is detected after 4 minutes (180 + (10 * 6) = 240 seconds). So they claim.
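The vendor's arithmetic is easy to check. Below is a minimal sketch, assuming the usual keepalive model: the first probe goes out after tcp_keepalive_time seconds of idle time, and the connection is declared dead after tcp_keepalive_probes unanswered probes sent tcp_keepalive_intvl seconds apart. The function name is made up for this illustration.

```python
def keepalive_detection_seconds(time_s, probes, intvl_s):
    """Seconds of silence before TCP keepalive gives up on an idle, dead peer.

    Assumes the socket has SO_KEEPALIVE enabled; without it the
    keepalive sysctls have no effect on that connection.
    """
    return time_s + probes * intvl_s


# Values the vendor asked for.
vendor = keepalive_detection_seconds(180, 10, 6)
# Defaults reported on Oracle Linux 6.5.
default = keepalive_detection_seconds(7200, 9, 75)

print("vendor settings :", vendor, "seconds =", vendor / 60, "minutes")
print("default settings:", default, "seconds =", default / 60, "minutes")
```

With the vendor's values this gives 240 seconds (4 minutes); with the defaults it gives 7875 seconds, a bit over two hours. This only models an idle connection, which is why it says nothing about a socket that still has unacknowledged data in flight.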
The default values (on Oracle Linux 6.5) are:
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
Hmm? Could the vendor be right after all?
I wanted to get to the bottom of this and needed to test more without causing problems on the application servers. So I created a test with a sqlplus session to a database. When I kill that session, the database notices that the session has died and removes it. But when I pull the power plug of the server that initiated the sqlplus session, I get the same behavior as with the application servers: the session lingers in the database.
Setting the parameters that the vendor wanted set didn’t change the test result. The sqlplus session still stays way too long in the database.
Using Google didn’t get me closer to a solution, but searching on Oracle Support led me to these documents:
“Performance problem with Oracle*Net Failover when TCP Network down (no IP address) (Doc ID 249213.1)”
“Tuning TCP/IP parameter in Linux Box for SQLNET (Doc ID 274953.1)”
“Dead Connection Detection (DCD) Explained (Doc ID 151972.1)”
The problem here is that the database cannot reach the network endpoint that initiated the connection (because it is powered off), and the network stack keeps trying to reach it for too long before giving up. Once it gives up, the database removes the session. That is probably one of the reasons that Oracle started using VIPs and SCAN listeners for RAC. Those will be started on the surviving nodes and thus reappear on the network.
The parameter that I had to change is: net.ipv4.tcp_retries2.
It defaults to 15 retries.
Once this was set to 3, the database sessions were removed quickly enough: in just a little over 4 minutes, which is still way more than the 2 minutes an unsuspecting DBA might expect.
Apparently it isn’t a linear function, as the time it took for a session to be removed varied between my tests. The description of the parameter net.ipv4.tcp_retries2 didn’t help me understand why, and I had to give up looking for a reason because I ran out of time.
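One plausible contributor to the non-linearity is that TCP retransmissions use exponential backoff. The Linux kernel documentation describes the timeout implied by tcp_retries2 as a hypothetical lower bound, computed from a retransmission timeout (RTO) that starts at 200 ms, doubles on every retry and is capped at 120 seconds. The sketch below reproduces that model; the function name is made up, and it is an illustration of the documented model, not a measurement of my systems.

```python
TCP_RTO_MIN_MS = 200       # smallest RTO the kernel uses (assumed starting point)
TCP_RTO_MAX_MS = 120_000   # RTO is capped at 120 seconds


def hypothetical_timeout_ms(retries2, rto_ms=TCP_RTO_MIN_MS):
    """Lower bound, in ms, before a connection with unacknowledged data is dropped.

    Sums the waits for the original transmission plus `retries2`
    retransmissions, doubling the RTO each time up to the cap.
    """
    total = 0
    for _ in range(retries2 + 1):
        total += min(rto_ms, TCP_RTO_MAX_MS)
        rto_ms *= 2
    return total


for n in (3, 5, 8, 15):
    print("tcp_retries2 =", n, "->", hypothetical_timeout_ms(n) / 1000, "seconds")
```

For the default of 15 this model gives 924.6 seconds (about 15.4 minutes), the figure quoted in the kernel documentation; for 3 it gives only 3 seconds. Real connections start from an RTO derived from the measured round-trip time, so actual timeouts are longer and vary from test to test. The model also does not by itself account for the four minutes I measured, which probably includes the wait until the next DCD probe is sent.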
If you are reading this blog because you have a similar problem, please make sure you test your situation carefully. And do not forget to make your changes persistent across reboots. On Oracle Linux 6 you edit the file /etc/sysctl.conf, but that can change in future versions.
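To check that the persistent file really contains what you tested with, something like the following could be used. It is a minimal sketch: the helper names are made up, the sample text stands in for the contents of /etc/sysctl.conf, and the wanted values are simply the ones from this post, so substitute whatever your own tests showed to work.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style 'key = value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' or ';' are comments.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_or_different(wanted, text):
    """Return the wanted keys whose value in `text` is absent or different."""
    found = parse_sysctl(text)
    return {k: found.get(k) for k, v in wanted.items() if found.get(k) != v}


# Stand-in for the contents of /etc/sysctl.conf (assumed sample).
sample = """
# TCP settings used in the failover test
net.ipv4.tcp_keepalive_time = 180
net.ipv4.tcp_keepalive_probes = 10
net.ipv4.tcp_keepalive_intvl = 6
net.ipv4.tcp_retries2 = 3
"""

wanted = {
    "net.ipv4.tcp_keepalive_time": "180",
    "net.ipv4.tcp_retries2": "3",
}

print(missing_or_different(wanted, sample))
```

An empty result means every wanted setting is present with the expected value. To check a real server, pass in the text read from /etc/sysctl.conf instead of the sample; after editing that file, sysctl -p loads it without a reboot.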
Hope This Helps