
Upgrade Failure as Blessing in Disguise?

or

The Findings of a Failed Enterprise Manager Upgrade to OEM 13.3

When Oracle Enterprise Manager Cloud Control 13.3 (here just called EM 13.3) became available, I read the upgrade documentation and decided that the most cost-effective option for us was to stay with EM 13.2.

Later, a rumor about a security issue with WebLogic spooked management, and they requested that we upgrade EM to 13.3 to be on the safe side and to keep Oracle support for Enterprise Manager.
EM 13.3 offered no really new features for us, only bug fixes we could also tackle with bundle patches on EM 13.2. The security issue with WebLogic would stay the same because both versions use the same WebLogic Server version. So, why bother with an upgrade?
But we were overruled and soon started to prepare for the upgrade.

The tale of this upgrade, or should I say nightmare, and our (temporary?) return to version 13.2 began when I wanted to check the prerequisites.

There was no Upgrade option visible in the OUI!

At first I nosed through MOS, and my impression was… this seems to happen rather often, and there was nothing really helpful to remedy it. Later I found a little notice in small print mentioning that it might occur that the option is not visible and that the customer should file an SR with Oracle.
I began to get a little concerned about this upgrade, and I tried every tip I found to get the installer into upgrade mode.

The rescue came when I talked to a colleague who mentioned, over a coffee and a cigarette, that somebody else had had the same problem, filed an SR, waited quite some time for an answer, and… finally got one from Oracle Support (something we currently no longer take for granted). And it even worked as intended!

The problem was that during some OMS patches, deeply buried XML files had been corrupted: extra attributes had been added, or values had been extended with extra numbers. The fix was to delete all of the malformed entries. It took me about three weeks to find and implement this solution, and during that time my thoughts went wild.

I remembered that we once had the problem that superadmin users could not change their password via the EM site. That problem could easily be solved by switching from the default JDK7_111 of EM 13.2 to JDK7_131. This JDK upgrade involuntarily resolved another issue we had an open SR for at the time, namely that the OMSPatcher would not start patching and failed with error code 235 (see one of my previous articles: https://technology.amis.nl/2018/07/17/sinatra-solution/ ). I had also found a knowledge article saying that one of the post-upgrade actions for EM 13.3 is upgrading the JDK7 from 171 to 211. And I naively thought: maybe it is a good idea to switch to this version even before the upgrade.
WRONG! When you do this, the upgrade does not even start, because this too "hides" the upgrade option in the OUI.
So, back to the former JDK version, 131, as our Oracle consultant suggested.
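For reference, this is roughly how such a JDK swap looks on the OMS host. It is only a sketch: the middleware home path and the JDK archive name are assumptions from our environment, and in EM 13.x the OMS picks up its JDK from oracle_common/jdk as far as we have seen – always check the relevant MOS note for your exact release first.

    # Sketch: swap the JDK used by the OMS (paths and versions are assumptions)
    export MW_HOME=/u01/app/oracle/middleware      # assumed middleware/OMS home
    $MW_HOME/bin/emctl stop oms -all               # stop the OMS before touching the JDK
    cd $MW_HOME/oracle_common
    mv jdk jdk_backup                              # keep the old JDK instead of deleting it
    tar xzf /tmp/jdk-7u131-linux-x64.tar.gz        # the JDK build you actually want to run on
    mv jdk1.7.0_131 jdk
    jdk/bin/java -version                          # sanity check
    $MW_HOME/bin/emctl start oms

Keeping the old JDK directory around instead of overwriting it would have saved us some grief later on.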

As soon as the option was visible again, all my natural caution went overboard, and on a Thursday afternoon at 15:00 I happily clicked the Next button to start the upgrade of our existing OMS 13.2 environment.
About 3.5 hours later the OUI reported successful execution. We had a 13.3 OMS environment. Let's go and check it out!

First it took about 10 minutes to start the OMS – then the errors appeared… The default HTTPS port we used to log in was changed or unreachable, network-related errors, certificate errors – you name it. But the main problem was the agents, which could not reach the OMS and upload on their formerly working upload ports.
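That picture came from the usual status commands on the OMS and on a test agent; a quick sketch of what we kept running (the home variables stand for the respective installation directories in our environment and are assumptions):

    # On the OMS host: OMS status including console and upload ports
    export MW_HOME=/u01/app/oracle/middleware                # assumed OMS/middleware home
    $MW_HOME/bin/emctl status oms -details

    # On an agent host: agent status and a manual upload attempt
    export AGENT_HOME=/u01/app/oracle/agent13c/agent_inst    # assumed agent instance home
    $AGENT_HOME/bin/emctl status agent
    $AGENT_HOME/bin/emctl pingOMS
    $AGENT_HOME/bin/emctl upload agent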

What the heck had happened here?

The upgrade creates a second directory tree parallel to the original EM 13.2 one, so we should be able to compare the two versions rather easily. But where to begin?

The first thing I noticed was that in OH/bin of the old version the file "emctl" had been renamed to "emctl_old", and we suspected that more files had been "switched off" like that than this single one. We did not follow up on our suspicion because the certificate errors were more pressing to investigate. But we were quite sure that we could not switch back to version 13.2 without a restore of the file system or a complete reinstall. We dreaded both options because they carried the risk of "misconfiguring" the installation. And dealing with our network guys is sometimes a real – challenge.
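Had we followed up on it, a recursive diff of the two parallel homes would have been the obvious way to find any other files "switched off" like that – a sketch, with hypothetical directory names:

    # Hypothetical paths for the old 13.2 and the new 13.3 homes
    OLD_HOME=/u01/app/oracle/middleware132
    NEW_HOME=/u01/app/oracle/middleware133

    # Files that exist on only one side, e.g. emctl vs. emctl_old
    diff -rq $OLD_HOME/bin $NEW_HOME/bin

    # Anything in the old tree that was renamed with an _old suffix
    find $OLD_HOME -name "*_old" -type f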
So, back to finding the underlying problem and correcting it – hopefully.

Next train of thought: the JDK again? Maybe it had problems reading the certificates? So the JDK was upgraded from 171 (the default in EM 13.3) to 211, but to no avail. So back to 171 – but it was late at night and I had accidentally overwritten it, so I had to look for the correct JDK to download and install. That cost valuable time again, and it was getting very late. So we decided to stop for the night and continue the next day, a Friday (but not the 13th 😉).

The main problem was that the OMS was up and running, but all the agents (still 13.2) could not connect on the upload port, and the OMS could not send commands to them, so they were unreachable. The weekend came, and we could not expect any help from the network guys to get a new trusted certificate if necessary. Re-securing only a test agent, un-securing the OMS and re-securing the OMS and the agents, re-creating the self-signed certificates, applying OMS patches – nothing seemed to work. After calling in an Oracle consultant specialized in EM, it crystallized that the problem seemed to be the Apache-WLS bridge and how the SSL certificates were handled by the Apache HTTP server. In the meantime all the research took so much time – time in which we could not monitor our (production) databases adequately (as a workaround I switched to scripts run in SQL Developer) – and we got more and more nervous. And still we tried to figure out where the errors came from.
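For completeness, these are the kind of commands that weekend consisted of, heavily abbreviated and with assumed home paths; emctl secure prompts for the agent registration password and, for the OMS, the SYSMAN password:

    # On a test agent host: re-secure the agent against the OMS
    $AGENT_HOME/bin/emctl secure agent

    # On the OMS host: allow unsecured uploads, then re-secure the OMS again
    $MW_HOME/bin/emctl secure unlock -upload
    $MW_HOME/bin/emctl secure oms
    $MW_HOME/bin/emctl stop oms -all
    $MW_HOME/bin/emctl start oms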

After almost a complete week we had to get our monitoring back. But what could we do? Install EM 13.3 afresh and run the risk of configuration errors, and lose more time getting new ports opened in the firewalls? Or return to EM 13.2? But how could we return to our old and homely 13.2 environment as quickly as possible?
The repository database had been upgraded to 13.3 but had been exported before the upgrade; the software tree, however, was not backed up (ouch!). We also could not switch the ports from HTTPS to HTTP, because the firewalls expected HTTPS traffic on these ports and blocked the HTTP traffic.
Finally, we were left with a flashback of the database to before the upgrade (which was still just possible on the day we decided to do it) and the hope that the old software tree had only the one renamed file I had found at the beginning of the troubleshooting.
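The flashback itself is standard Oracle fare. Something along these lines, assuming flashback logging is enabled, the target time is still covered by the flashback logs, and it is run as SYSDBA on the repository database host (the timestamp below is just a placeholder, not our real one):

    # Sketch: flash the repository database back to just before the upgrade
    sqlplus -S / as sysdba <<'EOF'
    shutdown immediate
    startup mount
    -- placeholder timestamp: the moment just before the 13.3 upgrade started
    flashback database to timestamp
      to_timestamp('2019-03-28 14:45:00', 'YYYY-MM-DD HH24:MI:SS');
    alter database open resetlogs;
    EOF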
I can only smile now and think of the motto on the Dollar bills: “In God we trust“…

… and He delivered! Even the Oracle expert was surprised that the old version could be reactivated by just renaming the "emctl_old" file back to "emctl"!
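In the end, reactivating the 13.2 OMS software came down to this single rename (plus the flashback of the repository described above); OLD_OMS_HOME is an assumed variable standing for the old 13.2 home:

    # In the old 13.2 home: give emctl its original name back and start the OMS
    cd $OLD_OMS_HOME/bin
    mv emctl_old emctl
    ./emctl start oms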

So, what did we learn?
  • Make backups – yes, I know, stating the obvious! (See the backup sketch after this list.)
  • Best to run EM 13.2 with JDK7_131 and leave EM 13.3 on its default JDK7_171.
  • And finally: the software configuration of OMS 13.2 can be reactivated by renaming just one file!
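On the first point: next time the pre-upgrade backup will look roughly like this. A sketch with assumed paths; as far as we know, emctl exportconfig oms is the supported way to capture the OMS configuration, and a Data Pump export of the repository serves as an extra safety net:

    # Capture the OMS configuration (prompts for the SYSMAN password)
    $MW_HOME/bin/emctl exportconfig oms -dir /backup/em

    # Tar up the software tree before the installer touches it (assumed locations)
    tar czf /backup/em/mw_home_pre_upgrade.tar.gz -C /u01/app/oracle middleware

    # Data Pump export of the repository database (prompts for the password)
    expdp system full=y directory=DATA_PUMP_DIR dumpfile=em_repos_pre_upgrade.dmp logfile=em_repos_pre_upgrade.log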

BTW: We still don't know what actually happened to our certificates during the upgrade… if anybody has an idea, please let us know.