How to dynamically Schedule EM Blackouts after PatchTuesday

Karin Kriebisch

In the organization I’m currently working for, the OS-patch schedule depends on PatchTuesday (PT) just as Microsoft, Adobe or Oracle are using. And PatchTuesday is, as you all know, the Second Tuesday of each month. But this is a schedule you can not set in the Oracle Enterprise Manager Scheduler. Also, hosts will be patched at different timestamps. Some hosts go at PT+2 (PT plus 2 days) at 3:00h and others at PT+3 at 5:30h. And when one node of a RAC is patched the other node should be in blackout as well … Still with me?!?

Why setting blackouts had to be automated …

Have look in the attached file where a complete patchround of the servers is described (You can switch to an other month by changing the date in field D1) and you get an inkling why we wanted an automated solution.
Due to this quite complicated timetable we often forgot to blackout the servers in service. That meant a “sudden” tsunami of mails from the EM Incident Manager when the final reboots were executed. And it often meant being disturbed at 3 or 5 o’clock in the morning by frantic pinging of our phones. And just to save our sleep, the blackouts had to be set automatically. Just to be sure they are set!

Here is what I came up with to rescue our good night sleep.

Some ruminations …

First I wanted to install a new scheduler, but one of the Rules of Engagement at this moment is “No New Tools”. Try to use what you already have.
My next idea was to use our EM-repository as much as possible. And I wanted to group the servers into dynamic groups in EM using their patch days and times as parameters somehow. But in order to be able to do that, each host target had to have a common attribute I could group by. And that attribute has to correlate to the patch moments somehow.

And in which language should I work? How or where should it be scheduled?

The solution develops …

The first question solved was the attribute, the servers should have in EM to be able to group them by. It came when one of the Linux admins gave us access to a site on which we could check the progress / success of the running patches. One of the fields in the table was called “PatchDay” and sorting on it showed the groups of server which are served concurrently. That is why I added the attribute “OSPatchDay” with below EMCLI -command to the host targets:

emcli add_target_property -target_type="host" -property="OSPatchDay"

And I know that all targets types got this attribute, but I only filled it for the host targets with strings like “PT+3_0500”. Originally this name is concatenated by the fields “PatchDay” (“PT+x”) and “PatchTime”. The “:” had to be dropped because otherwise the string could not be used as EM-attribute and I had no time to figure out how to escape that in EMCLI.

Then the dynamic groups were created and named, this time with the “:” in the string, like “PT+3_05:00” to make clear that these two correspond. The colon is later used as delimeter in PL/SQL to identify the timestamps more easily. And the group names are used to compose the blackout names later on.

Why adding the Partner Nodes…

So far I had enough data to query for the blackouts, but what to do about the cluster partners? Actually this question came up when the base script was already in production.

We noticed that sometimes during the patching of a RAC node the partner agents react to the absent node by preparing a node eviction when the downtime becomes too long. At times the eviction actually takes place. That does not seem to happen when the second node is also in blackout.
For that reason we opted for a second blackout on the partner node, saving us time, we don’t have to spend evaluating and/or repairing afterwards. And they are the ones to keep mailing …

But that was not straight forward to implement as one might think. The grouping of the partners was the issue here. Not all the patch days needed blackouts for partners and the partners could not be grouped into dynamic groups by an attribute, at least not without creating a new one. Because a second parameter seems less meaningful in the long run or other administrators, I decided to create “shadow”-groups and fill them manually. Mayby I will implement it one day with the second attribute.

For example: If you have a look in the attached Excel sheet , “PTS+3_05:00” would be the shadow group (PTS-group) of group “PT+3_05:00” (PT-group) and it only contains cluster partners which are not themselves members of the PT-group. In this case servers001 t/m 005 are not in the shadow group, but all other named partners are. Also, all servers without partners are never members of a PTS-group, nor have they the need for a shadow group.

As for the script, after some intense brooding, this issue was quite easy to solve. You will see why it was simple to amend, when you have a look at Script1.txt.

Now, I have groups of servers I could schedule to blackout, one question remained: Which scripting tools to use and how to schedule it?

Chosing the programming language(s)…

In PL/SQL it is easy to determine PT and create the date and time components from the EM-repository data, but I don’t dare to insert them as blackouts into the repository somewhere directly.
EM blackouts are best administered via the EM GUI, EMCTL or EMCLI. The last option is best for scripting, but is often used in conjunction with Python, which I’m not very much acquainted with. But I can script for Linux in bash, ksh or posix. That would mean I had to script in PL/SQL for a shell-script, which then has to log in into EMCLI, execute a couple of commands and then log off again. And I had to obfuscate all used passwords…
… but that is what I did.

And it became …

First, I searched the internet for a query to determine PT for the current month. Then I had a good look at the EMCLI-commands I needed to set or remove a blackout. Then I looked in my script-archive for an example of creating a good logging in order to get a readable report and I got inspired by Carlos Sierra’s eDB360 and the orachk tool. Only, that my resulting script is not even close to the sophistication of these two scripting examples.

The first script is a PL/SQL-script (see Script1.txt) which is scheduled as a EM job executing an SQL-script on the first of each month and it creates a shell-script full of EMCLI-commands like Script2.txt.
I could have scheduled the script on the EM-repos database, but it is more inconvenient to schedule SQL-scripts in the database scheduler than to create a SQL-job for the EM itself (which is actually the same database). And when the EM-machine itself gets patched and rebooted, a DB-scheduler job might be skipped afterwards, while EM-jobs are automatically re-scheduled when the OMS restarts.

The second script is scheduled in the crontab of the EM server on the 6th of the month. That gives us enough time to check and change, or repeat if necessary. It starts with the login to EMCLI, cleans the old blackouts of last month (if there are any), sets the blackouts for the planned servers and than adds the blackouts for the partner nodes. Finally it logs out and all is done.
And we can sleep without being disturbed by frantic mailing about “server down” in the Dead of Night…

Please, have fun with the scripts and as always, feel free to comment!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Next Post

AWS Migration part 1: how to migrate Windows Failover Clustering servers to AWS?

Facebook 0 Twitter Linkedin Introduction In this series, I will look at the migration from on-premise Windows Failover Clusters to AWS. What is the difference in recovery times between the application on-premise, the 1:1 migration of a Failover Cluster to AWS and the commonly used pattern of an Auto Scaling […]