Smoke Escaping from a Black Box – how to tackle major production issues on a system that is not very transparent

This story is based on true events. It happened fairly recently.

I was called into an organization that was experiencing performance and stability issues that affected mission-critical processes. Every now and again, and increasingly frequently, their service bus stopped working. It appeared as if only a restart could get things going again. There was no insight into the trigger nor the underlying problem that caused this loss of service.

Unfortunately – although it did give me the story you are reading – the system was not instrumented or monitored very well. In particular, an end-to-end overview of requests was not available, nor was there a way to search through and analyse the combined log files. We could not easily trace back to the minutes just prior to a system failure – what happened, how did the crash unfold? We could not really tell. Individual log files were available for inspection, and message reports from individual services were produced into the service bus management environment.

We needed to analyse events in a comprehensive, integrated way in order to understand what was going on inside the black box: in a way that allowed us to trace a request through the chain from left to right (from requestor through OSB to back-end services) and to understand activity and resource usage through the stack from top to bottom (from individual OSB services to WebLogic and JVM resources and even down to server status – CPU, memory, file handles). And we needed to do that fast – the upper management of this organization was not relaxed (at all).

The stack we were dealing with was Oracle Service Bus (still on 11gR1) on WebLogic, the Java 7 JDK and the Linux operating system. OSB calls out to JMS, SOAP and JDBC (with especially many calls to PL/SQL packages). Work flows into OSB through synchronous requests from external consumers, the workflow coordinator and the Web Portal. Additionally, multiple inbound adapters (pollers) check for input in FTP directories and database tables; anything these pollers encounter results in work performed by OSB.

Enter: Oracle Management Cloud

Application Performance Monitoring & Log Analytics with Oracle Management Cloud

An extremely quick way to start getting insight into the black box was the introduction of Oracle Management Cloud. Within half a day of deciding to leverage OMC, data started arriving in a cloud environment that immediately offered meaningful insights into the end-to-end and top-to-bottom activities inside the black box.

The figure below roughly shows the salient parts of the stack and the chain – and the OMC agents that have been added. Cloud agents were installed on each host. Subsequently, APM Agents were installed and configured in the WebLogic startup scripts (the APM Agent is a Java agent that is injected at JVM level through a JVM startup parameter). Finally, the Log Analytics agent was installed on each Managed Server host to gather relevant (WebLogic) log files. The agents send all information they gather to the OMC Cloud instance that was provisioned as a first step.

[Figure: the Cloud, APM and Log Analytics agents on the monitored hosts, reporting to the OMC instance]

In order to get going with this setup, our steps were:

  1. Create a free trial account on Oracle Cloud (valid for 30 days)
  2. Provision an OMC instance
  3. Download the agent installers from OMC
  4. Install and configure the agents – both on the host machines and in the cloud (see the sketch after this list for enabling the APM Java agent)
  5. Restart the WebLogic Managed Servers for the APM Java Agents to be enabled
  6. Start collecting metrics and log files, start visualizing and analyzing
  7. Grant OMC access to all stakeholders. Note that OMC is read-only: in order for operators to inspect the data gathered by OMC, they do not need any direct access to the stack and components under scrutiny; they only need access to OMC and can perform their analysis in the cloud environment.
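As a concrete illustration of steps 4 and 5: the APM agent is activated by adding a -javaagent argument to the JVM options of each Managed Server and then restarting it. The sketch below patches the domain environment script to do that; the domain path and the agent jar location are assumptions for illustration only – take the actual jar path from your agent installation.

```python
# Hedged sketch: add a -javaagent argument to the WebLogic domain environment
# script so every Managed Server JVM loads the APM agent at the next restart.
# setDomainEnv.sh is standard WebLogic; the domain path and the agent jar path
# below are assumptions -- use the locations created by your agent installation.
import shutil

DOMAIN_ENV = "/u01/domains/osb_domain/bin/setDomainEnv.sh"                      # assumption
APM_AGENT_JAR = "/u01/oracle/apmagent/lib/system/ApmAgentInstrumentation.jar"   # assumption
AGENT_FLAG = "-javaagent:%s" % APM_AGENT_JAR

def enable_apm_agent(domain_env=DOMAIN_ENV):
    with open(domain_env) as f:
        content = f.read()
    if AGENT_FLAG in content:
        print("APM agent already enabled in %s" % domain_env)
        return
    shutil.copy(domain_env, domain_env + ".bak")   # keep a backup of the original
    # Append the agent flag to JAVA_OPTIONS so it applies to all server JVMs.
    with open(domain_env, "a") as f:
        f.write('\nJAVA_OPTIONS="%s ${JAVA_OPTIONS}"\nexport JAVA_OPTIONS\n' % AGENT_FLAG)
    print("Added %s -- restart the Managed Servers to activate it" % AGENT_FLAG)

if __name__ == "__main__":
    enable_apm_agent()
```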

First Impressions

OMC provides insight into activity in the monitored systems: requests (usually HTTP requests) arriving from the outside world and resulting in Java threads performing work and making call-outs to external back ends such as the database, SOAP & REST services and JMS endpoints. With OMC Application Performance Monitoring, it is easy to get a quick overview of the number of requests (over time) – across all endpoints and services and per service. Along with the number of requests, OMC also offers quick insight into the number of errors and the average and maximum response times – as well as the more meaningful P25, P50, P90, P95 and P99 response times; again, OMC shows the evolution of these numbers over time.

[Figure: request volumes, error counts and response-time percentiles over time in APM]
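To make those percentile metrics concrete (this is just an illustration, not how OMC computes anything): for a skewed response-time distribution the average is dominated by a few outliers, while P50/P95/P99 tell you what a typical and a near-worst-case request experiences. A minimal Python sketch with made-up response times:

```python
# Illustration only -- not how OMC computes its metrics -- of why P95/P99 say
# more than the average when the response-time distribution is skewed.

def percentile(samples, p):
    """Return the p-th percentile (0-100) of a list of numbers (nearest-rank)."""
    ordered = sorted(samples)
    rank = max(1, int(round(p / 100.0 * len(ordered))))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds: mostly fast, two slow outliers.
response_times = [120, 130, 110, 140, 125, 135, 115, 4800, 150, 5200]

average = sum(response_times) / float(len(response_times))
print("average: %.0f ms" % average)      # ~1100 ms, dominated by the outliers

# P50 stays around 130 ms (the typical request); P95/P99 expose the slow tail.
for p in (50, 90, 95, 99):
    print("P%d: %d ms" % (p, percentile(response_times, p)))
```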

Server Requests handled by the managed servers can be inspected collectively over a specific period of time. Ordered by total calls, errors, maximum or average response time, we can visualize the characteristics of the server requests – most of which correspond in our case to inbound OSB service calls or inbound adapter activity.

[Figure: server requests ordered by total calls, errors and response times]

We can drill down on a specific server request – to learn more about its history, the distribution of its response time and also about what it does. Where is the time for handling the request spent? In which tier, on which activity? For services we do not know intimately – which at this organization applies to all services, as I am new here – this helps tremendously in understanding the end-to-end dependencies and activities. For example, the network diagram with the activities performed for a specific type of server request (i.e. an OSB service execution):

[Figure: topology of the activities performed for a single server request]

This diagram tells us a lot: what other OSB services are directly invoked (in green), which external services are invoked (in orange) and which database calls are made (in blue). I can make the size of the nodes reflect the relative number of times they are called, the time spent in the node and the number of errors originating in the node.

When hovering over a node, I see the details:

  • which service is invoked
  • which external resource is called
  • which SQL (or PL/SQL) statement is executed (against which JDBC Data Source)

This is unbelievably powerful stuff. Especially in a case like this:

[Figure: a server request averaging 48 seconds, with most time spent in a single PL/SQL package]

Here we have a server request that takes on average 48 seconds. That is not acceptable. I have never seen this OSB Service in my life. Without even looking at source code I can tell straightaway that this PL/SQL Package is the culprit – that is where the time is primarily spent when processing this request. So any remedial action should probably start with that package.

Error Analysis

Server Requests can be sorted by the number of errors they have produced, as is done here:

[Figure: server requests sorted by the number of errors]

I can now drill down on the server request listed with 12 errors. This gives me some insight into when, where and hopefully why these requests ended in an error. The diagram for this server request, with the arrow width set to the number of errors, is already quite revealing:

[Figure: server request topology with arrow width reflecting the number of errors]

The Instances tab provides the call stack with the exact names and timing for each step as well as the end status.

I can click on the View Related Logs link to navigate to Log Analytics, a different component of Oracle Management Cloud.

[Figure: navigating from APM to the related log entries in Log Analytics]

Log Analytics gathers log files from all observed entities – including WebLogic standard out, admin log and access log – and allows smart browsing, filtering, clustering and visualization of the log files. Additionally, alerts can be defined on the log data to notify operators when certain conditions are met.

In this case, we check for the associated log entries and cluster them – filtering for error logs only. Then we analyze the result. It is now easy to find the underlying cause of the failed server requests: a SQL timeout – after approximately two minutes.

[Figure: clustered error log entries revealing a SQL timeout after approximately two minutes]

This turns out not to be a technical problem: we have set our JTA timeout to 120 seconds. Whenever a JDBC call takes longer than that – as in quite a few cases it does – the application server intervenes and interrupts the operation. This is perceived by OSB as an error, but it is one we deliberately force upon ourselves. The PL/SQL call, when left to its own devices, completes successfully. However, it takes more than two minutes to do so.
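For reference, this is where that 120 seconds lives: the JTA timeout is a domain-level WebLogic setting that can be inspected and changed with WLST (WebLogic's Jython-based scripting tool). A hedged sketch – the admin URL, credentials and domain name are placeholders, and raising the timeout merely masks the slow PL/SQL call rather than fixing it:

```python
# WLST (Jython) sketch -- run it in the wlst.sh shell that ships with WebLogic.
# Admin URL, credentials and the 'osb_domain' name below are placeholders.
connect('weblogic', 'welcome1', 't3://adminhost:7001')

edit()
startEdit()

cd('/JTA/osb_domain')
print('Current JTA TimeoutSeconds: %s' % cmo.getTimeoutSeconds())   # 120 in our case

# Example change only: raising the timeout masks the slow PL/SQL call,
# it does not make the call any faster.
cmo.setTimeoutSeconds(300)

save()
activate()
disconnect()
```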

Inspecting Log Files

It is just as easy to check the log files for other errors – to see when they occur and to find the specific types (or clusters) of these errors.

Here I am checking for all error log entries in the last 8 hours:

[Figure: all error log entries from the last 8 hours]

There are 98 of them. Mmm, quite a bunch to inspect one by one.

However, using the Cluster option, I can group these messages into clusters of highly similar messages – messages in which, for example, server names, directories or data source names differ but which are otherwise the same. This results in:

[Figure: the error messages grouped into clusters of similar messages]

Just 15 clusters – one of which accounts for 82 of the error messages. This helps tremendously to quickly zoom in on the frequently occurring error message patterns – which typically all indicate the same problem.
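The clustering itself is built into Log Analytics, but the underlying idea is easy to illustrate: mask the fragments of a message that vary (server names, paths, data source names, numbers) and group on the skeleton that remains. A rough Python sketch of that idea, with made-up log lines – not the actual algorithm OMC uses:

```python
# Rough illustration of the clustering idea -- not OMC's actual algorithm:
# mask the variable fragments of each message and group on the remaining skeleton.
import re
from collections import Counter

def skeleton(message):
    """Reduce a log message to a pattern by masking its variable parts."""
    s = re.sub(r'/[\w./-]+', '<PATH>', message)              # file system paths
    s = re.sub(r'\b\w*DataSource\w*\b', '<DATASOURCE>', s)   # data source names
    s = re.sub(r'\bosb_server\d+\b', '<SERVER>', s)          # managed server names (made up)
    s = re.sub(r'\d+', '<NUM>', s)                           # ports, timings, ids
    return s

# Made-up error lines of the kind a WebLogic log might contain.
logs = [
    "Connection for pool OrderDataSource timed out on osb_server1 after 120 seconds",
    "Connection for pool CustomerDataSource timed out on osb_server2 after 121 seconds",
    "Cannot read file /u01/ftp/in/order_20180412.xml on osb_server1",
]

clusters = Counter(skeleton(line) for line in logs)
for pattern, count in clusters.most_common():
    print("%3d x %s" % (count, pattern))
```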

Conclusion

In just a few days, using Oracle Management Cloud, we were able to get control over our environment. Instead of a black box from which clouds of smoke escaped, we started to regard the OSB as a tameable beast. We did not understand everything, but we could determine much more easily what was going on, where, and in conjunction with what – across the stack from top to bottom (also using OMC Infrastructure Monitoring, which I have not touched upon in this article) and from proxy to business service and beyond (especially into the database). From this understanding of what, where and when, we could start to implement controlling measures: throttling settings on all inbound database and FTP adapters, an investigation into the longest-running SQL and PL/SQL jobs, and a change in timeout settings. Oh, and a number of alerts with automated notifications, so we do not have to check manually all the time – or wait for end users to start complaining.

As it happens, the organization was so impressed with what Oracle Management Cloud can provide in terms of insight, proactive alerting and [root cause] analysis capabilities that it has decided to continue after the trial period. A quick first cost estimate suggests that the organization is looking at $3200 per month (Pay as You Go) or $2500 per month on Monthly Flex (12-month commitment) for the scope they require.

Resources

Oracle Management Cloud – Application Performance Monitoring: https://cloud.oracle.com/en_US/application-performance-monitoring 

Oracle Netherlands Cloud Day (SlideShare) – How Oracle Management Cloud enabled a successful scratch of a 7-year-old performance itch