In this article I will try to explain how to monitor your Dell RAID controller with Cloud Control when running Oracle VM Server 3.x (OVS 3.x).
Some time ago a physical drive failed in one of our Dell PowerEdge servers. This caused no immediate problem, because the drive was part of a RAID set on the built-in Dell PowerEdge RAID Controller (hereafter PERC). However, we did not notice the failure until a second physical disk also broke down, which made the whole RAID set unavailable.
The technical part of replacing the disks and recovering has been done, but we were left with the issue of not having been notified of the first failure at all. So I started a quest on the web to find out how to prevent this from happening again.
The first solution I found was (of course) to install Dell OpenManage. It looks promising, but unfortunately it is not certified for (and does not work on) a server running Oracle VM Server 3.x, which we use on almost all our hardware. I tried, but after installing the software the server refused to start after a reboot.
Next try: SNMP… Unfortunately, all the SNMP and PERC related information that can be found is based on the Dell OpenManage software, which we could not use, as described above.
Crawling the web, I finally bumped into some blog articles stating that the PERC is a branded MegaRAID adapter. One of them mentioned something called MegaCLI (MegaRAID Command Line Interface). Aha, a new hook into a possible solution?
Yep! With this information as a new starting point I was able to retrieve enough information to use the command line to query the PERC. And if it’s command line, we can script, and if we can script, we can monitor with Oracle Cloud Control (or any other monitoring tool).
OK, enough blabla, let’s walk through the steps required to get this thing moving.
First we need to download two small RPMs to be installed on the server:
- Lib_Utils-1.00-09.noarch.rpm, which can be found here
- MegaCli-8.02.16-1.i386.rpm, which can be found on the support site of LSI. Download this zip file, which contains the RPM (and other tools for different operating systems).
- Login on your host as root and navigate to /tmp
- Install both RPMs
[root@host tmp]# yum localinstall Lib_Utils-1.00-09.noarch.rpm --nogpgcheck
[root@host tmp]# yum localinstall MegaCli-8.02.16-1.i386.rpm --nogpgcheck
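- Create a soft link in /usr/sbin to the MegaCli executable using one of these statements, depending on which binary is present in /opt/MegaRAID/MegaCli (MegaCli64 is the 64-bit build). Note the -s option: without it, ln creates a hard link instead of a soft link.
[root@host tmp]# ln -s /opt/MegaRAID/MegaCli/MegaCli /usr/sbin/MegaCli
[root@host tmp]# ln -s /opt/MegaRAID/MegaCli/MegaCli64 /usr/sbin/MegaCli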
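Which of the two binaries exists differs per installation, so the choice can also be left to a small script. The following is only a minimal sketch, assuming the install directory used above; the function name link_megacli and its two optional parameters are made up for this example. It uses ln -s, which is what actually creates a soft link.

```shell
#!/bin/bash
# Hypothetical helper: link whichever MegaCli binary exists in the
# install directory to a fixed path, preferring the 64-bit build.
link_megacli() {
    local install_dir="${1:-/opt/MegaRAID/MegaCli}"
    local link_path="${2:-/usr/sbin/MegaCli}"
    local candidate

    for candidate in MegaCli64 MegaCli; do
        if [ -x "${install_dir}/${candidate}" ]; then
            # -s creates a symbolic link, -f replaces an existing link
            ln -sf "${install_dir}/${candidate}" "${link_path}"
            echo "${link_path} -> ${install_dir}/${candidate}"
            return 0
        fi
    done

    echo "No MegaCli binary found in ${install_dir}" >&2
    return 1
}
```

Run as root, calling link_megacli without arguments would use the default paths shown in the function.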
- Test the functionality by executing the following command, which should show the number of RAID controllers in the system
[root@host tmp]# /usr/sbin/MegaCli -adpCount
- Execute the following command, which returns the number of physical disks. The output will be used later when configuring Cloud Control
[root@host tmp]# /usr/sbin/MegaCli -PdGetNum -a0 -NoLog |grep Number
- If the command above executes without error, you can execute /usr/sbin/MegaCli -h to get a lot of help information describing the huge number of options.
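If you only want the bare number from that output (for example to note it down for the thresholds later on), the line can be reduced with awk. This is a small sketch and not part of the Cloud Control setup; the function name pd_count is hypothetical, and it assumes the output line has the "text: number" form shown in this article.

```shell
#!/bin/bash
# Hypothetical helper: read MegaCli output on stdin and print only the
# disk count from the "Number of Physical Drives" line.
pd_count() {
    awk -F: '/Number of Physical Drives/ {
        gsub(/[^0-9]/, "", $2)   # keep digits only (drops spaces, CR)
        print $2
        found = 1
        exit
    }
    END { if (!found) exit 1 }   # non-zero status if no such line was seen
    '
}
```

It would be used as: /usr/sbin/MegaCli -PdGetNum -a0 -NoLog | pd_count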
When the part above has been executed successfully, we can proceed with the next part: getting Oracle Cloud Control to monitor the PERC.
At this point I assume you already have a Cloud Control agent running correctly on the specific host. If not, you have to install and configure one before you continue.
In Cloud Control 12c we have a beautiful feature called Metric Extensions.
Quote from the documentation:
Metric Extensions enhance Enterprise Manager’s monitoring capability by allowing you to create new metrics to monitor conditions specific to your environment. These Metric Extensions can be added to any target monitored by Enterprise Manager. Once developed and deployed to your targets, the metrics integrate seamlessly with the Oracle-provided metrics.
This means that almost anything you can execute on a command line interface (CLI) that gives a formatted result can be used as a metric. This can be at the OS prompt, SQL, RMAN, ODI, Dell OpenManage, Microsoft SQL etc.
The development cycle for a Metric Extension runs from creating and testing the metric, through saving it as a Deployable Draft, to deploying it to targets.
Since all commands to the PERC have to be executed as root, I decided (for simplicity) to use the Monitoring Credential facility in Cloud Control for this with the root account. You could also set it up to use sudo and a specific user account, but that is beyond the scope of this blog post.
In the next steps we will create a metric which will alert us when the number of available disks changes (i.e. a disk fails or is removed). Based on this example you should be able to create your own variants depending on the requirements.
- Log in to Cloud Control
- Navigate to <Setup><Security><Monitoring Credentials>
- Select the <Host> target type and click on <Manage Monitoring Credentials>.
Next you will see a list of all hosts in Cloud Control with 3 Credential Sets. We will use the set called “Host Credentials For Real-time Configuration Change Monitoring”.
- Select the required line (hostname-credentialset) and click <Set Credentials>.
- Fill in the username (root) and the corresponding password.
- Click on <Test and Save> to store the password.
When the security has been set up, we can start creating the Metric Extension.
- Navigate to <Enterprise><Monitoring><Metric Extensions>.
- Click on <Actions><Create> to start the wizard which will assist you in creating the Metric extension.
On the first screen we set the general settings for this metric.
- Select the Target Type <Host>
- Give the Metric Extension a name; I used ME$Raid_PD_Count
- Give the metric a useful Display Name, e.g. Raid Physical Disk count
- Set the Adapter type to “OS Command – Multiple Columns”
- Add a description if desired and leave the Collection schedule on default settings
- Click <Next> to proceed to the Adapter screen
The Adapter screen defines how a specific query is executed. A proper description of the options is on the right side of the screen.
- Since we have to execute a (very small) script the command we will use is ‘/bin/bash’
- Click on the small pencil behind the script box
- As Filename we use “RaidPhDiskCount”
- In the File Contents box paste the following line:
/usr/sbin/MegaCli -PdGetNum -a0 -NoLog |grep Number
- Click <OK>
Notice that the “Script” textbox has been filled with “%scriptsDir%/RaidPhDiskCount”. The script has also been added to the “Custom Files” at the bottom left of the screen.
If you take a closer look at the output generated earlier by “/usr/sbin/MegaCli -PdGetNum -a0 -NoLog |grep Number”, you will see it contains some text and a number, separated by a : (colon).
Number of Physical Drives on Adapter 0: 6
- Based on the above we put a : (colon) in the Delimiter field.
- Click <Next> to proceed to the Columns page
On the Columns page we have to define each column that exists in the output of our command. As we can see above, the output contains 2 columns, separated by a colon.
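To see how the colon divides the line into these two columns, the split can be imitated in the shell. This only illustrates the idea under the assumption that the line is split on the first colon; it is not how Cloud Control parses the output internally, and the function name split_columns is made up for this example. The labels match the column names defined in the steps that follow.

```shell
#!/bin/bash
# Split one "description: value" line on the first colon, mimicking how
# a ":" delimiter yields a key column and a data column.
split_columns() {
    local line="$1"
    local key="${line%%:*}"      # text before the first colon
    local value="${line#*:}"     # text after the first colon
    # remove all whitespace from the value
    value=$(printf '%s' "$value" | tr -d '[:space:]')
    printf 'Description=%s\n' "$key"
    printf 'PhysicalDiskcount=%s\n' "$value"
}

split_columns "Number of Physical Drives on Adapter 0: 6"
```

The first printed line corresponds to the key column and the second to the data column.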
For each column in the output we have to define whether it is a key column or a data column (containing the measurement data). For a data column we can also specify the default thresholds for the warning and critical levels. Note: the suggested values for warning and critical below are not typos. We will correct them at a later stage.
- Click <Add><new metric column> to add the first column
- In the Name box write Description, and the same goes in the Display Name box.
- The column Type should be Key Column and the Value Type is String
- Click <OK> to save
- Click <Add><new metric column> to add a second column
- In the Name box write PhysicalDiskcount, Display Name will be “Physical Disks”
- The Column Type will be Data Column with Number as Value Type
- Comparison Operator should be set to <, warning level to 1 and critical to 0.7
- Change the Alert Message to “Number of available disks on raid controller is degraded to %value%”
- Click <OK> to save, and <Next> to proceed to the Credentials page
On the Credentials page we select which credential set should be used to measure this specific metric. Earlier we prepared the “Host Credentials For Real-time Configuration Change Monitoring” set for this.
- Select the “Specify Credential Set” radio button and select the correct credential set if this has not been done automatically.
- Click <Next> to go to the Test page
The Test page offers the possibility to test the metric and check the output. The metric can be tested against all targets of the correct type if required.
- Click <Add> and select 1 (or more) targets where you want to test the metric. Click <Select>
- Select the target you want to test against and click <Run test>.
- Cloud Control will execute the test and present the results in the bottom half of the screen. If an error message is thrown, you can go back in the wizard to correct the settings. After the correction, return to this page and retry.
- If you’re happy with the test results, click <Next> to go to the Review page.
As could be expected based on the name, you can once more review all settings for this metric and click <Finish> to save and close.
When the Metric Extension has been tested and saved, the next step is to save it as “Deployable Draft”. From this point on, it cannot be modified anymore.
- Select the Metric Extension
- Click on <Actions><Save as Deployable Draft>
Once a Metric Extension has reached the Deployable Draft status, it can be deployed to one or more servers to test it in real life.
- Select the Metric Extension
- Click on <Actions><Deploy to Targets…> to open the Deployment screen.
- Click on <Add>
- Select the target(s) you want to deploy to and click <Select>
- Click <Submit> to start deployment
At this stage the metric is deployed to our server, which means that it is executed every 15 minutes, the results are stored in the database and alerts can be generated. However, the metric needs some small tweaks to work properly. Remember we set the warning level to 1 and the critical level to 0.7?
- In Cloud Control navigate to the homepage of the host involved.
- Click on <Host><Monitoring><Metric and collection settings>
- On this page you see an overview of the active Metrics on this host. Locate the Metric we just created.
- Find out how many physical disks this particular host contains by executing the following command on the specific host (as root)
/usr/sbin/MegaCli -PdGetNum -a0 -NoLog |grep Number
- I want to be notified as soon as the number of disks is lower than it should be (meaning a disk is broken or removed). In my opinion this is always a critical situation. However, Cloud Control requires that the warning and critical values are both filled in and differ from each other. For this reason the warning threshold should be equal to the number of physical disks, and the critical threshold 0.5 (half a disk :-)) lower. So, if you have 6 physical disks, the warning threshold is 6 and the critical threshold is 5.5.
The result of this is that, as soon as 1 disk is gone, the value is below the critical threshold which should generate a critical alert.
- Click <OK> to continue
- Click <OK>
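The threshold arithmetic can be checked with a few lines of shell. This is a simplified model of how the < comparison operator is applied to the two thresholds, not Cloud Control's actual implementation; the function name disk_alert_level is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: derive both thresholds from the expected disk
# count and classify a measured value using the "<" comparison operator.
disk_alert_level() {
    local expected="$1" measured="$2"
    awk -v n="$expected" -v v="$measured" 'BEGIN {
        warning  = n          # e.g. 6
        critical = n - 0.5    # e.g. 5.5
        if (v < critical)     print "CRITICAL"
        else if (v < warning) print "WARNING"
        else                  print "OK"
    }'
}

disk_alert_level 6 6   # all disks present: prints OK
disk_alert_level 6 5   # one disk missing: prints CRITICAL
```

Because the disk count is always a whole number, a value can never fall between 5.5 and 6, so in this model the warning level is never reported and the first missing disk goes straight to critical.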
From this point on Cloud Control will monitor the PERC in your host every 15 minutes and generate an incident as soon as something is wrong. Of course, you will need to configure Cloud Control to send alerts to your mailbox, pager or ticketing system, but I assume (and hope) that this has been done already if you are already using Cloud Control.