How to Set Up NAL (Nagios Alarm Handler) to monitor an EPICS network

From EPICSWIKI

This is a Wiki: please add to or correct anything you find wrong, misleading or out of date.


Here's how to install NAL using yum on a Red Hat Enterprise Linux 5 x86 box.

Nagios default installation

Nagios is provided by the rpmforge repository, so you first have to download and install the rpmforge release package to configure yum properly.

    root> wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.2-2.el5.rf.i386.rpm 


Before installing the package, import DAG's GPG key:

    root>  rpm --import http://apt.sw.be/RPM-GPG-KEY.dag.txt 


(Optional) Verify the package you have downloaded

    root>  rpm -K rpmforge-release-0.5.2-2.el5.rf.i386.rpm 


NB (Security warning): The rpmforge-release package imports GPG keys into your RPM database. As long as you have verified the md5sum of the key injection package, and trust Dag, et al., then it should be as safe as your trust of them extends.


Install the package


    root> rpm -i rpmforge-release-0.5.2-2.el5.rf.i386.rpm


At this point it is possible to install Nagios. For the "server side", you need the following packages:

1) nagios: the main application package
2) nagios-plugins: provides all the command scripts used to define Nagios services. In some cases there is also nagios-plugins-all (prefer it when available)
3) nagios-plugins-nrpe: provides the check_nrpe script used to communicate with nagios clients and run remote services


    root> yum install -y nagios
    root> yum install -y nagios-plugins
    root> yum install -y nagios-plugins-nrpe

With these commands I installed Nagios version 3.2.3.

Nagios: configuration

When you install Nagios with yum, all the Apache configuration is done by default.

To access the web interface you must define the password for the nagiosadmin user (the default Nagios administrator). This password must be stored encrypted: you can use the htpasswd command to set it and save it in

         /etc/nagios/htpasswd.users


Start the apache and nagios services

    root> service httpd start (restart)
    root> service nagios start

and check the Nagios web page at http://localhost.localdomain/nagios . If everything is correct, you should see the authentication popup. Once you are on the main page, you can monitor the localhost machine (Nagios provides some information about hosts and services); all the services should be OK, but in some cases you may have to check some permissions or configuration settings.


The main configuration file is /etc/nagios/nagios.cfg

In this file you can configure every feature of Nagios. We use most of the default options; the only parameter we enable is:

  • enable the servers directory: you can put all the servers' cfg files into this directory (this keeps things tidy)
 cfg_dir=/etc/nagios/servers
and create the folder:
 # mkdir /etc/nagios/servers
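
The two steps above can be scripted so that they are safe to repeat. This is only a minimal sketch: the function name enable_cfg_dir is invented for illustration, and it assumes the cfg_dir line is not already present in a commented-out or differently spaced form.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): add a cfg_dir line to a
# Nagios main config file only if it is missing, and create the directory.
enable_cfg_dir() {
    cfg_file=$1    # e.g. /etc/nagios/nagios.cfg
    cfg_dir=$2     # e.g. /etc/nagios/servers

    mkdir -p "$cfg_dir" || return 1

    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "cfg_dir=$cfg_dir" "$cfg_file"; then
        echo "cfg_dir=$cfg_dir" >> "$cfg_file"
    fi
}

# Example (as root):
#   enable_cfg_dir /etc/nagios/nagios.cfg /etc/nagios/servers
```

Running the function twice leaves a single cfg_dir line in the file, which is the main reason to prefer it over a plain append.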


In the servers folder you have to define all the hosts you want to monitor. To keep things manageable, define two kinds of files:

  • HOST.cfg: defines the settings that Nagios uses to monitor the desired host. You must define one file per host!
  • groups.cfg: lists all the different groups of hosts. It is very useful for managing and monitoring a large number of machines

Example:

  • File servers/example.cfg:
define host{
       use                     linux-server            
       host_name               example
       alias                   example display in web interface
       address                 10.6.0.1
       notification_period     24x7
       icon_image              example.jpg
       }
define service{
       use                             local-service   
       host_name                       example
       service_description             PING
       check_command                   check_ping!100.0,20%!500.0,60%
       }
define service{
       use                             local-service   
       host_name                       example
       service_description             SSH
       check_command                   check_ssh
       }
In this code:
1) notification_period is defined in /etc/nagios/objects/timeperiods.cfg --> you can edit this file to add or change time periods
2) icon_image is located in /usr/share/nagios/images/logos/. If you want to add new images you must save them there
3) service_description is the service name displayed in the web interface
4) check_command names the command to run; the plugins are located in /usr/lib/nagios/plugins
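
Since every monitored host needs its own file with the same structure, the definition can be generated from a host name and an address. A minimal sketch, assuming the linux-server and local-service templates used in the example above exist in your configuration; the function name make_host_cfg is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print a host definition plus a PING service,
# mirroring the servers/example.cfg layout shown above.
make_host_cfg() {
    name=$1      # Nagios host_name
    address=$2   # IP address or DNS name of the host

    cat <<EOF
define host{
       use                     linux-server
       host_name               $name
       alias                   $name
       address                 $address
       notification_period     24x7
       }
define service{
       use                     local-service
       host_name               $name
       service_description     PING
       check_command           check_ping!100.0,20%!500.0,60%
       }
EOF
}

# Example (as root):
#   make_host_cfg example 10.6.0.1 > /etc/nagios/servers/example.cfg
```

The generated file is only a starting point: add further services (SSH, EPICS PVs) by hand or extend the template.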


  • File servers/groups.cfg:
define hostgroup{
       hostgroup_name  example ; The name of the hostgroup
       alias           example @ MyLab ; Long name of the group
       members         localhost, example
       }

The members variable must list host_name values that have already been defined.


After these changes, verify the configuration files through

# nagios -v /etc/nagios/nagios.cfg

and then, if there aren't any errors, restart the service

# service nagios restart
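
These two commands are usually run together, so a small wrapper can make sure Nagios is never restarted with a broken configuration. A sketch only: the function name and the NAGIOS_VERIFY and NAGIOS_RESTART variables are hypothetical, introduced here so that the commands can be overridden.

```shell
#!/bin/sh
# Hypothetical wrapper: restart Nagios only when the configuration verifies.
# Both commands can be overridden through the environment (an assumption
# made for this sketch, e.g. for a dry run).
: "${NAGIOS_VERIFY:=nagios -v /etc/nagios/nagios.cfg}"
: "${NAGIOS_RESTART:=service nagios restart}"

verify_and_restart() {
    # The variables are left unquoted on purpose so that they split
    # into a command and its arguments.
    if $NAGIOS_VERIFY >/dev/null; then
        $NAGIOS_RESTART
    else
        echo "Configuration check failed, Nagios not restarted" >&2
        return 1
    fi
}

# Example (as root):
#   verify_and_restart
```

When the check fails, run nagios -v by hand to see the full error report, since the wrapper discards it.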

Nagios Default Folder Locations

With the default yum installation, Nagios uses the following locations on your hard disk

   * /etc/nagios/ - Nagios configuration folder
   * /var/log/nagios/nagios.log - Nagios log 
   * /usr/share/nagios/ - Nagios docs, sounds, and image folders
   * /usr/lib/nagios/cgi/ - Nagios CGI folder
   * /usr/bin - Nagios binaries
   * /etc/httpd/conf.d/nagios.conf - Nagios Apache configuration file
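
A quick way to confirm that an installation really uses the locations listed above is to test each path. The helper below is a hypothetical sketch, not something shipped with Nagios.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given paths exist.
# Prints one line per path and returns non-zero if any is missing.
check_paths() {
    missing=0
    for p in "$@"; do
        if [ -e "$p" ]; then
            echo "ok       $p"
        else
            echo "MISSING  $p"
            missing=1
        fi
    done
    return $missing
}

# Example, using the locations listed above:
#   check_paths /etc/nagios /var/log/nagios/nagios.log /usr/share/nagios \
#               /usr/lib/nagios/cgi /usr/bin /etc/httpd/conf.d/nagios.conf
```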


Insert the EPICS Nagios Plugins

What you did in the sections above was a generic Nagios installation and setup.

Go here. You will find the Nagios plugin for EPICS _and_ a complete tarball of a Nagios configuration. Download the plugin and save it into

/usr/lib/nagios/plugins/

Make check_caget.sh executable

    root> chmod  +x check_caget.sh

Now verify that it is usable with:

    > ./check_caget.sh --help

Verify a PV using camonitor, e.g. for me giacchinHost:aiExample

    > camonitor giacchinHost:aiExample

Note: Since version 1.3 the plugin assumes that caget is present in /usr/bin. If that is not true at your site, please fix it by making a symbolic link, e.g.: ln -s /opt/epics/base-3.14.9/bin/linux-x86/caget /usr/bin/caget
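
Before creating the link it helps to see where caget actually lives. The following is a rough sketch with a hypothetical function name; it only prints a suggestion and never creates the link itself.

```shell
#!/bin/sh
# Hypothetical helper: check that caget is executable at the path the
# plugin expects, and suggest a symbolic link if it is not.
check_caget_path() {
    expected=${1:-/usr/bin/caget}

    if [ -x "$expected" ]; then
        echo "caget found at $expected"
        return 0
    fi

    # Fall back to whatever caget is on the PATH, if any.
    found=$(command -v caget 2>/dev/null || true)
    if [ -n "$found" ]; then
        echo "caget missing at $expected, try: ln -s $found $expected"
    else
        echo "caget not found, check your EPICS base installation"
    fi
    return 1
}

# Example:
#   check_caget_path            # checks the default /usr/bin/caget
```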


Using the following EPICS variables you can avoid broadcasting on the network; for me the variables were:

    EPICS_CA_AUTO_ADDR_LIST=NO
    EPICS_CA_ADDR_LIST=127.0.0.1
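
If you prefer not to change the environment of your login shell, the two variables can be set for a single command only. A minimal sketch; the wrapper name with_ca_addr_list is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with the Channel Access search
# restricted to an explicit address list, without touching the
# environment of the calling shell.
with_ca_addr_list() {
    addr_list=$1
    shift
    env EPICS_CA_AUTO_ADDR_LIST=NO EPICS_CA_ADDR_LIST="$addr_list" "$@"
}

# Example:
#   with_ca_addr_list 127.0.0.1 caget giacchinHost:aiExample
```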

Therefore I can test the plugin with the following command:

    > ./check_caget_dev_gw.sh -pv giacchinHost:aiExample -H 127.0.0.1
    > STATE_OK: giacchinHost:aiExample 5 2007-11-16 15:23:18.560231  ; te: 0 sec.

If it correctly reports the status of your PV you can continue the installation.
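
Nagios does not parse the text printed by a plugin to decide the state: it uses the exit status, with the STATE_* values defined in the plugin script at the end of this page. When testing by hand it is therefore worth looking at the exit status as well. A small sketch with hypothetical function names:

```shell
#!/bin/sh
# Map a plugin exit status to the Nagios state name, following the
# STATE_* values defined in the plugin script.
state_name() {
    case $1 in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Hypothetical helper: run any plugin command line, let it print its
# output, then print the state that corresponds to its exit status.
run_plugin() {
    rc=0
    "$@" || rc=$?
    echo "exit status $rc -> $(state_name "$rc")"
    return "$rc"
}

# Example:
#   run_plugin ./check_caget_dev_gw.sh -pv giacchinHost:aiExample -H 127.0.0.1
```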

Now install the EPICS logos images

Download the epics.gif image available from the same place

and install that:

   root> mv epics.gif /usr/share/nagios/images/logos/

Save the original Nagios setup and replace it

Go to the /etc folder and save the original setup

    root> tar cvf nagios.or.tar ./nagios/
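
If you expect to repeat this step, a timestamped archive avoids overwriting an earlier backup. The function below is a hypothetical sketch of the same tar call, not part of the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: archive a directory next to itself before it is
# overwritten. The archive name carries a timestamp so that a later run
# does not overwrite an earlier backup (unless both fall in the same second).
backup_dir() {
    src=$1                                # e.g. /etc/nagios
    parent=$(dirname "$src")
    name=$(basename "$src")
    archive="$parent/$name.$(date +%Y%m%d-%H%M%S).tar"

    # -C makes the paths inside the archive relative to the parent
    # directory, matching "tar cvf nagios.or.tar ./nagios/" run from /etc.
    tar -cf "$archive" -C "$parent" "./$name" && echo "$archive"
}

# Example (as root):
#   backup_dir /etc/nagios
```

The function prints the name of the archive it created, so the result can be recorded or checked with tar -tf.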

Download the etc.nagios.tar, available at the same place, into that folder

and restore the nagios folder from it:

    root> tar xvf  ./etc.nagios.tar

Note: Now look through the files in /etc/nagios and adjust them to meet your network setup requirements. You will find an epicsExample.cfg which contains preset PV names; please adjust them to match yours.

NAGIOS check configuration file

For sanity checking, make sure you verify Nagios config files. This can be done like so

    root> nagios -v /etc/nagios/nagios.cfg

The above command will show you any erroneous lines in the Nagios config files.


HTTPD configuration

Check for the presence of the line "include conf.d/*.conf"

in /etc/httpd/conf/httpd.conf

Check the paths in the file /etc/httpd/conf.d/nagios.conf

Create a file named .htaccess in /usr/lib/nagios/cgi-bin/ and in /usr/share/nagios/html/

containing:

   AuthName "Nagios Access"
   AuthType Basic
   AuthUserFile /etc/nagios/passwd
   require valid-user
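
Because the same file goes into two directories, it can be written by a small loop. This is a minimal sketch with a hypothetical function name; it overwrites any .htaccess already present in the target directories.

```shell
#!/bin/sh
# Hypothetical helper: write the same .htaccess into each given directory.
write_htaccess() {
    passwd_file=$1    # file later created with htpasswd
    shift
    for dir in "$@"; do
        cat > "$dir/.htaccess" <<EOF
AuthName "Nagios Access"
AuthType Basic
AuthUserFile $passwd_file
require valid-user
EOF
    done
}

# Example (as root), with the paths used above:
#   write_htaccess /etc/nagios/passwd /usr/lib/nagios/cgi-bin /usr/share/nagios/html
```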

Now create a nagios user with the following command:

    root> htpasswd -c /etc/nagios/passwd nagiosadmin

SELinux setup

For a first test, set SELinux to permissive mode with

   root> system-config-securitylevel
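
On a box without a graphical session the mode can also be inspected from the command line. The sketch below assumes the standard getenforce and setenforce utilities are installed; the function name selinux_mode is hypothetical.

```shell
#!/bin/sh
# Print the current SELinux mode (Enforcing, Permissive or Disabled),
# or "unavailable" when the SELinux tools are not installed.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unavailable"
    fi
}

# To switch to permissive mode until the next reboot (as root):
#   setenforce 0
```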

NAGIOS as a Linux service

At this point of the basic Nagios configuration, restarting Nagios should succeed.

Restart your Apache service together with your Nagios service like so

    root> service httpd restart
    root> service nagios stop
    root> service nagios start
    root> service nagios status

Open your favorite web browser at http://localhost/nagios/

Log in as "nagiosadmin", give your password and enjoy!

NB. If you are using my etc.nagios.tar, the password to log in is "nagiosadmin"

See my nagios screen shots in action:

Nagios Service Details


NagiosServicesDetail.jpg


Nagios Alert Histogram


NagiosAlertHistogram.jpg


Nagios Status Map

NagiosStatusMap.jpg



Conclusions

There are a lot of other interesting features that come for free with Nagios; looking around you should find many yourself. There is a cool Firefox plugin, https://addons.mozilla.org/it/firefox/addon/3607, which lets you continuously monitor the PVs during regular use of the browser.

At this time Ralph Lange has carried out a test of NAL at Bessy. Many thanks to him: he has supported me since the idea of using Nagios was born in my mind. Thanks also to Maurizio Montis, who made a kickstart script to deploy a RHEL5 box equipped with Nagios ready to use, and who adjusted and fixed the old notes on FC7 for the new OS, RHEL5.

More information about NAL can be found here. A special LivEPICS version (a fully equipped EPICS Linux Live CD) with Nagios preset and ready to use is available here .

Thank you for your attention! Please give me your feedback, and feel free to drop me an email; I'll be happy to continue working on this idea if someone is interested in using it.


Mauro Giacchini (INFN-LNL)

--MauroGiacchini 15.54, 2 Dec 2011



The Plugin Script

/usr/lib/nagios/plugins/check_caget_dev_gw.sh script for Nagios


#!/bin/sh
#
#####################################################################################
#####################################################################################
##                           Nagios plugin to check EPICS PV Status                ##
#####################################################################################
#####################################################################################
#
# Script to retrieve EPICS PV Name status using the "caget" command.
# Written by Mauro Giacchini ([email protected])
# Last Modified: 17-11-2007
#
# Usage: ./check_caget.sh -pv <PV name>
#
# Description:
#   	This script uses caget command to retrieve the PV status. 
#
# Limitations:
# 	This script has been tested on Linux Fedora Core 6.
#
# Output:
# 	The output contains "te", the time elapsed, calculated as the difference between
# the PV's timestamp and the Linux "date" command (suggestion: use a common ntp server
# for the IOCs and the Nagios server box). The STATUS of the service (i.e. of the PV)
# follows the severity rules:
#
# Severity (none) >>>> STATE_OK		# OK = green
#
# Severity MINOR  >>>> STATE_WARNING	# WARNING = yellow
#
# Severity MAJOR  >>>> STATE_CRITICAL	# CRITICAL = red
#
# PV not found    >>>> STATE_UNKNOWN	# UNKNOWN = orange
#
# In case of Severity (none) it shows the stdout of "caget -a" with "te" appended.
#
# Other notes:
#  Firefox Plugin : A Firefox extension is available to monitor the Nagios server.
#  https://addons.mozilla.org/it/firefox/addon/3607
#
# Nagios configuration setup: 
# 	You need to add the command to commands.cfg
# 
# define command{
# 	command_name	check_caget
# 	command_line	$USER1$/check_caget.sh -pv $ARG1$
# 	}
#
#	And, you need to add the service to services.cfg
#
# define service{
#        use         		generic-service	;
#        host_name		IOC_Example	;
#        service_description   	aiExample	;
#        is_volatile           	0		;
#        check_period		24x7		;
#        max_check_attempts    	3		;
#        normal_check_interval 	3		;
#        retry_check_interval  	1		;
#        contact_groups        	admins		;
#        notification_interval 	120		;
#        notification_period   	24x7		;
#        notification_options  	w,u,c,r		;
#        check_command         	check_caget!rootHost:aiExample	;
#        }
#
# then place this script in the /usr/lib/nagios/plugins/ on the Nagios box server.
# Don't forget to set the right execution permission to this file.
#
# Threshold and ranges: please, have a look at:
# http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
#
# Last: This script still needs debugging and fixups (exercise for reader) :-)
#
#####################################################################################
# DEBUGGING OPTION
# This option determines whether or not debugging messages are shown 
# Values: 0=debugging off, 1=debugging on

DEBUG="0"


#####################################################################################
# CAGET LOCATION
# This option determines where the caget executable is located.
# The default /usr/bin/caget should be made with a symbolic link
# made by root (i.e.): ln -s /opt/epics/base-3.14.9/bin/linux-x86/caget /usr/bin/caget


CAGET_LOCATION=/usr/bin/caget


#####################################################################################
# Script exit status

STATE_OK=0		# OK = green

STATE_WARNING=1		# WARNING = yellow

STATE_CRITICAL=2	# CRITICAL = red

STATE_UNKNOWN=3  	# UNKNOWN = orange

VERSION="v1.3"

#####################################################################################
# print_revision() function

print_revision (){
    
    echo "Check_caget (nagios-plugins 1.4 to nagios 2.9) (EPICS base 3.14.9) $VERSION"
}

#####################################################################################
# print_usage() function

print_usage() {

    	echo ""
    	echo "Usage: check_caget_dev_gw -pv <PV name> "
    	echo "Usage: check_caget_dev_gw -pv <PV name> -H <EPICS_CA_ADDR_LIST>"
    	echo "Usage: check_caget_dev_gw -pv <PV name> -p <EPICS_CA_SERVER_PORT>"
    	echo "Usage: check_caget_dev_gw -pv <PV name> -expval <EXPECTED VALUE>"
    	echo "Usage: check_caget_dev_gw [-h] [--help]"
    	echo "Usage: check_caget_dev_gw [-V]"
    	echo ""
}

#####################################################################################
# print_help() function

print_help() {
	echo ""
	print_usage
	echo ""
	echo "Script to retrieve the PV status for EPICS control systems."
	echo ""
	echo "This plugin is not developed by the Nagios Plugin group."
	echo "Please do not e-mail them for support on this plugin, since"
	echo "they won't know what you're talking about :P"
	echo ""
	echo "For contact info: [email protected]"
	echo "Download : http://www.lnl.infn.it/~epics/"
	echo ""
}

#####################################################################################
# Check the caget presence.


verify_caget_presence() {


if ! type "$CAGET_LOCATION" >/dev/null 2>&1; then

	echo "STATUS CRITICAL: caget not found (did you set up the right Nagios USERn macro, or is caget missing?)"
	exit $STATE_CRITICAL
fi
}


#####################################################################################
# Control caget plugin input parameters

EXPVAL=""
EPICS_CA_ADDR_LIST="" 	# Default YES
EPICS_CA_SERVER_PORT="" # Default 5064 _and_  	value > 5000
EPICS_CA_SERVER_PORT_MIN="5000"

while test -n "$1"; do

    case "$1" in

	--help)
	print_help
	exit $STATE_OK
	;;

	-h)
	print_help
	exit $STATE_OK
	;;

	-V)
	print_revision
	exit $STATE_OK
	;;

	-pv)
	PVNAME=$2
	shift
	;;

	-expval)
	EXPVAL=$2
	if [ -z "$EXPVAL" ]; then
  	  	echo "STATUS CRITICAL: Expected value absent"
   	 	exit $STATE_CRITICAL
	fi
	shift
	;;

	-H)
	EPICS_CA_ADDR_LIST=$2
	if [ -z "$EPICS_CA_ADDR_LIST" ]; then
  	  	echo "STATUS CRITICAL: Expected EPICS_CA_ADDR_LIST absent"
   	 	exit $STATE_CRITICAL
	fi
	export EPICS_CA_ADDR_LIST
	EPICS_CA_AUTO_ADDR_LIST="NO"
	export EPICS_CA_AUTO_ADDR_LIST
	shift
	;;

	-p)
	EPICS_CA_SERVER_PORT=$2
	if [ -z "$EPICS_CA_SERVER_PORT" ]; then
  	  	echo "STATUS CRITICAL: Expected EPICS_CA_SERVER_PORT absent"
   	 	exit $STATE_CRITICAL
	fi
	if [ "$EPICS_CA_SERVER_PORT" -le "$EPICS_CA_SERVER_PORT_MIN" ]; then
  	  	echo "STATUS CRITICAL: EPICS_CA_SERVER_PORT lower than the minimum allowed (5001)"
   	 	exit $STATE_CRITICAL
	fi
	export EPICS_CA_SERVER_PORT
	shift
	;;

	*)
	echo ""
	echo "Unknown argument: $1"
	print_usage
	exit $STATE_UNKNOWN
	;;
 
esac
shift
done


verify_caget_presence

if [ -z "$PVNAME" ]; then

    echo "STATUS CRITICAL: PV Name not specified"
    exit $STATE_CRITICAL
fi

#####################################################################################
# FINALLY... RETRIEVING THE VALUES (caget)


#CAGET_REPLY=`caget -a $PVNAME`
CAGET_REPLY=$("$CAGET_LOCATION" -a "$PVNAME")

IFS=" "
read pvname date time value status severity<<END
$CAGET_REPLY
END

if [ -z "$pvname" ]; then

    echo "STATE_UNKNOWN: $PVNAME not found"
    exit $STATE_UNKNOWN
fi

#####################################################################################
# Compute the difference between the PV timestamp and the current time

dte1=$(date --date "$date $time" +%s)
dte2=$(date +%s)
diffSec=$((dte2 - dte1))
# te is the absolute value of the difference, in seconds
if [ "$diffSec" -lt 0 ]; then te=$((-diffSec)); else te=$diffSec; fi
#    	echo "Time elapsed (sec.): $te"

#####################################################################################
# Output the NAGIOS status using an expected value

if [ -n "$EXPVAL" ]; then

	# Compare as integers when possible, otherwise as plain strings
	if [ "$value" -eq "$EXPVAL" ] 2>/dev/null || [ "$value" = "$EXPVAL" ]; then
		echo "STATE_OK: Expected value ($EXPVAL) to $pvname match; te: $te sec."
		exit $STATE_OK
	else
		echo "STATUS CRITICAL: Expected value ($EXPVAL) to $pvname didn't match"
		exit $STATE_CRITICAL
	fi
fi

#####################################################################################
# Output the NAGIOS status using the Severity field
case $severity in

	MAJOR)
    	echo "STATUS CRITICAL: $pvname in MAJOR severity status; te: $te sec."
    	exit $STATE_CRITICAL
	;;

	MINOR)
    	echo "STATE_WARNING: $pvname in MINOR severity status; te: $te sec."
    	exit $STATE_WARNING
	;;

	*)
	echo "STATE_OK: $pvname $value $date $time $status ; te: $te sec."
    	exit $STATE_OK
	;;
esac
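
The parsing and severity mapping at the heart of the script can be exercised without a running IOC by feeding it a line in the field order the script expects from "caget -a". A minimal sketch: the function names are hypothetical, and the sample lines in the example are made up for illustration rather than captured from a real IOC.

```shell
#!/bin/sh
# Map an EPICS alarm severity to a Nagios exit status, using the same
# rules as the case statement at the end of the plugin.
severity_to_state() {
    case $1 in
        MAJOR) echo 2 ;;   # STATE_CRITICAL
        MINOR) echo 1 ;;   # STATE_WARNING
        *)     echo 0 ;;   # STATE_OK (everything else, as in the plugin)
    esac
}

# Split one line into the fields used by the plugin's read statement
# (name, date, time, value, status, severity) and print the state.
parse_caget_line() {
    read -r pvname date time value status severity <<EOF
$1
EOF
    severity_to_state "$severity"
}

# Example (made-up input lines):
#   parse_caget_line 'demo:ai 2007-11-16 15:23:18.560231 5 HIGH MINOR'
#   parse_caget_line 'demo:ai 2007-11-16 15:23:18.560231 5'
```

Note that, as in the plugin, any severity other than MAJOR or MINOR falls through to the OK state; sites that want a different rule would have to add their own case branch.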