Nagios and cluster setup...few questions
Paul Weaver
paul.weaver at bbc.co.uk
Tue Oct 9 15:49:27 CEST 2007
Lacking the time and knowledge to understand monitoring clusters, I
concocted a home-brew web page that has the concept of virtual
services and virtual service groups. It's all configured in an XML file.
I have a virtual service called "clipcache 01", one called "clipcache
02", and so on. Each service has two checks pulled from the nagios
status file, using hostname and service description. If both are OK,
then the virtual service is OK; otherwise it's critical.
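A minimal sketch of the virtual-service rule just described, written in Python rather than the Perl the page itself uses. The function name and the sample host and service names are hypothetical; the state codes follow the usual Nagios convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Hypothetical sketch: a virtual service is OK only if every underlying
# check is OK; anything else makes it CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios service state codes


def virtual_service_status(check_states):
    """check_states: iterable of states of the underlying Nagios checks."""
    return OK if all(state == OK for state in check_states) else CRITICAL


# Illustrative data keyed by (hostname, service description), as it might
# be read from the Nagios status file. These names are made up.
states = {
    ("clipcache01", "HTTP"): OK,
    ("clipcache01", "Disk"): WARNING,
}
clipcache_01 = virtual_service_status(states.values())
```

Here one of the two checks is in WARNING, so the virtual service as a whole comes out CRITICAL.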
I then have a group called "Clipcaches", which looks at the number of
virtual services running, and is critical if none are running, warning
if 1 is running, or OK if 3 or more are running.
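The group rule could be sketched as a simple count-and-threshold function. This is Python for illustration only, and `group_status` and `ok_at` are hypothetical names. Note one assumption: the description covers 0, 1 and "3 or more" running, so treating exactly 2 as a warning is a guess.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios service state codes


def group_status(member_states, ok_at=3):
    """Map the number of OK members to a state for the whole group.

    Assumption: any count from 1 up to ok_at - 1 is a WARNING. The
    original description does not say what happens at exactly 2.
    """
    running = sum(1 for state in member_states if state == OK)
    if running == 0:
        return CRITICAL
    if running < ok_at:
        return WARNING
    return OK
```

For example, `group_status([OK, CRITICAL, CRITICAL])` gives WARNING, and three OK members give OK.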
A group called "live system" monitors the clipcaches group, amongst
others, and exports its status back up. It also looks for scheduled
downtime and acknowledgements.
Another group that's a member of "live system" is "midtier". It
consists of the virtual service "Main midtier" (which monitors a
certain process on one of 3 machines; it must be on one, and one only,
to be OK), "Search instance" (which is OK if 2 instances are found on
one of 4 machines, warning if 1, and critical if 0), and a few other
checks.
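The two midtier rules might be sketched like this, again in Python as an illustration. The function names are hypothetical. Two assumptions are made: the search rule is applied to the total number of instances found across the machines, and more than 2 instances is treated as OK, neither of which the description spells out.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios service state codes


def exactly_one_status(found_on_machine):
    """found_on_machine: one boolean per machine, True if the process runs there.

    OK only when the process is found on exactly one machine.
    """
    count = sum(1 for found in found_on_machine if found)
    return OK if count == 1 else CRITICAL


def search_instance_status(instances_found):
    """instances_found: total instances seen (assumed summed over machines)."""
    if instances_found == 0:
        return CRITICAL
    if instances_found == 1:
        return WARNING
    return OK  # assumption: 2 or more counts as OK
```

With these, a process found on two of the three machines is CRITICAL, just as if it were found on none.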
The program then displays this as a tree on a webpage, expanding
branches with problems. It gives a quick, comforting overview of the
whole system, while nagios' "Service Problems" page gives a list of
things to fix (which might not be of immediate importance to the overall
health of the system, but need fixing anyway).
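The "expand only the branches with problems" idea can be sketched as a small recursive renderer. This Python version prints indented text as a stand-in for the web page; the `(name, state, children)` tuple layout and the sample states are assumptions made for the example.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios service state codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def render(node, depth=0):
    """node: (name, state, children). Returns a list of text lines.

    Children are listed only when the node itself is not OK, so healthy
    branches collapse to a single line.
    """
    name, state, children = node
    lines = ["  " * depth + f"{name}: {LABELS[state]}"]
    if state != OK:
        for child in children:
            lines.extend(render(child, depth + 1))
    return lines


# Made-up example states for the groups mentioned in this post.
tree = ("live system", WARNING, [
    ("Clipcaches", WARNING, [
        ("clipcache 01", OK, []),
        ("clipcache 02", CRITICAL, []),
    ]),
    ("midtier", OK, [
        ("Main midtier", OK, []),
    ]),
])
print("\n".join(render(tree)))
```

In this example the "Clipcaches" branch is expanded because it is in WARNING, while "midtier" stays collapsed to one line.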
It does it using Perl's "Nagios::StatusLog" module.
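For readers without Perl to hand, pulling per-service states out of the status file might look roughly like the Python sketch below. It does not use or mimic the Nagios::StatusLog API. It assumes the status file is a series of `name {` ... `}` blocks containing `key=value` lines, with service blocks named `service` or `servicestatus` depending on the Nagios version; check this against your own status file before relying on it.

```python
import re

# Matches a block opener such as "servicestatus {".
BLOCK_START = re.compile(r"^\s*(\w+)\s*\{\s*$")
SERVICE_BLOCKS = ("service", "servicestatus")  # assumed block names
REQUIRED = {"host_name", "service_description", "current_state"}


def parse_service_states(text):
    """Return {(host_name, service_description): current_state as int}."""
    states = {}
    block_type, fields = None, {}
    for line in text.splitlines():
        stripped = line.strip()
        start = BLOCK_START.match(line)
        if start:
            block_type, fields = start.group(1), {}
        elif stripped == "}" and block_type is not None:
            # Keep only complete service blocks; skip everything else.
            if block_type in SERVICE_BLOCKS and REQUIRED <= fields.keys():
                key = (fields["host_name"], fields["service_description"])
                states[key] = int(fields["current_state"])
            block_type = None
        elif block_type is not None and "=" in stripped:
            name, _, value = stripped.partition("=")
            fields[name] = value
    return states
```

The resulting dictionary is keyed by hostname and service description, which is the same pair used above to pick out the checks behind each virtual service.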
No idea how well it scales, and I'm sure there's a better way of doing
it. It's definitely a work in progress, and has made me think a lot more
about defining system health.
--
Paul Weaver
Systems Development Engineer
News Production Facilities, BBC News
Work: 020 822 58109
Room 1244 Television Centre,
Wood Lane, London, W12 7RJ
> -----Original Message-----
> From: nagios-users-bounces at lists.sourceforge.net
> [mailto:nagios-users-bounces at lists.sourceforge.net] On Behalf
> Of Tarak Patel
> Sent: 09 October 2007 14:32
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] Nagios and cluster setup...few questions
>
>
> Hi all,
>
> Here is a quick background of my current setup for monitoring:
>
> I have an in-house tool monitoring clusters. The tool simply uses ssh to
> launch perl scripts on remote machines, grabs all of the output, and
> stores it in a logfile at a central location. This output is parsed for
> any pre-defined tags (WARNING/CRITICAL/ERROR). If any of these tags are
> noticed, the message is logged using syslog. The scripts residing on the
> remote hosts are a collection of perl functions, executed one after
> another. Some of these functions use a status file from the previous run
> to verify whether the state of items has changed since last time. Some of
> these functions can be given a special argument to set the current state
> as the default state for the next iteration of checks.
>
> Clusters are monitored from the head nodes, since not all nodes are
> accessible from the central location. Head node checks contain a special
> function that simply uses DSH to launch checks on all nodes.
>
> After looking at nagios and its check_cluster plugins I realized I would
> really like to monitor each of the nodes individually, since I want to be
> able to disable a particular check on a particular node. I also want to
> be able to use status files for some of the checks. As of now I have yet
> to find any plugin that uses a status file to monitor hosts. All
> plugins simply use current output from commands to verify the status.
>
> I will be using active checks on the clusters, so I will configure
> nrpe on all nodes. My plan of attack was to simply use the head node as a
> gateway, with all nodes and services defined on the head node
> (under nrpe). From the central location I can simply execute a check_nrpe
> type script to verify backend nodes.
>
> I still haven't figured out how I can use status files from each
> iteration of checks to validate status. I'd appreciate some input on
> the best options for monitoring clusters where backend nodes are
> hidden from the central monitoring server, and some help with the use of
> state files.
>
> Thanks all,
>
> TP.
>
>
> --------------------------------------------------------------
> -----------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems? Stop.
> Now Search log events and configuration files using AJAX and
> a browser. Download your FREE copy of Splunk now >>
> http://get.splunk.com/ _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS
> when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>