[Fwd: Nagios Performance Monitoring]
Hendrik Bäcker
andurin at process-zero.de
Tue Oct 23 18:00:48 CEST 2007
Hi Ethan,
please disregard the graph PDF I sent you yesterday!
I made a mistake in handling the data in my Perl script.
I ran nagiostats with the MRTG data variables in the order 1,2,3,4 but
stored them internally in an unsorted hash, so the requested data ended
up matched against an unsorted list of graph names.
A few minutes ago I fixed my mistake and deleted all the graphs.
Tomorrow I can send you the exact data.
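The ordering problem described above can be avoided by pairing each value with the variable requested at the same position, instead of going through an unordered container. The following is only a minimal sketch, not the actual Perl script: the function name `label_mrtg_values` is hypothetical, the variable names are merely examples of nagiostats MRTG variables, and it assumes nagiostats prints one value per line in the order the variables were requested.

```python
def label_mrtg_values(requested_vars, output_text):
    """Pair each non-empty output line with the variable requested at the same position.

    requested_vars: variable names in the exact order passed to nagiostats.
    output_text:    raw MRTG-style output, assumed to hold one value per line.
    """
    values = [line.strip() for line in output_text.splitlines() if line.strip()]
    if len(values) != len(requested_vars):
        raise ValueError(
            "expected %d values, got %d" % (len(requested_vars), len(values))
        )
    # zip() keeps the request order, so names and values cannot drift apart.
    return list(zip(requested_vars, values))


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    names = ["AVGACTSVCLAT", "AVGACTHSTLAT"]
    for name, value in label_mrtg_values(names, "120\n45\n"):
        print(name, value)
```

Returning a list of pairs rather than a dictionary keeps the request order explicit, so the graph names can be generated from the same list.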
-
Hendrik
-------- Original Message --------
Subject: Nagios Performance Monitoring
Date: Mon, 22 Oct 2007 20:14:28 +0200
From: Hendrik Bäcker <andurin at process-zero.de>
To: Ethan Galstad <nagios at nagios.org>
Hi Ethan,
*** BEWARE ***
*** TONS OF INFORMATION IN HERE ***
*** OK - you have been warned :) ***
As mentioned in my last e-mail and as we discussed at the conference, I
have made some graphs of the performance.
Preface: if you think the nagios-devel list might help here too, I will
try to post a condensed version of this information there.
First, a few words about my (scary) setup.
Because of the magic limit of ~2000 service checks in Nagios 2.x, I had
to compile four different Nagios instances, all running on the same
hardware server.
So I have one
/usr/local/nagios/etc/
for common files such as ressource.cfg, checkcommands, misccommands, ...
and four main directory trees like
/usr/local/nagios/_1_/bin/
/usr/local/nagios/_1_/etc/
/usr/local/nagios/_1_/var/
etc.
/usr/local/nagios/_2_/bin/
/usr/local/nagios/_2_/etc/
/usr/local/nagios/_2_/var/
etc.
To be able to tell my instances apart I renamed the nagios binary to
nagios-1, nagios-2, nagios-3 and so on.
(Yes - you're right. I have four different web interfaces ;) )
I am running a fifth instance to monitor the other four instances, so
my 5th instance has just 3 hosts (Nagios_Master, Nagios_Slave, my
dedicated Internet server) and <100 service checks.
I am not using NSCA or anything similar to feed my failover
server "Nagios_Slave".
Currently I am _not_ using NDOUtils.
All of my instances process performance data, but only the
services that actually deliver perfdata have the service option
process_perf_data set to "1".
All five instances do nearly the same thing with perfdata:
1. Nagios writes the perfdata to a file
( /usr/local/nagios/_instance_/var/perfdata )
2. Nagios moves that file every 30 seconds to a shared location
( /usr/local/nagios/var/perfspool/ )
3. That is the end of perfdata handling for the Nagios daemon.
4. A standalone C daemon picks up the files from the spool directory and
feeds them to a Perl script that creates and updates the RRD files ( this
shouldn't affect the Nagios processes, I think )
So each Nagios instance just writes to a file handle, moves the inode
every 30 seconds and opens a new file handle.
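The write, move and re-open cycle described above can be sketched roughly as follows. This is an illustration of the idea only, not how Nagios implements it internally: the function name `rotate_perfdata`, the `tag` parameter and the timestamped file naming are assumptions, and `os.rename` only behaves like an inode move when the perfdata file and the spool directory are on the same filesystem.

```python
import os
import time


def rotate_perfdata(perfdata_path, spool_dir, handle, tag, now=None):
    """Close the current handle, move the file into spool_dir, and re-open.

    tag is a hypothetical per-instance label that keeps files from
    different instances apart in the shared spool directory.
    Returns (new_handle, spooled_path); spooled_path is None if there
    was nothing to move.
    """
    handle.close()
    spooled = None
    if os.path.exists(perfdata_path) and os.path.getsize(perfdata_path) > 0:
        stamp = int(time.time() if now is None else now)
        spooled = os.path.join(spool_dir, "%s.%d" % (tag, stamp))
        # On the same filesystem this only relinks the inode; no data is copied.
        os.rename(perfdata_path, spooled)
    # Append mode creates a fresh, empty file if the old one was moved away.
    return open(perfdata_path, "a"), spooled
```

A caller would invoke this every 30 seconds and keep writing to the returned handle, while a separate process drains the spool directory.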
Today I wrote a small Perl script that runs every 60 seconds, calls
each of the nagiostats binaries and extracts some data for charting.
See the attached PDF.
I think only the values since 16:40 are interesting (the last time I
restarted all 5 processes).
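As a rough illustration of polling every instance, the command lines could be built from the directory layout described earlier. This is a sketch under assumptions, not the Perl script itself: the helper name `nagiostats_command` is hypothetical, and it assumes that each instance has a `nagiostats` binary in its bin/ directory and a main config file named `nagios.cfg` in its etc/ directory.

```python
def nagiostats_command(instance, variables, base="/usr/local/nagios"):
    """Build the argument list for one instance's nagiostats call.

    instance:  instance number, mapped to the /_N_/ directory layout.
    variables: MRTG variable names, kept in the order given.
    """
    root = "%s/_%s_" % (base, instance)
    return [
        root + "/bin/nagiostats",              # assumed binary location
        "--config=" + root + "/etc/nagios.cfg",  # assumed config file name
        "--mrtg",
        "--data=" + ",".join(variables),
    ]


if __name__ == "__main__":
    # Print the command for each of the five instances.
    for n in range(1, 6):
        print(" ".join(nagiostats_command(n, ["AVGACTSVCLAT", "AVGACTHSTLAT"])))
```

Each list can be handed to a process runner as-is; keeping `variables` as an ordered list means the output can later be matched back to the names by position.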
What you see there:
Graph titles:
Nagios_X_Performance_AHCL is the "Hostcheck Last" value from the
nagiostats output.
Nagios_X_Performance_SHCL is the "Servicecheck Last" value from the
nagiostats output.
At the bottom of pages 4 and 5 you can see the average service check
latency for all the instances.
The data are close to the truth, because the 4th instance is my "Nagios"
instance with only the 3 hosts and almost no latency.
Interesting:
Please have a look at the left side of the graph "check_latency_5".
There you can see the big latency (max. 992 seconds) from Friday until
today.
The latency just keeps climbing without end, as you can see at the
current time.
Some information about the host/service count:
Instance 1:
Total Hosts: 371
Total Services: 2156
Instance 2:
Total Hosts: 206
Total Services: 1405
Instance 3:
Total Hosts: 381
Total Services: 3144
Instance 4:
Total Hosts: 3
Total Services: 44
Instance 5:
Total Hosts: 299
Total Services: 3247
As you can see, instances 3 and 5 are my problem children, each with over
3000 active service checks, and instance 5 is the "main Nagios" for our
company.
Here is a typical host definition (from objects.cache):
define host {
host_name AIX100E001
alias xxx
address xxx
parents xxx,xxx (yes - two parent hosts!)
check_command check-host-alive (this is a check_ping)
contact_groups two,groups
notification_period 24x7
initial_state o
check_interval 60.000000 (my interval_length=1)
retry_interval 15.000000 (my interval_length=1)
max_check_attempts 2
active_checks_enabled 1
passive_checks_enabled 1
obsess_over_host 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options o,d,u
freshness_threshold 0
check_freshness 0
notification_options d,u,r,f
notifications_enabled 1
notification_interval 3600.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
failure_prediction_enabled 1
retain_status_information 1
retain_nonstatus_information 1
}
cached_host_check_horizon=30
cached_service_check_horizon=30
So, I think that was the first big wave of information.
Currently I have plenty of time for testing and debugging on that
installation. I hope you have some ideas for debugging; I can do anything
you want (short of giving you shell access :) )
Regards,
Hendrik