Distributed Monitoring Central Server no status changes
Marc Powell
marc at ena.com
Wed Feb 25 20:58:20 CET 2009
Hi Paul,
Please always respond on list so that others now, and in the future,
can learn from your experience and so that you can benefit from the
experience of others on the list. More below...
On Feb 25, 2009, at 12:54 PM, Paul Landauer wrote:
> On Wed, 2009-02-25 at 12:06 -0600, Marc Powell wrote:
> I'm using 2 servers following the documentation at
> http://nagios.sourceforge.net/docs/3_0/distributed.html
Thanks.
>> - example host and service definitions from both servers (complete
>> definitions please)
> Definitions are the same on both servers.
> Example host definition:
> define host{
> use generic-host
> host_name surf
> alias Surf Control
> address ip_address_of_surf_is_here
> max_check_attempts 5
> check_command check-host-alive
> check_interval 5
> retry_interval 1
> check_period 24x7
> contact_groups admins
> notification_interval 30
> notification_period 24x7
> notification_options d,u,r
> }
>
> Example Service Definitions (surf is a member of
> sunrise_windows_servers):
> define service{
> use generic-service
> hostgroup_name sunrise_windows_servers
> service_description NSClient++ Version
> check_command check_nt!CLIENTVERSION
> }
For future reference, these are not 'complete' since you use
templates. There's a lot of important information within those
templates that's needed when troubleshooting as well. I expect that
the definitions are indeed different between the servers once you take
the templates into account; otherwise your central server would be
doing active checks of the services in addition to receiving the
passive checks, overwriting their results. (I don't think this is the
problem.)
>> - related nagios.log information from both servers
> I included excerpts that I thought applied. If you'd like the whole
> log, let me know.
> Nagios.log for Distributed server:
> [1235575724] SERVICE ALERT: surf;Explorer;CRITICAL;HARD;3;Explorer.exe: not running
> [1235575724] SERVICE NOTIFICATION: nagiosadmin;surf;Explorer;CRITICAL;notify-service-by-email;Explorer.exe: not running
>
> Nagios.log for Central Server:
> [1235575777] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;surf;Explorer;0;Explorer.exe: not running
> [1235575778] PASSIVE SERVICE CHECK: surf;Explorer;0;Explorer.exe: not running
This is interesting and useful. As you can see, on your distributed
server the state is CRITICAL (the 3 in that line is the check attempt
number; CRITICAL corresponds to return code 2), but by the time NSCA
dumps it into the command pipe on the central server, it has been
translated to 0 (OK) somewhere along the way. This could be because
nagios isn't passing the correct status code to your submission
script, because your submission script isn't interpreting it or
passing it to send_nsca correctly, or because nsca on the receiving
side isn't reading it correctly.
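To make the comparison between the two logs easier, here is a rough
sketch that splits a SERVICE ALERT line into its fields (host, service,
state, state type, attempt, plugin output). The function name
alert_fields is made up for illustration; it is not part of Nagios.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): show the fields of a
# "SERVICE ALERT" line from nagios.log.
alert_fields() {
    # Drop the timestamp and the "SERVICE ALERT: " prefix.
    rest=${1#*"SERVICE ALERT: "}
    # Fields are separated by ';': host, service, state, state type,
    # attempt number, plugin output.
    IFS=';' read -r host svc state type attempt output <<EOF
$rest
EOF
    printf 'host=%s service=%s state=%s type=%s attempt=%s\n' \
        "$host" "$svc" "$state" "$type" "$attempt"
}
```

Feeding it the alert line from the distributed server shows the state
as CRITICAL and the attempt as 3, which can then be compared with the
return code field in the PROCESS_SERVICE_CHECK_RESULT line on the
central server.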
>> - the contents of your check result submission script if it's not
>> exactly like the documented one.
> printfcmd="/usr/bin/printf"
>
> NscaBin="/usr/bin/send_nsca"
> NscaCfg="/etc/nagios/send_nsca.cfg"
> NagiosHost="I_have_the_ip_address_of_my_central_server_here"
>
> # Fire the data off to the NSCA daemon using the send_nsca script
> $printfcmd "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4" | $NscaBin -H $NagiosHost -p 5721 -c $NscaCfg
To say whether this is correct or not, I'd have to see your OCSP
command definition. If you're passing the $SERVICESTATE$ macro, then
this is broken: send_nsca expects a numeric state code, but
$SERVICESTATE$ provides the state as text (OK, CRITICAL, etc.).
Normally the submission script translates that to the proper numeric
code first, but you can also use the $SERVICESTATEID$ macro instead to
get the numeric code directly. My bet is on this being the problem.
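As a rough sketch of the translation step (the function names
state_to_code and build_nsca_line are made up for illustration and are
not taken from the documented script), something along these lines
would accept either the state name or the numeric code as the third
argument:

```shell
#!/bin/sh
# Hypothetical helper: map a Nagios service state name to its numeric
# return code. Numeric input passes through unchanged, so either
# $SERVICESTATE$ or $SERVICESTATEID$ can be used as the third argument.
state_to_code() {
    case "$1" in
        OK|0)       echo 0 ;;
        WARNING|1)  echo 1 ;;
        CRITICAL|2) echo 2 ;;
        UNKNOWN|3)  echo 3 ;;
        *)          echo 3 ;;   # anything unrecognised is reported as UNKNOWN
    esac
}

# Hypothetical helper: build the tab-separated line that send_nsca reads
# on stdin: host, service description, return code, plugin output.
build_nsca_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$(state_to_code "$3")" "$4"
}
```

In the submission script, the printf call would then be replaced by
build_nsca_line "$1" "$2" "$3" "$4", piped into $NscaBin with the same
options you already use.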
>> Running nagios and/or NSCA in debug mode on the central server might
>> provide additional information.
> Let me know if you still want this to be done.
Running NSCA in debug to see if it's receiving the 0 status code from
the distributed machine would further narrow down the source of the
problem.
--
Marc