Passive host down result is interpreted as up on master
Ethan Galstad
nagios at nagios.org
Mon Mar 19 20:28:44 CET 2007
Ton Voon wrote:
> Hi!
>
> On 16 Mar 2007, at 18:02, Ton Voon wrote:
>
>> I was wondering if anyone has seen this before. On a slave, we have a
>> host that is marked as DOWN with a plugin output of "CRITICAL - Plugin
>> timed out after 10 seconds", as expected. However, on the master, that
>> host is marked as UP with the same text.
>>
>>
>> The logs on the master server show:
>>
>> [1174045717] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;host1;0;PING
>> OK - Packet loss = 0%, RTA = 0.37 ms|
>>
>> Host is marked as UP. Later on:
>>
>> [1174045949] EXTERNAL COMMAND:
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10
>> seconds|
>>
>> Failure arrives.
>>
>> [1174045949] HOST ALERT: host1;DOWN;HARD;1;CRITICAL - Plugin timed out
>> after 10 seconds
>>
>> The host is marked as DOWN, with an alert, as expected.
>>
>> [1174045951] Warning: The results of service '/ - partition' on host
>> 'host1' are stale by 24 seconds (threshold=82 seconds). I'm forcing
>> an immediate check of the service.
>> [1174045953] SERVICE ALERT: host1;/ -
>> partition;UNKNOWN;HARD;1;UNKNOWN: Service results are stale
>> [1174045959] EXTERNAL COMMAND:
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10
>> seconds|
>>
>> More passive results
>>
>> [1174045971] EXTERNAL COMMAND:
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10
>> seconds|
>>
>> And again, but this time...
>>
>> [1174045973] HOST ALERT: host1;UP;HARD;1;CRITICAL - Plugin timed out
>> after 10 seconds
>>
>> Nagios has marked the host as UP, even though the
>> PROCESS_HOST_CHECK_RESULT reported DOWN (return code 1).
>>
>>
>> The complete nagios.log around this period is attached. I'm at a loss
>> to understand why this has happened. Has anyone got any clues, or seen
>> something similar?
>>
>> We haven't been able to reproduce this consistently yet.
>>
>> This is on Nagios 2.5 (with some local patches).
>
>
> We think we've found the root problem.
>
> In checks.c, if a host does not have a check_command, there is a debug
> line that says: "No host check command specified, so no check will be
> done (host state assumed to be unchanged)". However, it then returns
> HOST_UP. We have amended this to return hst->current_state instead.
>
> In our distributed setup, we define a host without a check_command,
> instead relying on the passive host results sent by the slave. However,
> on the master, if a service on this host passes its freshness threshold,
> a host check is scheduled, with the force flag. This then gets to this
> portion of the code and returns a HOST_UP state rather than the current
> state, thus showing an incorrect state for the host.
>
> Our patch is below, made against Nagios 2.8.
>
> Ton
>
Good catch! I'll get this into CVS pronto.
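
For anyone reading this in the archive, the change described above can be
illustrated with a minimal, self-contained sketch. This is not the actual
checks.c code or the patch itself: the struct host_sketch, the stub
run_active_check_stub() and both function names are hypothetical and exist
only to show the difference between returning HOST_UP and returning the
host's current state when no check_command is defined. The state codes
(0 = UP, 1 = DOWN) match the values seen in the PROCESS_HOST_CHECK_RESULT
lines quoted above.

```c
#include <stddef.h>

/* Host state codes, as they appear in passive host check results. */
#define HOST_UP          0
#define HOST_DOWN        1
#define HOST_UNREACHABLE 2

/* Hypothetical, stripped-down stand-in for the Nagios host struct. */
typedef struct {
    const char *name;
    const char *host_check_command; /* NULL when no check_command is defined */
    int current_state;              /* last known state, e.g. from a passive result */
} host_sketch;

/* Placeholder for running a real plugin; active checks are not modelled here. */
static int run_active_check_stub(const host_sketch *hst) {
    (void)hst;
    return HOST_UP;
}

/* Behaviour described in the report: no check command means "assume UP". */
int run_host_check_old(const host_sketch *hst) {
    if (hst->host_check_command == NULL)
        return HOST_UP;
    return run_active_check_stub(hst);
}

/* Amended behaviour: no check command means "state is unchanged". */
int run_host_check_patched(const host_sketch *hst) {
    if (hst->host_check_command == NULL)
        return hst->current_state;
    return run_active_check_stub(hst);
}
```

With a host that has no check command and was last reported DOWN by a
passive result, a forced check (such as the one triggered by a stale
service) flips the host to UP in the old variant, while the patched variant
leaves it DOWN.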
Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org