Version 2.0b4: how does the cgi's nagios_check_command work?

John P. Rouillard rouilj at cs.umb.edu
Thu Oct 13 17:28:21 CEST 2005


In message <434E752C.3050301 at op5.se>,
Andreas Ericsson writes:

>John P. Rouillard wrote:
>> In message <434D5AF2.7010504 at op5.se>,
>> Andreas Ericsson writes:
>> 
>> 
>>>John P. Rouillard wrote:
>>>
>>>>In message <43467920.4070508 at op5.se>,
>>>>Andreas Ericsson writes:
>>>>
>>>>>John P. Rouillard wrote:
>>>>>
>>>>>
>>>>>>In message <43465AB9.6020304 at op5.se>,
>>>>>>Andreas Ericsson writes:
>>>>>>
>>>>>>
>>>>>>>John P. Rouillard wrote:
>>>>
>>>>
>>>>>>>>The reason I ask is
>>>>>>>>that nagios was down and the cgi's all happily reported that it was
>>>>>>>>up. Could this be because the host and service status files were
>>>>>>>>available since the machine crashed?
>>>>>>>
>>>>>>>Yes, that's almost certainly it. There is no really good way of 
>>>>>>>detecting that nagios is actually running unless you're logged in as 
>>>>>>>root.
>>>>>>
>>>>>>Hmm, I am not sure I follow why you need to be logged in as root.
>>>>>
>>>>>Because otherwise you shouldn't have access to reading process 
>>>>>information about another user's process.
>>>>>
>>>>>
>>>>>>Why not stat the status.dat file and check whether its age (from
>>>>>>mtime) is less than:
>>>>>>
>>>>>>	status_update_interval*2
>>>>>>
>>>>>>if aggregate_status_updates is enabled? One could also allow a setting
>>>>>>"freshness_threshold" in cgi.cfg that is the number of seconds/minutes
>>>>>>old the status.dat file is allowed to be if aggregate_status_updates
>>>>>>isn't set.
>>>>>
>>>>>Good idea. Write the code for it and submit a patch.
>>>>
>>>>Actually not so much a good idea. There is actually a creation
>>>>datestamp in the status.dat file I was going to use, but I decided to
>>>>run an experiment first. I have my status_update_interval set to 3
>>>>seconds.
>>>>
>>>>I used check_fileage to warn me if the file's age was over 3 seconds
>>>>and ran it in a while loop. It failed often. The longest interval was
>>>>139 seconds between updates with a number of periods of 20-30 seconds.
>>>>
>>>>My guesses are: nagios only writes the status file when it needs to.
>>>
>>>This is correct. The status_update_interval is never checked, although 
>>>the status is updated every time a service changes either state or 
>>>output (or a host, for that matter).
>> 
>> 
>> Ideally nagios would provide a next_check_time in the status.dat, but
>> I wonder if that could be usefully intuited from:
>> 
>>   min(
>>       min(next_check time on services) + service_check_timeout,
>>       min(next_check time on hosts) + host_check_timeout
>>      )
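
To make the min() heuristic above concrete, here is a rough Python sketch.
It assumes status.dat is made of "host {" and "service {" blocks that each
carry a "next_check=<epoch seconds>" line (other versions may name the
blocks differently, so adjust the keys). The function name and the idea of
passing the two timeouts in as arguments are mine, not anything nagios
provides, and the on-demand host check problem still applies.

```python
import re

# Assumed layout: a block opens with "<name> {" and closes with "}".
BLOCK_OPEN = re.compile(r"^(\w+)\s*\{$")


def next_expected_update(status_text, service_check_timeout, host_check_timeout):
    """Estimate the latest time by which status.dat should be rewritten.

    Returns an epoch timestamp, or None if no check is scheduled.
    """
    earliest = {"service": None, "host": None}
    block = None
    for raw in status_text.splitlines():
        line = raw.strip()
        opened = BLOCK_OPEN.match(line)
        if opened:
            block = opened.group(1)
            continue
        if line == "}":
            block = None
            continue
        if block in earliest and line.startswith("next_check="):
            try:
                when = int(line.split("=", 1)[1])
            except ValueError:
                continue  # ignore malformed values
            # Treat 0 as "nothing scheduled" (an assumption).
            if when > 0 and (earliest[block] is None or when < earliest[block]):
                earliest[block] = when

    candidates = []
    if earliest["service"] is not None:
        candidates.append(earliest["service"] + service_check_timeout)
    if earliest["host"] is not None:
        candidates.append(earliest["host"] + host_check_timeout)
    return min(candidates) if candidates else None
```

A cgi could compare the returned value against the current time and
complain if status.dat has not been rewritten by then.
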
>> 
>> Possible problems: on demand host checks (if part of a network is
>> down) could screw up the timing since everything else stops.
>> 
>> Just because a service check is scheduled doesn't mean that it is
>> going to run (time period may be wrong etc), but if it's determined to
>> be non-runnable, the rescheduled time for it should cause a re-write of
>> the status.dat file, correct?
>> 
>> There has to be an easier way of determining if nagios is running,
>> doesn't there?
>> 
>
>Easy isn't the problem. The trick is to get it to work from a different 
>and almost always less privileged user. Perhaps a simple neb-module can 
>touch some file every 10 seconds and if it's 30 seconds old the GUI 
>could then reasonably suspect that nagios has crashed.
>
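
The reading side of that heartbeat idea could look something like the
Python sketch below. The heartbeat path and the function names are made
up for illustration, the 30 second limit is just the figure suggested
above, and the neb-module that would touch the file is not shown.

```python
import os
import time

# Hypothetical values: neither the path nor the limit is a nagios setting.
HEARTBEAT_FILE = "/usr/local/nagios/var/heartbeat"
MAX_AGE_SECONDS = 30


def file_age_seconds(path, now=None):
    """Return seconds since path was last modified, or None if it is missing."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def daemon_looks_dead(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """True if the heartbeat file is missing or older than max_age seconds."""
    age = file_age_seconds(path, now)
    return age is None or age > max_age
```

Only stat() access to the file is needed, so this works from the less
privileged web server user without reading any process information.
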
>However, I haven't noticed nagios crashing on a modern system. It used 
>to, with glibc-2.0.35 and linuxthreads-0.7 (which was really buggy). 
>Since upgrading to glibc-2.3.30 (or some such) and linuxthreads-0.10 
>everything is running smoothly, so this isn't really a problem for me or 
>any of our customers.

Yeah, but it's just bad that the gui will blithely go on even if there
is no nagios daemon running, whether because somebody (maliciously?)
killed it manually and failed to restart it, or because it failed to
restart on boot after a crash (maybe bad config files; I need to check
whether the rc script deletes the status file as well as the command
file). At this point I guess I'll just have to live with it.

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users