Unpredictable service check times fixed?
Mateo Carr
mcarr at apple.com
Tue Apr 15 04:30:06 CEST 2003
I have been experiencing this problem as well. We have 46 hosts running
330 services currently configured in Nagios. I expect this to grow to
about 300 hosts and over 1800 service checks, assuming we can get this
issue resolved.
I recompiled with --enable-DEBUG3 as suggested. Included below are the
relevant output displayed on the screen (from running nagios without the
-d option) and the associated nagios.log output.
nagios.log sample:
[1050363305] Warning: The check of service 'Root Volume Usage' on host
'webx02' could not be performed due to a fork() error. The check will
be rescheduled.
[1050363305] Warning: The check of service 'NFS' on host 'webx01' could
not be performed due to a fork() error. The check will be rescheduled.
[1050363305] Warning: The check of service 'Syslogd' on host 'webx03'
could not be performed due to a fork() error. The check will be
rescheduled.
[1050363305] Warning: The check of service 'CLOSE_WAITS' on host
'webx05' could not be performed due to a fork() error. The check will
be rescheduled.
[1050363305] Warning: The check of service 'Cron' on host '<snip> 01'
could not be performed due to a fork() error. The check will be
rescheduled.
[1050363305] Warning: The check of service 'NFS' on host 'webx09' could
not be performed due to a fork() error. The check will be rescheduled.
[1050363305] Warning: The check of service 'Clock_Drift' on host
'webx06' could not be performed due to a fork() error. The check will
be rescheduled.
[1050363305] Warning: The check of service 'Cron' on host '<snip>03'
could not be performed due to a fork() error. The check will be
rescheduled.
etc.....
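For anyone matching the bracketed nagios.log stamps against the wall-clock times in the screen output, here is a minimal sketch in C that formats an epoch value as UTC text. The helper name format_epoch_utc is made up for illustration and is not part of Nagios.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Format an epoch stamp (as found in nagios.log) as
 * "YYYY-MM-DD HH:MM:SS" in UTC.
 * Hypothetical helper: returns 0 on success, -1 on failure. */
int format_epoch_utc(time_t stamp, char *buf, size_t buflen)
{
    struct tm *tm_utc;

    assert(buf != NULL);
    tm_utc = gmtime(&stamp);          /* broken-down time in UTC */
    if (tm_utc == NULL)
        return -1;
    /* strftime returns 0 if the buffer is too small */
    if (strftime(buf, buflen, "%Y-%m-%d %H:%M:%S", tm_utc) == 0)
        return -1;
    return 0;
}
```

Called with 1050363305, this gives 2003-04-14 23:35:05 UTC, which lines up with the 16:35:05 in the screen output if the server clock runs seven hours behind UTC.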
output on the screen:
*** Event Check Loop ***
Current time: Mon Apr 14 16:35:05 2003
Next High Priority Event Time: Mon Apr 14 16:35:12 2003
Next Low Priority Event Time: Mon Apr 14 16:34:37 2003
Current/Max Outstanding Checks: 106/0
*** Event Details ***
Event type: 0 (service check)
Service Description: Root Volume Usage
Associated Host: webx02
Event time: Mon Apr 14 16:34:37 2003
Checking service 'Root Volume Usage' on host 'webx02'...
Input: check_nrpe!check_root_disk
Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'Root Volume Usage' on host 'webx02'
could not be performed due to a fork() error. The check will be
rescheduled.
Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003
*** Event Check Loop ***
Current time: Mon Apr 14 16:35:05 2003
Next High Priority Event Time: Mon Apr 14 16:35:12 2003
Next Low Priority Event Time: Mon Apr 14 16:34:37 2003
Current/Max Outstanding Checks: 107/0
*** Event Details ***
Event type: 0 (service check)
Service Description: NFS
Associated Host: webx01
Event time: Mon Apr 14 16:34:37 2003
Checking service 'NFS' on host 'webx01'...
Input: check_nrpe!check_nfs_hang
Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'NFS' on host 'webx01' could not be
performed due to a fork() error. The check will be rescheduled.
Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003
*** Event Check Loop ***
Current time: Mon Apr 14 16:35:05 2003
Next High Priority Event Time: Mon Apr 14 16:35:12 2003
Next Low Priority Event Time: Mon Apr 14 16:34:38 2003
Current/Max Outstanding Checks: 108/0
*** Event Details ***
Event type: 0 (service check)
Service Description: Syslogd
Associated Host: webx03
Event time: Mon Apr 14 16:34:38 2003
Checking service 'Syslogd' on host 'webx03'...
Input: check_nrpe!check_syslog
Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'Syslogd' on host 'webx03' could not be
performed due to a fork() error. The check will be rescheduled.
Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003
*** Event Check Loop ***
Current time: Mon Apr 14 16:35:05 2003
Next High Priority Event Time: Mon Apr 14 16:35:12 2003
Next Low Priority Event Time: Mon Apr 14 16:34:38 2003
Current/Max Outstanding Checks: 109/0
*** Event Details ***
Event type: 0 (service check)
Service Description: CLOSE_WAITS
Associated Host: webx05
Event time: Mon Apr 14 16:34:38 2003
Checking service 'CLOSE_WAITS' on host 'webx05'...
Input: check_nrpe!check_close_wait
Output: $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -to 60
Warning: The check of service 'CLOSE_WAITS' on host 'webx05' could not
be performed due to a fork() error. The check will be rescheduled.
Preferred Time: 1050363305 --> Mon Apr 14 16:35:05 2003
Next Valid Time: 1050363305 --> Mon Apr 14 16:35:05 2003
etc.....you get the point.
A restart appears to clear up the problem for about 3 to 4 hours.
Any light that could be shed on why this is happening would be very
much appreciated.
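The warnings above only say "a fork() error" and do not show errno. As a rough diagnostic sketch, assuming a POSIX system and that it is run as the same user as the Nagios daemon while the problem is occurring, something like the following would show the reason fork() gives for failing (commonly EAGAIN when a process limit is hit, or ENOMEM). The name probe_fork is hypothetical and this is not Nagios code; it does not by itself tell us what is using up the processes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Attempt a single fork() and report the outcome on `log`.
 * Hypothetical helper: returns 0 if fork() succeeded,
 * otherwise the errno value that fork() set. */
int probe_fork(FILE *log)
{
    pid_t pid;

    assert(log != NULL);
    pid = fork();
    if (pid < 0) {
        int saved = errno;            /* save before any other call */
        fprintf(log, "fork() failed: %s (errno %d)\n",
                strerror(saved), saved);
        return saved;
    }
    if (pid == 0)
        _exit(0);                     /* child: exit without flushing stdio */

    /* parent: reap the child so it does not linger as a zombie */
    if (waitpid(pid, NULL, 0) < 0)
        fprintf(log, "waitpid() failed: %s\n", strerror(errno));
    fprintf(log, "fork() succeeded\n");
    return 0;
}
```

Calling probe_fork(stderr) from a small main() every few minutes, and noting when it starts failing, might show whether the failures track the growing count of outstanding checks.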
Thanks!
Mateo Carr
Systems Engineer
Apple Computer, Inc.
mcarr at apple.com
On Saturday, April 12, 2003, at 10:20 PM, Stanley Hopcroft wrote:
> Dear Sir,
>
> I am writing to thank you for your letter and say,
>
> 0 If you are not using Nagios-1.0 then please try that, otherwise
>
> 1 You may have found a bug as you say in the scheduler.
>
> However, there are _many_ Nag installations monitoring far more hosts
> and services without problems. (Here ~ 200 hosts and 350 services).
>
> If that is the case, the only way you can demonstrate the bug is by
> setting up a Test Nag environment - it could be your production
> environment since that is exhibiting the problem - and run Nagios in
> such a way that you can collect debug information.
>
> This is probably easiest done by rebuilding Nag with the appropriate
> debug config option
>
> (./configure --help
> ..
> --enable-DEBUG0 shows function entry and exit
> --enable-DEBUG1 shows general info messages
> --enable-DEBUG2 shows warning messages
> --enable-DEBUG3 shows scheduled events (service and host checks... etc)
> --enable-DEBUG4 shows service and host notifications
> --enable-DEBUG5 shows SQL queries
>
> so probably DEBUG3)
>
> then run Nag in foreground (no -d) and post the parts of the log that
> show scheduling anomalies.
>
> Alternatively, modify the plugin of the service that seems to be
> suffering the most severe scheduling delays to log its invocation and
> exit.
>
> This probably means adding code like (to a C plugin)
>
> +time_t my_clock;   /* needs <stdio.h> and <time.h> */
> +my_clock = time(NULL);
> +fprintf(stderr, "myPlugin started at %s", ctime(&my_clock));
>
> ...
>
> +my_clock = time(NULL);
> +fprintf(stderr, "myPlugin finished at %s", ctime(&my_clock));
>
> recompiling it and installing it - probably under a new name - in the
> Nag libexec directory.
>
> 2 If you want a tactical dumb solution, use a cron job that sends a
> hangup signal to Nagios periodically (or restarts it).
>
> You probably want to post the relevant parts of nagios.cfg also.
>
> Yours sincerely.
>
> --
> -----------------------------------------------------------------------
> -
> Stanley Hopcroft
> -----------------------------------------------------------------------
> -
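The plugin-logging suggestion in the quoted message could be packaged as a small helper along these lines. This is only a sketch: log_plugin_event is a made-up name, and the timestamp format is an arbitrary choice rather than anything Nagios requires.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Write "<plugin> <event> at <local timestamp>" to `out`.
 * Hypothetical helper: returns the number of characters written,
 * or a negative value on failure. */
int log_plugin_event(FILE *out, const char *plugin, const char *event)
{
    time_t now = time(NULL);
    struct tm *tm_local;
    char stamp[32];

    assert(out != NULL && plugin != NULL && event != NULL);
    tm_local = localtime(&now);
    if (tm_local == NULL ||
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tm_local) == 0)
        return -1;
    return fprintf(out, "%s %s at %s\n", plugin, event, stamp);
}
```

The plugin would call log_plugin_event(stderr, "myPlugin", "started") at the top of main() and log_plugin_event(stderr, "myPlugin", "finished") just before returning, so the gap between the two stamps can be compared with the scheduled check time.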
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users