Bug report/workaround -- (was Re: Nagios Performance Data shows checks aren't being completed)
Eli Stair
estair at ilm.com
Thu Dec 15 20:35:31 CET 2005
I've been trying to resolve this situation for over a week now without
making drastic changes. Setup: 2.0b6, all retention data created new (not
carried over from older versions), x86_64, perl cache enabled.
I've had a worsening problem recently on my monitoring host (which
controls 1003 hosts/8543 services/5257 service dependencies): an
increasing number of service checks and event handlers were falling
through the scheduler. Even after stopping and starting nagios, and
doing a forced_host_svc_checks for the relevant checks/responses during
the several-minute execution pause, these were being skipped or not
acted upon. The status in 'view config' confirmed that everything was
set up properly, but events were missed and either not rescheduled or
rescheduled but not executed.
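One way to see which checks are being skipped is to scan the status file for services whose last check is older than the expected interval. The following is a minimal sketch, not a verified tool: it assumes the status file is made of `name { key=value }` blocks carrying `host_name`, `service_description` and `last_check` fields, and that service blocks are named `service` or `servicestatus`. Those names, and the helper names `parse_blocks` and `stale_services`, are assumptions and may need adjusting for a given Nagios version.

```python
import re
import time

# Assumed block header form: a bare word followed by an opening brace.
BLOCK_START = re.compile(r"^\s*(\w+)\s*\{\s*$")


def parse_blocks(text):
    """Yield (block_type, fields) for each 'name { key=value ... }' block."""
    block_type, fields = None, None
    for line in text.splitlines():
        stripped = line.strip()
        match = BLOCK_START.match(line)
        if match and block_type is None:
            block_type, fields = match.group(1), {}
        elif stripped == "}" and block_type is not None:
            yield block_type, fields
            block_type, fields = None, None
        elif block_type is not None and "=" in stripped:
            key, _, value = stripped.partition("=")
            fields[key] = value


def stale_services(text, max_age, now=None,
                   block_types=("service", "servicestatus")):
    """Return (host, service, last_check) for services not checked
    within max_age seconds of 'now' (defaults to the current time)."""
    now = time.time() if now is None else now
    stale = []
    for block_type, fields in parse_blocks(text):
        if block_type not in block_types:
            continue
        last_check = int(fields.get("last_check", 0))
        if now - last_check > max_age:
            stale.append((fields.get("host_name"),
                          fields.get("service_description"),
                          last_check))
    return stale
```

Calling `stale_services(open(path).read(), 3600)` on a copy of the status file would list every service that has gone more than an hour without a check, which is the one-hour threshold mentioned in this report.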
The last step I took was to stop nagios a final time last night and zero
the state file retention.dat (as well as the objects.cache for good
measure, though it wasn't the problem). After starting nagios fresh
with no notion of previous states, within one hour (my threshold for
service/host checks) the entire schedule was executed properly, all
services that had been in an unhandled 'bad' state for days were
checked, and the respective event handlers were run and the situation
rectified.
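The workaround above can be sketched as follows. This is a minimal sketch under stated assumptions, not a tested procedure: the function name `reset_state_file` and the backup directory are hypothetical, and it assumes Nagios has already been stopped before the call. It copies the state file aside before truncating it, so the old data stays available for later analysis.

```python
import shutil
import time
from pathlib import Path


def reset_state_file(state_file, backup_dir):
    """Back up state_file into backup_dir, then truncate it in place.

    Returns the path of the backup copy. Stopping Nagios beforehand and
    starting it afterwards is left to the caller.
    """
    src = Path(state_file)
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Timestamped name so repeated runs do not overwrite older backups.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = dest_dir / f"{src.name}.{stamp}"

    shutil.copy2(src, backup)   # keep the old retention data
    src.write_text("")          # zero the file without recreating it
    return backup
```

The same call could be repeated for objects.cache. Truncating in place, rather than deleting and recreating the file, leaves the file's existing ownership and permissions alone.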
I have no idea of the cause of this, whether it will happen again or
not, etc. I'll be more than happy to provide more details. I have
backups of the config and retention files from several points during
this period.
I'd really like to help resolve this, as losing the trending data is not
something I want to do again. My only concern with this setup is the
"Warning: Size of service_message struct (528 bytes) is >
POSIX-guaranteed atomic write size (512 bytes). Service checks results
may get lost or mangled!" message I get when building the 2.0 betas on
any system I have available. I haven't seen this addressed or resolved
in any of the archive searches I've done.
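The arithmetic behind that warning can be checked against the local system. The sketch below assumes the check results travel through a pipe, so that the relevant limit is PIPE_BUF; that is an inference from the warning text, not something confirmed here. POSIX only guarantees 512 bytes for atomic pipe writes, but a given platform may allow more (Linux documents 4096). The helper name `fits_atomic_write` is hypothetical.

```python
import select

# _POSIX_PIPE_BUF: the smallest PIPE_BUF that POSIX allows a system to have.
POSIX_MIN_PIPE_BUF = 512


def fits_atomic_write(message_size, pipe_buf=POSIX_MIN_PIPE_BUF):
    """True if a single write of message_size bytes to a pipe is atomic,
    i.e. no larger than pipe_buf."""
    return message_size <= pipe_buf


if __name__ == "__main__":
    struct_size = 528  # size reported by the build warning
    # select.PIPE_BUF is only defined on Unix; fall back to the POSIX minimum.
    local_pipe_buf = getattr(select, "PIPE_BUF", POSIX_MIN_PIPE_BUF)
    print("POSIX minimum:", fits_atomic_write(struct_size))
    print("this system (PIPE_BUF=%d):" % local_pipe_buf,
          fits_atomic_write(struct_size, local_pipe_buf))
```

If the pipe assumption holds, a 528-byte struct would only be at risk on systems whose PIPE_BUF is actually below 528, which would make the warning an unlikely explanation for lost checks on a typical Linux host.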
Cheers,
/eli
Eli Stair wrote:
>
> Corroboration here, I actually have a mail I'm compiling also on the
> same issue. 2.0b6
>
> I've got orphaned service checks enabled, unlimited parallel service
> checks, etc. If I force a host/svc check through the CGI's or the
> command file direct they get executed right away... the scheduler just
> is losing them.
>
> /eli
>
> sheeri kritzer wrote:
>
>> Hi all,
>>
>> My nagios 2.0 installation shows the following under performance
>> information. There are 99 service checks, and I can't imagine it
>> takes more than an hour to complete all 99. We've had problems where
>> nagios hasn't found and notified us of problems. The load on the box
>> is tiny. nagios -s has no suggestions. What did I do wrong?
>>
>> uptime
>> 17:38:38 up 81 days, 9:05, 4 users, load average: 0.00, 0.00, 0.00
>>
>> Nagios is running, and has been for a while:
>>
>> ps -ef | grep nagios
>> nagios 11160 1 0 Nov14 ? 00:12:32 /usr/bin/nagios -d /etc/nagios/nagios.cfg
>> nagios 22947 1 0 Nov20 ? 00:00:00 nrpe -c /etc/nagios/nrpe.cfg -d
>>
>> Performance Info:
>>
>> Program-Wide Performance Information
>>
>> Active Service Checks:
>>
>> Time Frame            Checks Completed
>> <= 1 minute:          1 (1.0%)
>> <= 5 minutes:         58 (58.6%)
>> <= 15 minutes:        60 (60.6%)
>> <= 1 hour:            60 (60.6%)
>> Since program start:  99 (100.0%)
>>
>> Metric                 Min.      Max.      Average
>> Check Execution Time:  0.01 sec  8.71 sec  1.286 sec
>> Check Latency:         0.01 sec  1.03 sec  0.488 sec
>> Percent State Change:  0.00%     0.00%     0.00%
>>
>> Passive Service Checks:
>>
>> Time Frame            Checks Completed
>> <= 1 minute:          0 (0.0%)
>> <= 5 minutes:         0 (0.0%)
>> <= 15 minutes:        0 (0.0%)
>> <= 1 hour:            0 (0.0%)
>> Since program start:  0 (0.0%)
>>
>> Metric                 Min.      Max.      Average
>> Percent State Change:  0.00%     0.00%     0.00%
>>
>> Active Host Checks:
>>
>> Time Frame            Checks Completed
>> <= 1 minute:          0 (0.0%)
>> <= 5 minutes:         0 (0.0%)
>> <= 15 minutes:        0 (0.0%)
>> <= 1 hour:            0 (0.0%)
>> Since program start:  19 (76.0%)
>>
>> Metric                 Min.      Max.      Average
>> Check Execution Time:  3.01 sec  4.01 sec  3.972 sec
>> Check Latency:         0.00 sec  0.00 sec  0.000 sec
>> Percent State Change:  0.00%     0.00%     0.00%
>>
>> Passive Host Checks:
>>
>> Time Frame            Checks Completed
>> <= 1 minute:          0 (0.0%)
>> <= 5 minutes:         0 (0.0%)
>> <= 15 minutes:        0 (0.0%)
>> <= 1 hour:            0 (0.0%)
>> Since program start:  0 (0.0%)
>>
>> Metric                 Min.      Max.      Average
>> Percent State Change:  0.00%     0.00%     0.00%
>>
>> ----------------------------------------------------------------------------------------------------------------------------
>>
>>
>> Nagios 2.0b4
>> Copyright (c) 1999-2005 Ethan Galstad (http://www.nagios.org)
>> Last Modified: 08-02-2005
>> License: GPL
>>
>> Projected scheduling information for host and service
>> checks is listed below. This information assumes that
>> you are going to start running Nagios with your current
>> config files.
>>
>> HOST SCHEDULING INFORMATION
>> ---------------------------
>> Total hosts: 25
>> Total scheduled hosts: 0
>> Host inter-check delay method: SMART
>> Average host check interval: 0.00 sec
>> Host inter-check delay: 0.00 sec
>> Max host check spread: 30 min
>> First scheduled check: N/A
>> Last scheduled check: N/A
>>
>>
>> SERVICE SCHEDULING INFORMATION
>> -------------------------------
>> Total services: 99
>> Total scheduled services: 99
>> Service inter-check delay method: SMART
>> Average service check interval: 300.00 sec
>> Inter-check delay: 3.03 sec
>> Interleave factor method: SMART
>> Average services per host: 3.96
>> Service interleave factor: 4
>> Max service check spread: 30 min
>> First scheduled check: Mon Dec 12 17:39:51 2005
>> Last scheduled check: Mon Dec 12 17:44:47 2005
>>
>>
>> CHECK PROCESSING INFORMATION
>> ----------------------------
>> Service check reaper interval: 10 sec
>> Max concurrent service checks: Unlimited
>>
>>
>> PERFORMANCE SUGGESTIONS
>> -----------------------
>> I have no suggestions - things look okay.
>>
>>
>> ---------------------------------------------------------------------------------------------------------------------------------
>>
>>
>> grep -v ^# /etc/nagios/nagios.cfg | grep -v ^$
>> Nagios.cfg params:
>>
>> log_file=/var/log/nagios/nagios.log
>> cfg_file=/etc/nagios/checkcommands.cfg
>> cfg_file=/etc/nagios/misccommands.cfg
>> cfg_file=/etc/nagios/contactgroups.cfg
>> cfg_file=/etc/nagios/contacts.cfg
>> cfg_file=/etc/nagios/dependencies.cfg
>> cfg_file=/etc/nagios/escalations.cfg
>> cfg_file=/etc/nagios/hostgroups.cfg
>> cfg_file=/etc/nagios/hosts.cfg
>> cfg_file=/etc/nagios/services.cfg
>> cfg_file=/etc/nagios/timeperiods.cfg
>> object_cache_file=/var/log/nagios/objects.cache
>> resource_file=/etc/nagios/resource.cfg
>> status_file=/var/log/nagios/status.dat
>> nagios_user=nagios
>> nagios_group=nagios
>> check_external_commands=1
>> command_check_interval=-1
>> command_file=/var/log/nagios/rw/nagios.cmd
>> comment_file=/var/log/nagios/comments.dat
>> downtime_file=/var/log/nagios/downtime.dat
>> lock_file=/var/run/nagios.pid
>> temp_file=/var/log/nagios/nagios.tmp
>> event_broker_options=-1
>> log_rotation_method=d
>> log_archive_path=/var/log/nagios/archives
>> use_syslog=1
>> log_notifications=1
>> log_service_retries=1
>> log_host_retries=1
>> log_event_handlers=1
>> log_initial_states=0
>> log_external_commands=1
>> log_passive_checks=1
>> service_inter_check_delay_method=s
>> max_service_check_spread=30
>> service_interleave_factor=s
>> host_inter_check_delay_method=s
>> max_host_check_spread=30
>> max_concurrent_checks=0
>> service_reaper_frequency=10
>> auto_reschedule_checks=0
>> auto_rescheduling_interval=30
>> auto_rescheduling_window=180
>> sleep_time=0.25
>> service_check_timeout=60
>> host_check_timeout=30
>> event_handler_timeout=30
>> notification_timeout=30
>> ocsp_timeout=5
>> perfdata_timeout=5
>> retain_state_information=1
>> state_retention_file=/var/log/nagios/retention.dat
>> retention_update_interval=60
>> use_retained_program_state=1
>> use_retained_scheduling_info=0
>> interval_length=60
>> use_aggressive_host_checking=0
>> execute_service_checks=1
>> accept_passive_service_checks=1
>> execute_host_checks=1
>> accept_passive_host_checks=1
>> enable_notifications=1
>> enable_event_handlers=1
>> process_performance_data=0
>> obsess_over_services=0
>> check_for_orphaned_services=0
>> check_service_freshness=1
>> service_freshness_check_interval=60
>> check_host_freshness=0
>> host_freshness_check_interval=60
>> aggregate_status_updates=1
>> status_update_interval=15
>> enable_flap_detection=0
>> low_service_flap_threshold=5.0
>> high_service_flap_threshold=20.0
>> low_host_flap_threshold=5.0
>> high_host_flap_threshold=20.0
>> date_format=us
>> p1_file=/usr/bin/p1.pl
>> illegal_object_name_chars=`~!$%^&*|'"<>?,()=
>> illegal_macro_output_chars=`~$&|'"<>
>> use_regexp_matching=0
>> use_true_regexp_matching=0
>> admin_email=nagios
>> admin_pager=pagenagios
>> daemon_dumps_core=0
>>
>> Any help is much appreciated.
>>
>> Thank you,
>>
>> Sheeri Kritzer
>>
>>
>> -------------------------------------------------------
>> This SF.net email is sponsored by: Splunk Inc. Do you grep through log
>> files
>> for problems? Stop! Download the new AJAX search engine that makes
>> searching your log files as easy as surfing the web. DOWNLOAD SPLUNK!
>> http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
>> _______________________________________________
>> Nagios-users mailing list
>> Nagios-users at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-users
>> ::: Please include Nagios version, plugin version (-v) and OS when
>> reporting any issue. ::: Messages without supporting info will risk
>> being sent to /dev/null
>>
>