Optimising nagios
Jorgen Lundman
lundman at gmo.jp
Thu Dec 9 04:45:52 CET 2004
Take two; I sent it from the wrong email address the first time. Moderators,
you can just ignore that one.
I do not know if we have a particularly large setup of Nagios, but I believe I
am starting to see the effects of having too many hosts and service checks.
The next-check events seem to lag further and further behind, and opening pages
like "Status Summary" is very slow (although user responsiveness is not nearly
as important to me as the monitoring is). Re-submitting a check immediately can
take 4-5 minutes before it takes effect.
Anyway, details are:
* Supermicro 6013, dual 2.4 GHz, Solaris 9, 1 GB memory.
Load average is generally between 2 and 3 (the graph shows it closer to 2 than
3, with no spikes), which seems ideal on a dual-CPU system.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
Statistics:
* Hosts
* 46 Down 0 Unreachable 522 Up 2 Pending
* Services
* 62 Critical 26 Warning 1 Unknown 3370 Ok 0 Pending
Yesterday I changed it from testing services every 5 minutes to every 10
minutes in an attempt to quiet things down. I would rather have it be every 5
minutes, but if that is too frequent, then so be it.
Currently we only use active checks, no passive checks at all. At a guess,
check_nrpe is the most used command; there are some Perl checks, but they
should not be the majority (on the local monitoring machine, I mean). Perhaps I
should grep the execution history to see which commands are executed the most.
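Without execution logs to hand, one approximation is to count how often each command appears in the object configuration instead. The sketch below assumes definitions carry lines of the form `check_command name!arg1!arg2`; the helper `count_check_commands` and the sample text are made up for illustration. It counts configured commands (host checks included), not what actually ran.

```python
import re
from collections import Counter

# Captures the command name: everything after "check_command" up to the
# first "!" argument separator, whitespace, or ";" comment marker.
CHECK_COMMAND_RE = re.compile(r"^[ \t]*check_command[ \t]+([^!\s;]+)", re.MULTILINE)

def count_check_commands(config_text):
    """Count configured check commands by name (hypothetical helper)."""
    return Counter(CHECK_COMMAND_RE.findall(config_text))

# Made-up sample, not taken from the real configuration.
sample = """
define service{
    use             generic-service
    check_command   check_nrpe!check_load
}
define service{
    use             generic-service
    check_command   check_nrpe!check_disk
}
define service{
    use             generic-service
    check_command   check_ping!100.0,20%!500.0,60%
}
"""

counts = count_check_commands(sample)
for name, n in counts.most_common():
    print(name, n)
```

In practice the text would come from reading the real object configuration files and passing their contents to the same function.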
I have been reading the optimisation documentation, and it seems we are already
doing some (maybe even most) of the items suggested. I still have the embedded
Perl build option (--enable-embedded-perl) to try if there is nothing obviously
wrong with our setup.
There are still some devices to be added, in particular, the network devices are
still not present.
Gaps have started appearing in the graphs, which could be due to checks being
delayed, or it could be something unrelated.
I restarted it entirely today, just to clean things out, making sure it isn't
running twice etc.
How bad does it look?
Lund
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
Nagios -s reports:
SERVICE SCHEDULING INFORMATION
-------------------------------
Total services: 3459
Total hosts: 570
Command check interval: -1 sec
Check reaper interval: 4 sec
Inter-check delay method: SMART
Average check interval: 600.867 sec
Inter-check delay: 0.174 sec
Interleave factor method: SMART
Average services per host: 6.068
Service interleave factor: 7
Initial service check scheduling info:
--------------------------------------
First scheduled check: 1102561317 -> Thu Dec 9 12:01:57 2004
Last scheduled check: 1102561918 -> Thu Dec 9 12:11:58 2004
Rough guidelines for max_concurrent_checks value:
-------------------------------------------------
Absolute minimum value: 24
Recommend value: 72
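As a cross-check of the figures above, the SMART values can be reproduced by hand. This is a sketch that assumes the SMART inter-check delay is the average check interval divided by the number of services, and the SMART interleave factor is the services-per-host ratio rounded up; the variable names are only for illustration.

```python
import math

# Figures copied from the "nagios -s" output above.
total_services = 3459
total_hosts = 570
avg_check_interval = 600.867  # seconds

# Assumed SMART formulas (see lead-in).
inter_check_delay = avg_check_interval / total_services
services_per_host = total_services / total_hosts
interleave_factor = math.ceil(services_per_host)

# Width of the initial scheduling window, from the two epoch timestamps.
schedule_span = 1102561918 - 1102561317

print(f"inter-check delay: {inter_check_delay:.3f} sec")
print(f"services per host: {services_per_host:.3f}")
print(f"interleave factor: {interleave_factor}")
print(f"initial schedule span: {schedule_span} sec")
```

The results agree with the reported 0.174 sec delay, 6.068 services per host and interleave factor of 7, and the initial checks are spread over roughly one full check interval.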
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
Current configuration values are:
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
comment_file=/usr/local/nagios/var/comment.log
downtime_file=/usr/local/nagios/var/downtime.log
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
log_rotation_method=m
log_archive_path=/usr/local/nagios/var/archives
use_syslog=0
log_notifications=0
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_service_checks=1
inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=4
sleep_time=1
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/usr/local/nagios/var/status.sav
retention_update_interval=60
use_retained_program_state=0
interval_length=60
use_agressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=1
service_perfdata_command=service-perf-data-handler
obsess_over_services=0
check_for_orphaned_services=0
check_service_freshness=1
freshness_check_interval=60
aggregate_status_updates=1
status_update_interval=15
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
Typical template for hosts (actually, used by 100% of hosts):
name generic-host
notifications_enabled 1 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
max_check_attempts 10
notification_interval 120
notification_period 24x7
notification_options d,u,r
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
Template for services (used by 100% of services):
name generic-service ; The 'name' of this service template, referenced in other service definitions
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0
check_period 24x7
max_check_attempts 5
normal_check_interval 10
retry_check_interval 3
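For reference, the template intervals are expressed in units of interval_length, so they can be converted to seconds as below. This is a small sketch using the values from the configuration above; the time-to-hard-state figure assumes the first failed check counts as attempt 1 and ignores any scheduling latency.

```python
# Values from the main configuration and the service template above.
interval_length = 60        # seconds per time unit
normal_check_interval = 10  # time units
retry_check_interval = 3    # time units
max_check_attempts = 5

normal_seconds = normal_check_interval * interval_length
retry_seconds = retry_check_interval * interval_length

# Assumption: after the first non-OK result, the remaining
# (max_check_attempts - 1) retries each wait one retry interval.
time_to_hard_state = (max_check_attempts - 1) * retry_seconds

print(f"normal check every {normal_seconds} sec")
print(f"retry check every {retry_seconds} sec")
print(f"hard state after about {time_to_hard_state} sec of retries")
```

The 600-second normal interval lines up with the average check interval of 600.867 sec reported by nagios -s.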
++++++++++++++++++++++++++++++++++++++++++++++++++++
extinfo.cgi output
Program-Wide Performance Information
Active Checks:
Time Frame              Checks Completed
<= 1 minute:            52 (1.5%)
<= 5 minutes:           52 (1.5%)
<= 15 minutes:          805 (23.3%)
<= 1 hour:              3458 (100.0%)
Since program start:    2443 (70.6%)

Metric                  Min.      Max.       Average
Check Execution Time:   < 1 sec   60 sec     0.785 sec
Check Latency:          < 1 sec   2097 sec   604.420 sec
Percent State Change:   0.00%     6.25%      0.00%

Passive Checks:
Time Frame              Checks Completed
<= 1 minute:            0 (0.0%)
<= 5 minutes:           0 (0.0%)
<= 15 minutes:          0 (0.0%)
<= 1 hour:              0 (0.0%)
Since program start:    0 (0.0%)
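A rough reading of these figures, as a sketch only: it compares the check rate the schedule requires with the rate implied by the 15-minute bucket, and estimates the average number of checks in flight as rate times mean execution time (Little's law). It assumes the 15-minute bucket is representative, which may not hold so soon after a restart; the variable names are only for illustration.

```python
# Figures from the nagios -s and extinfo.cgi output above.
total_services = 3459
check_interval = 600.0   # seconds (normal_check_interval 10 x interval_length 60)
avg_execution = 0.785    # seconds per check
completed_15min = 805    # checks completed in the last 15 minutes

# Rate needed to check every service once per interval.
required_rate = total_services / check_interval

# Rate implied by the 15-minute bucket (assumed representative).
observed_rate = completed_15min / (15 * 60)

# Little's law: average checks in flight = arrival rate x time in system.
avg_in_flight = required_rate * avg_execution

print(f"required: {required_rate:.3f} checks/sec")
print(f"observed: {observed_rate:.3f} checks/sec")
print(f"observed/required: {observed_rate / required_rate:.1%}")
print(f"average checks in flight if on schedule: {avg_in_flight:.1f}")
```

On these numbers the observed rate is well below the required one, which would be consistent with the average latency being about one full check interval.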
--
Jorgen Lundman | <lundman at lundman.net>
Unix Administrator | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo | +81 (0)90-5578-8500 (cell)
Japan | +81 (0)3 -3375-1767 (home)
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users