Nagios 2.6 still not draining command pipe fast enough (update with nagios 2.7)
John P. Rouillard
rouilj+nagiosdev at cs.umb.edu
Wed Feb 28 21:11:01 CET 2007
In message <45DB09B8.2010609 at nagios.org>,
Ethan Galstad writes:
>John P. Rouillard wrote:
>> In message <45D9C0C6.8030204 at nagios.org>,
>> Ethan Galstad writes:
>>
>>> John P. Rouillard wrote:
>>>> Hi all:
>>>>
>>>> I am trying to get my external correlation engine working with nagios
>>>> 2.x <http://www.cs.umb.edu/~rouilj/#secnagios>, and I just can't get
>>>> nagios to drain the command pipe fast enough. I see approx. 5% failure
>>>> rate on writing to the command pipe with an EAGAIN error.
>>>>
>>>> I have increased:
>>>>
>>>> nagios.h:#define COMMAND_BUFFER_SLOTS 20480
>>>> nagios.h:#define SERVICE_BUFFER_SLOTS 20480
>>>>
>>>> from the original 1024. In the increase of the settings from 10240 to
>>>> 20480, I may see a slight decrease (maybe .5%), but I think I just want
>>>> to see it. I don't think it's statistically viable.
>>> John - Does this problem still occur with Nagios 2.7 or the latest 2.x
>>> CVS code? A separate command file worker thread should be reading
>>> entries from the external command file as fast as it can read them (as
>>> long as there are free buffer slots).
>>>
>>> If there aren't any external commands, the thread waits 0.5 seconds
>>> before checking for new commands in the file. If you have occasional
>>> bursts of check results, this could be too long to wait. You could try
>>> experimenting with decreasing the 0.5 second delay. Around line 4948 of
>>> base/utils.c, you'll find...
>>>
>>> /* wait a bit */
>>> tv.tv_sec=0;
>>> tv.tv_usec=500000;
>>> select(0,NULL,NULL,NULL,&tv);
>>>
>>> You could try decreasing the value of tv.tv_usec to 100000 (0.1 seconds)
>>> and see if that helps at all.

I installed Nagios 2.7 last Thursday. The EAGAIN failure rate has
dropped from 5% to somewhere around 0.7%, but that may not be the
stable point: it is still growing, and it was 0.5% a couple of days
ago. I haven't tried changing the sleep times mentioned above because
of a dramatic increase in average latency.
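
For reference, the EAGAIN failures come from a non-blocking write to a
full command pipe. Below is a minimal sketch of a writer that retries
with a short back-off instead of dropping the command. It is not the
code used by the correlation engine or by Nagios: the function names,
the retry count and the back-off value are all assumptions made for
illustration. The sleep uses the same select() idiom as the
base/utils.c snippet quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Sleep for usec microseconds using select() with no descriptors. */
static void short_sleep(long usec)
{
    struct timeval tv;
    tv.tv_sec = usec / 1000000L;
    tv.tv_usec = usec % 1000000L;
    select(0, NULL, NULL, NULL, &tv);
}

/* Hypothetical helper: open the command pipe for non-blocking writes.
 * Returns -1 (errno ENXIO) if nothing has the FIFO open for reading. */
int open_command_pipe(const char *path)
{
    return open(path, O_WRONLY | O_NONBLOCK);
}

/* Hypothetical helper: write one complete command line to fd.
 * Retries up to max_retries times when the pipe is full (EAGAIN),
 * sleeping backoff_usec between attempts.
 * Returns 0 on success, -1 if the line could not be written. */
int write_command_retry(int fd, const char *line, int max_retries,
                        long backoff_usec)
{
    size_t len;
    int attempt;

    assert(line != NULL);
    len = strlen(line);

    for (attempt = 0; attempt <= max_retries; attempt++) {
        ssize_t n = write(fd, line, len);
        if (n == (ssize_t)len)
            return 0;              /* whole line accepted */
        if (n >= 0)
            return -1;             /* partial write: line longer than PIPE_BUF */
        if (errno == EINTR)
            continue;              /* interrupted, try again at once */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;             /* real error, do not retry */
        short_sleep(backoff_usec); /* pipe full, give the reader time */
    }
    return -1;
}
```

Lines no longer than PIPE_BUF are written to a pipe atomically, so a
retry never leaves half a command behind. Whether retrying is better
than counting the failure and moving on depends on how long the
writer can afford to wait.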

I am now seeing average latency in the 20 second range, rather than
the 1 second I was getting with my Nagios 2.6 install. What is funny
is that the GUI is showing:

  Check Latency: 0.00 sec 109.37 sec 34.685 sec

which doesn't agree with what nagiostats reports. The max latency is
understandable as we have been having some network drops, but even in
a freshly started Nagios with no network issues, the latency is in the
same range after a couple of hours. A 5 day old Nagios process was
reporting the following from nagiostats:

Nagios Stats 2.7
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 01-19-2007
License: GPL
CURRENT STATUS DATA
----------------------------------------------------
Status File: /var/log/nagios/status.dat
Status File Age: 0d 0h 0m 1s
Status File Version: 2.7
Program Running Time: 5d 21h 28m 58s
Nagios PID: 29914
Used/High/Total Command Buffers: 0 / 45 / 4096
Used/High/Total Check Result Buffers: 96 / 441 / 4096
Total Services: 1876
Services Checked: 1696
Services Scheduled: 1627
Active Service Checks: 1692
Passive Service Checks: 184
Total Service State Change: 0.000 / 73.420 / 2.913 %
Active Service Latency: 0.000 / 90.954 / 19.948 sec
Active Service Execution Time: 0.000 / 55.244 / 4.032 sec
Active Service State Change: 0.000 / 73.420 / 3.188 %
Active Services Last 1/5/15/60 min: 870 / 1353 / 1414 / 1450
Passive Service State Change: 0.000 / 16.780 / 0.381 %
Passive Services Last 1/5/15/60 min: 123 / 175 / 176 / 177
Services Ok/Warn/Unk/Crit: 1400 / 24 / 274 / 178
Services Flapping: 0
Services In Downtime: 0
Total Hosts: 118
Hosts Checked: 118
Hosts Scheduled: 0
Active Host Checks: 118
Passive Host Checks: 0
Total Host State Change: 0.000 / 57.630 / 3.628 %
Active Host Latency: 0.000 / 0.000 / 0.000 sec
Active Host Execution Time: 0.016 / 3.029 / 0.532 sec
Active Host State Change: 0.000 / 57.630 / 3.628 %
Active Hosts Last 1/5/15/60 min: 42 / 56 / 60 / 64
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 96 / 22 / 0
Hosts Flapping: 0
Hosts In Downtime: 0

From these stats it doesn't look like I am exceeding the ring buffer.
Top shows nagios using a few percent of the CPU; it's not running at
100% by any means. A sample from a restarted nagios (running for 4
hours and 38 minutes) is:

top - 19:55:41 up 153 days, 20:57, 3 users, load average: 0.66, 1.03, 1.11
Tasks: 84 total, 1 running, 82 sleeping, 1 stopped, 0 zombie
Cpu0 : 1.7% us, 1.0% sy, 0.0% ni, 96.3% id, 1.0% wa, 0.0% hi, 0.0% si
Cpu1 : 0.0% us, 0.3% sy, 0.0% ni, 99.3% id, 0.3% wa, 0.0% hi, 0.0% si
Mem: 4151276k total, 3064692k used, 1086584k free, 153636k buffers
Swap: 8191992k total, 328k used, 8191664k free, 2779684k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18076 nagios 17 0 31748 7700 724 S 2 0.2 9:37.20 nagios

where 18076 is the main nagios process at this point. (I restarted it
to see if the latency would creep back up to 30 seconds; sadly, I
forgot to measure the original 5+ day nagios.) So I claim that the
nagios process has plenty of cycles available to process the
increased number of passive checks before it should start bogging down
and falling behind. Also, is there any way to tell what the command
pipe thread's pid is (under Linux)?
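
One possible way to answer the thread question, assuming a Linux
kernel that exposes /proc/<pid>/task (2.6 and later) and a threading
library where all threads share one PID: each entry in that directory
is a thread ID. The sketch below only lists the IDs. It cannot tell
which one is the command file worker, and the function name is
hypothetical.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: collect the thread IDs of process pid by
 * reading /proc/<pid>/task (Linux only). Stores at most max IDs in
 * out and returns the number of threads found, or -1 if the
 * directory cannot be opened. */
int list_thread_ids(pid_t pid, pid_t *out, int max)
{
    char path[64];
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    assert(out != NULL && max > 0);
    snprintf(path, sizeof path, "/proc/%ld/task", (long)pid);
    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char *end;
        long tid = strtol(ent->d_name, &end, 10);
        if (end == ent->d_name || *end != '\0')
            continue;              /* not a number: skips "." and ".." */
        if (count < max)
            out[count] = (pid_t)tid;
        count++;
    }
    closedir(dir);
    return count;
}
```

Called with the main nagios PID (18076 in the top sample above), any
ID in the result other than the PID itself would be a candidate for
the worker thread. Where the installed procps supports them, ps -L
and top -H give a similar per-thread view.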

I believe that the scheduling really is falling behind as I have two
services defined:

  SecReport     - active service, runs every minute
  SecAliveCheck - passive service, receives output from SecReport
                  via external correlator (sec). Has a 2 minute
                  stale timer.

I am seeing a lot of stale checks being forced on SecAliveCheck. I
have added some additional rules to the SEC ruleset to detect and try
to characterize this.
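
The reasoning is that with SecReport running every minute and a 2
minute stale timer, SecAliveCheck should only go stale when results
arrive more than two minutes apart, that is, when at least one result
in a row is lost or badly delayed. A minimal sketch of that check
over a list of arrival times follows. It is a simplification, not
Nagios's actual freshness logic and not the SEC rules mentioned
above. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Count how many times the gap between consecutive passive results
 * exceeds the stale threshold (in seconds). arrivals must be sorted
 * in ascending order. */
size_t count_stale_gaps(const time_t *arrivals, size_t n, time_t threshold)
{
    size_t i, stale = 0;

    assert(arrivals != NULL || n == 0);
    for (i = 1; i < n; i++) {
        if (arrivals[i] - arrivals[i - 1] > threshold)
            stale++;
    }
    return stale;
}
```

With made-up arrival times of 0, 60, 120, 250 and 310 seconds and a
threshold of 120, only the 130 second gap between 120 and 250 counts
as stale.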

So does anybody else see higher latency using 2.7 compared to
earlier versions? Would changing the sleep time affect things (I can't
see how it would, but...)?

-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.