<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<TITLE>Message</TITLE>
<META content="MSHTML 6.00.2716.2200" name=GENERATOR></HEAD>
<BODY>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2>Greetings,</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>I would like to
discuss with others how they run Nagios in a distributed
environment.</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>We have 3 servers
running Nagios - 1 x external and 2 x internal. The external server gives outside
parties limited views of their hosts/connections into our network. The 2
internal servers are currently set up in a distributed environment, with one
server sending results to the other (via nsca) due to its geographic location
on our network. </FONT></SPAN><SPAN class=602541401-13082002><FONT face=Arial
size=2>The 'central server' not only collects the results from the other
distributed server, but also actively checks approximately half of the total
number of hosts and services. </FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>My reading of the
Nagios documentation shows that it is assumed the central server only accepts
results from the distributed servers rather than actively checking hosts and
services itself. However, I see no reason why the central server cannot
also actively check - there is no design issue that I am aware
of.</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>How do others run
their distributed setup?</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>Also, we believe
that there are scheduling issues with Nagios under different Linux kernels. With
a 2.2.20 kernel in a distributed setup, we found that the number of Nagios
processes continued to grow - i.e., there was no reaping. An strace of a child
process showed that it was waiting to write to the external command file.
However, whenever we attached strace to the parent process, the trace showed no
errors and the reaping worked as expected.</FONT></SPAN></DIV>
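<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN>&nbsp;</DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>To be clear about
what I mean by reaping: the parent collects the exit status of each finished
child so that the child does not linger in the process table. Below is a minimal
Python sketch of non-blocking reaping with waitpid() and WNOHANG. It is not
taken from Nagios (which is written in C); the function names
reap_finished_children and spawn_children are made up for illustration, and it
assumes a Unix-like system with fork().</FONT></SPAN></DIV>
<PRE>
```python
# Illustrative sketch only; this is NOT Nagios code.
# Assumes a Unix-like system where os.fork() and os.waitpid() are available.
import os


def reap_finished_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) tuples. A child killed by a signal
    is reported with the negated signal number as its exit_code.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately
            # instead of waiting for a child to finish.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # This process has no children left at all.
            break
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)
        reaped.append((pid, code))
    return reaped


def spawn_children(count):
    """Fork `count` children; child number i exits at once with code i."""
    pids = []
    for code in range(count):
        pid = os.fork()
        if pid == 0:
            # In the child: exit immediately, skipping cleanup handlers.
            os._exit(code)
        pids.append(pid)
    return pids
```
</PRE>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>A parent would call
reap_finished_children() periodically from its main loop. Note that this only
illustrates the term: a child that is still blocked on a write has not exited,
so a loop like this would not collect it.</FONT></SPAN></DIV>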
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>Therefore, we
modified the start script for Nagios to include an strace of the parent process
and ran fine with this for many months. This is with Nagios 1.0a7 through 1.0b3
and the previously undocumented 'command_check_interval=-1'.</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>Recently we upgraded
the monitoring hosts to a 2.4 kernel, and discovered an entirely different
problem. The number of Nagios processes grows exponentially until the load on
the box is so large that a hard reset is required. Again, the child processes
do not appear to be reaped as expected. An strace of the child
processes shows that they are waiting on a write to the internal pipe (Nagios
parent process) after reading the results from the external command
file.</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>We have tried
numerous ways to correct this problem, including upgrading to Nagios
1.0b4 and also including the latest base/checks.c from CVS, but cannot get
Nagios to sufficiently reap the child processes. So, until we can resolve
this problem, we have been forced to downgrade back to the 2.2 kernel,
where Nagios 1.0b4 with the CVS base/checks.c works fine (though still with the
strace on the parent process).</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial size=2>So, I would be
interested in hearing from others who are running Nagios in a distributed
setup under Linux whether they are experiencing similar
issues.</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2>Regards,</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=602541401-13082002><FONT face=Arial
size=2>Andrew</FONT></SPAN></DIV>
<DIV><SPAN class=602541401-13082002></SPAN> </DIV></BODY></HTML>