Processes hung on pipe writes (and other fun)
Jay 'Whip' Grizzard
elfchief at lupine.org
Wed Aug 20 20:00:07 CEST 2003
Greetings! A coworker recently enlisted my help to track down an
issue with his Nagios install. I think I've got a workable solution now,
but I wanted to toss it out to the mailing list for a bit of a sanity check.
The problem: Initially, it appeared that a number of Nagios subprocesses were
never exiting properly (it was reported to me as 'hung processes'). It turns
out the processes weren't -completely- hung; they were blocking on a
write(2) call that was taking a long time (2 to 120 minutes) to complete.
IIRC, the call that was actually hanging was in a function called from
run_service_check. The fd being written to is, I believe, the IPC pipe used to
report results back to the parent process.
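For anyone who wants to see the underlying behavior in isolation, here is a minimal sketch (not Nagios code) of what happens when a pipe's buffer fills. The names fill_pipe, is_writable, and demo are made up for illustration. It fills a pipe without blocking, checks that the kernel reports the write end as not writable, then drains the pipe as the parent would and checks again. A process doing an ordinary blocking write(2) would simply sleep at the point where this sketch gets EAGAIN.

```python
import os
import select


def fill_pipe(write_fd: int, chunk: bytes = b"x" * 512) -> int:
    """Write to a non-blocking pipe until the kernel refuses more data.

    Returns the number of bytes the pipe accepted, i.e. its usable capacity.
    """
    os.set_blocking(write_fd, False)
    total = 0
    while True:
        try:
            total += os.write(write_fd, chunk)
        except BlockingIOError:
            # Buffer is full; a blocking write(2) would go to sleep here.
            return total


def is_writable(write_fd: int) -> bool:
    """Ask the kernel (zero timeout) whether a write could proceed right now."""
    _, writable, _ = select.select([], [write_fd], [], 0)
    return bool(writable)


def demo() -> dict:
    """Fill a pipe, then drain it, recording writability at each stage."""
    read_fd, write_fd = os.pipe()
    try:
        capacity = fill_pipe(write_fd)
        writable_when_full = is_writable(write_fd)

        # Play the role of the parent: read everything back out of the pipe.
        remaining = capacity
        while remaining > 0:
            remaining -= len(os.read(read_fd, remaining))

        writable_after_drain = is_writable(write_fd)
        return {
            "capacity": capacity,
            "writable_when_full": writable_when_full,
            "writable_after_drain": writable_after_drain,
        }
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print(demo())
```

The reported capacity depends on the kernel, so the sketch measures it rather than assuming a value.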
After much investigation, the best conclusion I've been able to draw is that
the process scheduling on our system (Red Hat 8.0) behaves in such a way
that, after the parent process clears out its pipe a bit (thus
freeing some buffer space), some processes are always scheduled -last-, giving
other (newer) processes a chance to fill up the pipe's buffer again before
the 'hung' processes get a chance to run.
Since I didn't feel like rewriting the Linux process scheduler, I instead
opted to increase the size of the pipe buffer in the kernel (there's a
define for PIPE_SIZE that's normally set to PAGE_SIZE -- 4 KB on x86). I
increased it to (8 * PAGE_SIZE), i.e. 32 KB, and rebuilt my kernel, on the
theory that a larger buffer would give each process a much better chance of
getting its data into the buffer before it filled. Indeed,
the 'hung' processes seem to have gone away: after 24 hours, the oldest
Nagios subprocesses on the box are, at worst, one minute old.
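In case it helps anyone repeat the "how old is the oldest subprocess" check, here is a rough sketch of one way to do it. It assumes a ps that accepts the POSIX -o etime= and -o comm= format options and that the processes show up with the command name "nagios"; the helper names etime_to_seconds and ages_for are made up for this example.

```python
import subprocess


def etime_to_seconds(etime: str) -> int:
    """Convert ps elapsed time ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes) with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def ages_for(ps_output: str, command: str) -> list:
    """Return ages in seconds, oldest first, of processes named `command`.

    Expects lines of the form "<etime> <comm>" with no header row.
    """
    ages = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == command:
            ages.append(etime_to_seconds(fields[0]))
    return sorted(ages, reverse=True)


if __name__ == "__main__":
    # Separate -o options so each "=" only suppresses its own column header.
    output = subprocess.run(
        ["ps", "-e", "-o", "etime=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    print(ages_for(output, "nagios"))
```

The main daemon will normally show up as the oldest entry in that list, so the number to watch is the next one down.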
So my question is: is this a sane solution to the problem? I don't
immediately see any negative repercussions, but I'd love
to get feedback from folks more familiar with both Nagios (which I'm
fairly unfamiliar with) and Linux (I'm a Solaris geek). It seems to work
in initial testing, but I'd rather not have my systems start blowing up
spectacularly in the future... :)
Thanks!
-jay