Problems with FreeBSD and Nagios
Andreas Ericsson
ae at op5.se
Thu Dec 14 10:26:04 CET 2006
Jonathan Call wrote:
> I scanned the mailing list trying to find a solution for this. I found a
> brief discussion where someone had the same problem, but nothing was
> really said about what might be wrong.
>
> My system:
> Dual 2.8GHz P4 processors
> 4GB of RAM
> FreeBSD 6.1-RELEASE-p10
>
> Running processes:
> Nagios 2.6 (installed from ports without embedded perl or nanosleep)
> One mysqld process for the nagiosweb utility
> A few NSCA daemon processes for passive checking
> A backup tool daemon
> Apache+modssl (latest from ports)
> Basic FreeBSD services (sshd, sendmail, etc.)
>
> Problem:
> Random service and host check control processes will lock up and 'spin'
> on the CPU. This is really bad when a host check does it because it
> brings all checks to a halt. It doesn't seem to even notice that all
> checks have gone stale.
>
> It will look like this in top:
>
> PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
> 94068 nagios 1 116 0 7500K 6748K CPU2 0 727:37 30.15% nagios
> 94082 nagios 1 116 0 7500K 6748K CPU2 0 734:28 32.55% nagios
> 94104 nagios 1 116 0 7500K 6748K CPU2 0 845:21 37.42% nagios
> 75338 nagios 5 20 0 7500K 6776K kserel 0 91:33 0.00% nagios
>
> In this example the main nagios pid is 75338. The hung service and/or
> host processes are the other ones.
>
> The service checks are almost entirely custom scripts, but the host
> check is a standard check_ping that comes with the nagios program.
>
> Any ideas on how to figure out which service or host check is hung? Or
> how to deal with this problem at all?
>
Host and service checks going into infinite loops wouldn't show up as
Nagios processes spinning on the CPU, as the nagios check execution
children just sit around and wait for the plugin to finish (or for 60
seconds to pass in the default config, after which they kill it off).
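
To illustrate that wait-or-kill behaviour, here is a rough shell analogue.
It is only a sketch, not the actual Nagios C code; run_with_timeout is a
made-up name, and the 60 seconds would be passed in as the first argument.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nagios: run a command and kill it if it
# is still running after LIMIT seconds. The caller blocks in `wait` and
# uses no CPU while the command runs.
run_with_timeout() {
    limit=$1
    shift
    "$@" &                          # start the "plugin" in the background
    plugin_pid=$!
    # Watchdog: sleep, then kill the plugin if it is still alive.
    ( sleep "$limit"; kill -TERM "$plugin_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!
    status=0
    wait "$plugin_pid" || status=$? # blocks here; does not spin
    kill "$watchdog_pid" 2>/dev/null || true
    return "$status"
}
```

With something like "run_with_timeout 60 ./check_ping ..." the waiting shell
would show up in top as idle, not at 30% CPU, which is why a looping plugin
alone doesn't explain what you are seeing.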
You've found a bug in Nagios which most likely was either introduced in
the port of it, or is a result of library differences between FreeBSD
and Linux.
I wouldn't be all too surprised if it turns out that the FreeBSD pthread
implementation disallows something that the Linux version allows. Note
that this wouldn't necessarily be a bug in FreeBSD; Nagios doesn't use
the pthread API in a way that is explicitly stated as safe, but the
pthread implementations on Linux and most other unices are forgiving
enough to make it work anyway.
It's also possible that this bug only triggers on dual-CPU systems with
a particular library installed, as some kinds of timing problems and
race conditions just don't happen on single-CPU systems.
What happens if you attach gdb to one of the spinning processes (for
example pid 94068 from your top output) and ask for a backtrace?
$ gdb --pid=94068
(gdb) bt
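
If several processes are spinning at once, the same step can be scripted.
The following is only a sketch under a few assumptions: spinning_pids and
backtrace_spinning are made-up names, the 20% CPU threshold is arbitrary,
/usr/local/bin/nagios is my guess at where the port put the binary, and it
has to run as root or as the nagios user for gdb to be able to attach.

```shell
#!/bin/sh
# Hypothetical helper: read "PID %CPU COMMAND" lines on stdin and print the
# PIDs of nagios processes using more than the given CPU percentage.
spinning_pids() {
    threshold=${1:-20}
    awk -v min="$threshold" '$3 == "nagios" && $2 + 0 > min + 0 { print $1 }'
}

# Hypothetical helper: attach gdb in batch mode to each spinning process
# and print its backtrace. The binary path is an assumption.
backtrace_spinning() {
    nagios_bin=${1:-/usr/local/bin/nagios}
    cmds=$(mktemp /tmp/gdbcmds.XXXXXX) || return 1
    echo bt > "$cmds"               # the only gdb command we run
    ps -ax -o pid= -o pcpu= -o comm= | spinning_pids 20 |
    while read -r pid; do
        echo "== backtrace for pid $pid =="
        gdb -batch -x "$cmds" "$nagios_bin" "$pid"
    done
    rm -f "$cmds"
}
```

Running backtrace_spinning and posting the output should show where in the
code each process is stuck. The main nagios process is skipped because it
sits at 0% CPU.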
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231