Error 126 and 127 on multiple commands
Morris, Patrick
patrick.morris at hp.com
Fri Oct 16 11:25:01 CEST 2009
I've been running Nagios for years, and today have run into an issue
that's got me banging my head against a wall.
I've got a distributed setup, basically with two Nagios 3.1.0 machines
on Red Hat EL4 running the same checks simultaneously. Today they both
started reporting a return code of 126 or 127 for various commands that
are not missing and whose permissions do allow Nagios to run them.
For example, this happens whenever a notification is attempted:
[1255684343] Warning: Attempting to execute the command "/usr/bin/printf
"%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nNotification
Number: 2\n\nService: MYSERVICE\nHost: myhost\nAddress:
myhost.edited.com\nState: CRITICAL\n\nDate/Time: Fri Oct 16 02:12:22 PDT
2009\n\nAdditional Info:\n\n(Return code of 127 is out of bounds -
plugin may be missing)\n\nComment: : \n\nWiki:
https://wiki.link\n\nNagios:
https://nagios/nagios/cgi-bin/extinfo.cgi?type=2&host=myhost&service=MYSERVICE"
| /bin/mail -s "PROBLEM: myhost/MYSERVICE CRITICAL **" noc at mydomain.com"
resulted in a return code of 127. Make sure the script or binary you
are trying to execute actually exists...
If I use "su - nagios" and copy and paste the failed command at a
command prompt, it works. The notification commands very consistently
return a 127, while various checks (but not all of them) will return a
126 or a 127.
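
For reference, 126 and 127 are the exit statuses a POSIX shell
conventionally returns for "found but could not be executed" and
"command not found" respectively. The following is a minimal sketch of
that convention, assuming /bin/sh is available; the plugin path and the
temporary script are hypothetical stand-ins, not files from this setup,
and the sketch does not claim to reproduce how Nagios itself launches
commands.

```python
import os
import shlex
import subprocess
import tempfile


def shell_exit_code(command):
    """Run a command line through /bin/sh and return the shell's exit status."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# Case 1: the command does not exist (hypothetical path).
missing = shell_exit_code("/nonexistent/hypothetical_plugin")

# Case 2: the file exists but has no execute bit set.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as handle:
    handle.write("#!/bin/sh\nexit 0\n")
    script_path = handle.name
os.chmod(script_path, 0o644)
not_executable = shell_exit_code(shlex.quote(script_path))
os.unlink(script_path)

print("missing command:", missing)
print("non-executable file:", not_executable)
```

Running the same kind of probe as the nagios user, from the same
environment the daemon starts in, may help separate a genuine path or
permission problem from something environmental.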
Stranger still, the exact same plugin (check_http, for example) may work
fine for one service, but return an error code for another.
Now, my installation on this instance of Nagios is pretty large: 548
hosts and about 8500 services. The same check configurations and
plugins, however, are synched across 24 other Nagios boxes and assigned
to different hosts, and those all work just fine. It's just this, my
biggest installation, where they've started failing.
This feels to me like I've hit some sort of capacity limitation. I've
pared down some things (like cutting a complicated escalation
configuration from 24,000 escalations to 3,500), but that didn't help.
I've offloaded half the checks to another system that submits passive
results over nsca, but that didn't help either.
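
If the capacity theory holds, one place a ceiling of that kind could sit
is in the per-user resource limits on processes and open file
descriptors. Whether those limits are involved here is purely an
assumption; nothing in this report confirms it. A minimal sketch for
inspecting them, meant to be run as the nagios user on Linux (the
labels in LIMITS are illustrative names, not Nagios settings):

```python
import resource

# Limits that bound how many child processes and descriptors one user
# or process may hold. The labels are only for display.
LIMITS = {
    "max user processes": resource.RLIMIT_NPROC,
    "max open files": resource.RLIMIT_NOFILE,
}


def describe(value):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for each limit in LIMITS."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={describe(soft)} hard={describe(hard)}")
```

The limits that matter are the ones in effect for the running daemon,
which can differ from those of an interactive "su - nagios" session.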
I've played with a lot of tuning settings, like limiting concurrent
checks and spacing out an aggressively tuned check schedule, and
generally just screwing with stuff, but nothing's worked. I'm wondering
if someone's run into this sort of thing before and might be able to
point me at something I haven't tried yet.
For the record, there's no SELinux involved, and nothing unusual in the
system logs.