Error 126 and 127 on multiple commands
Morris, Patrick
patrick.morris at hp.com
Fri Oct 16 12:44:52 CEST 2009
Morris, Patrick wrote:
> I've been running Nagios for years, and today have run into an issue
> that's got me banging my head against a wall.
>
> I've got a distributed setup, basically with two Nagios 3.1.0 machines
> on Red Hat EL4 running the same checks simultaneously. Today they both
> started reporting a return code of 126 or 127 for various commands that
> are not missing and whose permissions do allow Nagios to run them.
>
> For example, this happens whenever a notification is attempted:
>
> [1255684343] Warning: Attempting to execute the command "/usr/bin/printf
> "%b" "***** Nagios *****\n\nNotification Type: PROBLEM\nNotification
> Number: 2\n\nService: MYSERVICE\nHost: myhost\nAddress:
> myhost.edited.com\nState: CRITICAL\n\nDate/Time: Fri Oct 16 02:12:22 PDT
> 2009\n\nAdditional Info:\n\n(Return code of 127 is out of bounds -
> plugin may be missing)\n\nComment: : \n\nWiki:
> https://wiki.link\n\nNagios:
> https://nagios/nagios/cgi-bin/extinfo.cgi?type=2&host=myhost&service=MYSERVICE"
> | /bin/mail -s "PROBLEM: myhost/MYSERVICE CRITICAL **" noc at mydomain.com"
> resulted in a return code of 127. Make sure the script or binary you
> are trying to execute actually exists...
>
> If I use "su - nagios" and copy and paste the failed command at a
> command prompt, it works. The notification commands very consistently
> return a 127, while various checks (but not all of them) will return a
> 126 or a 127.
>
> Stranger still, the exact same plugin (check_http, for example) may work
> fine for one service, but return an error code for another.
>
> Now, my installation on this instance of Nagios is pretty large: 548
> hosts and about 8500 services. The same check configurations and
> plugins, however, are synched across 24 other Nagios boxes and assigned
> to different hosts, and those all work just fine. It's just this, my
> biggest installation, where they've started failing.
>
> This feels to me like I've hit some sort of capacity limitation. I've
> pared down some things (like cutting a complicated escalation
> configuration from 24,000 escalations to 3,500), but that didn't help.
> I've offloaded half the checks to another system that submits passive
> results over nsca, but that didn't help either.
>
> I've played with a lot of tuning settings like limiting concurrent
> checks, spacing out an aggressively tuned check schedule, and generally
> just screwing with stuff, but nothing's worked, and I'm wondering if
> someone's run into this sort of thing before, and might be able to point
> me at something I haven't tried yet.
>
> For the record, there's no SELinux involved, and nothing unusual in the
> system logs.
>
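For reference on the two codes quoted above: by shell convention, 127 is what /bin/sh returns when it cannot find a command, and 126 is what it returns when the command is found but cannot be executed. Below is a minimal Python sketch that reproduces both, assuming /bin/sh is present. The helper name and the paths are made up for illustration, and this only shows where the numbers normally come from; it does not explain why Nagios is seeing them on this box.

```python
import os
import stat
import subprocess
import tempfile


def shell_exit_code(command_line):
    """Run command_line through /bin/sh -c and return the shell's exit status."""
    result = subprocess.run(
        ["/bin/sh", "-c", command_line],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# Hypothetical path, chosen only because it should not exist anywhere.
missing_rc = shell_exit_code("/nonexistent/dir/check_example")

# A script that exists but has no execute bit set (read/write for owner only).
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as fh:
    fh.write("#!/bin/sh\nexit 0\n")
    script_path = fh.name
os.chmod(script_path, stat.S_IRUSR | stat.S_IWUSR)
try:
    not_executable_rc = shell_exit_code(script_path)
finally:
    os.unlink(script_path)

print("missing command:", missing_rc)
print("not executable:", not_executable_rc)
```

Running the same helper against a failing check's full command line, as the nagios user, is one way to compare what the shell reports with what Nagios logs.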
A little more data after fighting with this all night:
That same notification command above? It works sometimes, but not
always. For example, I'm getting notifications for the passive checks I
offloaded to another machine, but not for some (though not all) of the
checks run locally, even though both use the exact same notification
command, defined as service_check_via_email in my configs.
I've tried bumping up the available file handles and setting a higher
'ulimit -n' in the startup script, clearing out all the saved status and
perfdata info, and threatening to replace the whole system with an old
dusty copy of Netsaint, but still no progress.
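
To check whether a raised 'ulimit -n' is actually in effect, and how close a process is to it, something like the following rough sketch may help. It assumes a Linux /proc filesystem and permission to read the target's fd directory; the function name is made up for illustration. Note that the limits it prints are those of the script itself, so it would have to be launched from the same startup environment as the daemon for them to be comparable.

```python
import os
import resource


def open_fd_count(pid):
    """Count entries in /proc/<pid>/fd (Linux only; needs read permission)."""
    return len(os.listdir("/proc/%d/fd" % pid))


# Demonstrated on this script's own pid; for Nagios, pass the daemon's pid.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
in_use = open_fd_count(os.getpid())

# A limit equal to resource.RLIM_INFINITY means "unlimited".
print("open descriptors: %d" % in_use)
print("soft limit: %s, hard limit: %s" % (soft_limit, hard_limit))
```

If the count for the Nagios pid sits far below the soft limit while checks are failing, file handles are probably not the bottleneck.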
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users