<div>Hey,</div><div><br></div><div>Yesterday I added a servicegroup (called NRPEchecks) containing 23.660 host-service entries to our configuration. After a restart, our Central Nagios server started failing whenever it had to perform (forced) NRPE checks and/or send notifications. Every attempt returned "(Return code of 127 is out of bounds - plugin may be missing)".</div>
<div>Oddly enough, this only seemed to happen with NRPE checks (and notification commands) that call a (bash) script. NRPE checks that do nothing more than '/usr/bin/nrpe --help | grep "Version"' ran without any problems whatsoever. SNMP check scripts and TCP checks (which run on the Nagios hosts) worked perfectly fine as well.</div>
<div>Checks performed by the distributed servers were still being processed without issues by the Central Nagios server.</div><div>Running the NRPE check command by hand as user Nagios works perfectly fine, so the file exists and is executable for Nagios.</div>
<div>When I remove the servicegroup from the configs, everything runs smoothly again.</div><div><br></div><div><div>Some background information on our distributed setup:</div><div><div>- Nagios Version 3.1.2</div>
<div>- Central servers: 1</div><div>- Distributed servers: 8</div><div><div>- Total Hosts: 2.798</div></div><div><div>- Total Services: 49.579</div>
</div><div>- Service groups: 10</div>- NRPEchecks servicegroup: 23.660 entries</div><div><br></div><div>Sizes of the objects.cache file on the Central Nagios server:<div><div>- Lines: 2.083.738 </div>
</div><div><div><div><div>- Words: 4.079.205 </div><div>- Characters: 51.916.032</div></div></div></div><div>- Size: 50M</div></div></div><div><br></div>
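<div>To give a rough sense of scale for a group of 23.660 entries, here is a small sketch that estimates how long a comma-separated member list would be. It assumes the list is rendered as "host,service,host,service,..."; the host names are made up, and the limit shown is the per-string execve limit on current Linux kernels with 4 KiB pages, which may not match the kernel we run.</div>

```python
# Rough size estimate for a comma-separated service group member list.
# Assumption: members are rendered as "host,service,host,service,...".
# The host and service names below are made up for illustration.

# Per-string execve limit on current Linux kernels with 4 KiB pages (assumption
# about the platform; older kernels used a different, total limit).
LINUX_MAX_ARG_STRLEN = 32 * 4096


def member_list_size(pairs):
    """Return the byte length of 'host1,svc1,host2,svc2,...'."""
    joined = ",".join(f"{host},{service}" for host, service in pairs)
    return len(joined.encode())


# 23,660 hypothetical entries with short names.
pairs = [(f"host{i:05d}", "Disks") for i in range(23660)]
size = member_list_size(pairs)

print(f"member list: {size} bytes, per-string limit: {LINUX_MAX_ARG_STRLEN} bytes")
print("exceeds limit:", size > LINUX_MAX_ARG_STRLEN)
```

<div>Real host names are longer than these placeholders, so the actual list would likely be larger still.</div>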
<div><br></div><div>Here is some logfile output:</div><div>Mar 5 10:11:05 src@nagioshost nagios: Warning: Return code of 127 for check of service 'Disk' on host 'testhost' was out of bounds. Make sure the plugin you're trying to run actually exists.</div>
<div>Mar 5 10:21:07 src@nagioshost nagios: Warning: Return code of 127 for check of service 'Disk' on host 'testhost' was out of bounds. Make sure the plugin you're trying to run actually exists.</div>
<div>Mar 5 10:11:05 src@nagioshost nagios: SERVICE ALERT: testhost;Disk;CRITICAL;SOFT;1;(Return code of 127 is out of bounds - plugin may be missing)</div><div>Mar 5 10:21:07 src@nagioshost nagios: SERVICE ALERT: testhost;Disk;CRITICAL;SOFT;2;(Return code of 127 is out of bounds - plugin may be missing)</div>
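<div>For context on the messages above: plugins are expected to exit with 0 to 3, and anything else is reported as out of bounds. The sketch below illustrates that mapping. It is not Nagios's actual code; the handling of out-of-range values is modelled only on the log lines above, and the function name is hypothetical.</div>

```python
# Sketch of how a plugin exit status maps to a service state.
# Codes 0-3 are the documented plugin return codes; the handling of
# other values mirrors the log lines above and is otherwise an assumption.

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify(return_code):
    """Return (state, note) for a plugin exit status."""
    if return_code in STATES:
        return STATES[return_code], ""
    note = f"(Return code of {return_code} is out of bounds"
    if return_code in (126, 127):
        # 126/127 are the shell's "not executable" / "not found" statuses
        note += " - plugin may be missing"
    return "CRITICAL", note + ")"


print(classify(0))
print(classify(127))
```

<div>An exit status of 127 therefore only says that the command could not be started; it does not say why.</div>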
<div><br></div><div><br></div><div>The relevant parts of the configuration look like this:</div><div><div class="gmail_quote">- NRPE.cfg:</div><div class="gmail_quote"><div class="gmail_quote">command[check_disk]=sh /usr/nagios/scripts/check_disk.sh</div>
<div><br></div><div>- Nagios cfgs:</div><div><div>define host{</div><div> use generic-host</div><div> host_name testhost</div><div> alias testhost</div><div>
address testhost.server</div><div>}</div></div><div><br></div><div><div>define service{</div><div> use Disks-check</div><div> hostgroup_name testhosts</div><div>
service_description Disks</div><div>}</div></div><div><br></div><div><div>define service{</div><div> use generic-service</div><div> name Disks-check</div><div>
check_command check_nrpe_disk</div><div> servicegroups NRPEchecks</div><div> contact_groups admins</div><div> register 0</div><div>}</div></div><div><br></div>
<div>
<div>define servicegroup{</div><div> servicegroup_name NRPEchecks</div><div> alias NRPEchecks</div><div>}</div></div><div><br></div><div><div>define command{</div><div> command_name check_nrpe_disk</div>
<div> command_line $USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c check_disk</div><div>}</div></div><div><br></div><div>- resource.cfg:</div><div><div>$USER1$=/usr/lib/nagios/plugins</div></div><div><br></div><div>
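<div>The command definition, the host address and the $USER1$ value above combine into the final command line through macro substitution. A minimal sketch of that substitution follows; it is not Nagios's implementation, and the macro table holds only the two values visible in the configuration above.</div>

```python
import re

# Minimal $MACRO$ substitution sketch; not Nagios's actual implementation.
# The macro table holds only the two values visible in the configuration above.
MACROS = {
    "USER1": "/usr/lib/nagios/plugins",
    "HOSTADDRESS": "testhost.server",
}


def expand(command_line, macros):
    """Replace each $NAME$ with its value; unknown macros become empty strings."""
    return re.sub(
        r"\$([A-Z0-9_]+)\$",
        lambda match: macros.get(match.group(1), ""),
        command_line,
    )


raw = "$USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c check_disk"
print(expand(raw, MACROS))
```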
<br></div><div>When forcing a check on the Central Nagios server the Nagios debugging output showed this:</div><div><div><div><div>[1267785907.716681] [008.0] [pid=24083] ** Timed Event ** Type: 0, Run Time: Fri Mar 5 11:37:31 2010</div>
<div>[1267785907.716686] [008.0] [pid=24083] ** Service Check Event ==> Host: 'testhost', Service: 'Disk', Options: 1, Latency: 456.716000 sec</div><div>[1267785907.716697] [001.0] [pid=24083] run_scheduled_service_check() start</div>
<div>[1267785907.716701] [016.0] [pid=24083] Attempting to run scheduled check of service 'Disk' on host 'testhost': check options=1, latency=456.716000</div><div>[1267785907.716708] [001.0] [pid=24083] run_async_service_check()</div>
<div>[1267785907.716712] [001.0] [pid=24083] check_service_check_viability()</div><div>[1267785907.716716] [016.0] [pid=24083] Checking service 'Disk' on host 'testhost'...</div><div>[1267785907.716732] [001.0] [pid=24083] get_raw_command_line()</div>
<div>[1267785907.716738] [2320.2] [pid=24083] Raw Command Input: $USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c check_disk</div><div>[1267785907.716744] [2320.2] [pid=24083] Expanded Command Output: $USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c check_disk</div>
<div>[1267785907.716748] [001.0] [pid=24083] process_macros()</div><div>[1267785907.716753] [2048.1] [pid=24083] **** BEGIN MACRO PROCESSING ***********</div><div>[1267785907.716756] [2048.1] [pid=24083] Processing: '$USER1$/check_nrpe -t 30 -H $HOSTADDRESS$ -c check_disk'</div>
<div>[1267785907.716760] [2048.2] [pid=24083] Processing part: ''</div><div>[1267785907.716766] [2048.2] [pid=24083] Not currently in macro. Running output (0): ''</div><div>[1267785907.716785] [2048.2] [pid=24083] Processing part: 'USER1'</div>
<div>[1267785907.716794] [2048.2] [pid=24083] Processed 'USER1', Clean Options: 0, Free: 0</div><div>[1267785907.716798] [2048.2] [pid=24083] Processed 'USER1', Clean Options: 0, Free: 0</div><div>[1267785907.716802] [2048.2] [pid=24083] Cleaning options: global=0, local=0, effective=0</div>
<div>[1267785907.716806] [2048.2] [pid=24083] Uncleaned macro. Running output (23): '/usr/lib/nagios/plugins'</div><div>[1267785907.716810] [2048.2] [pid=24083] Just finished macro. Running output (23): '/usr/lib/nagios/plugins'</div>
<div>[1267785907.716814] [2048.2] [pid=24083] Processing part: '/check_nrpe -t 30 -H '</div><div>[1267785907.716818] [2048.2] [pid=24083] Not currently in macro. Running output (44): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H '</div>
<div>[1267785907.716821] [2048.2] [pid=24083] Processing part: 'HOSTADDRESS'</div><div>[1267785907.716825] [2048.2] [pid=24083] macro_x[2] (HOSTADDRESS) match.</div><div>[1267785907.716831] [2048.2] [pid=24083] Processed 'HOSTADDRESS', Clean Options: 0, Free: 1</div>
<div>[1267785907.716834] [2048.2] [pid=24083] Processed 'HOSTADDRESS', Clean Options: 0, Free: 1</div><div>[1267785907.716838] [2048.2] [pid=24083] Cleaning options: global=0, local=0, effective=0</div><div>[1267785907.716843] [2048.2] [pid=24083] Uncleaned macro. Running output (64): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server'</div>
<div>[1267785907.716847] [2048.2] [pid=24083] Just finished macro. Running output (64): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server'</div><div>[1267785907.716851] [2048.2] [pid=24083] Processing part: ' -c check_disk'</div>
<div>[1267785907.716856] [2048.2] [pid=24083] Not currently in macro. Running output (91): '/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server -c check_disk'</div><div>[1267785907.716860] [2048.1] [pid=24083] Done. Final output: '/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server -c check_disk'</div>
<div>[1267785907.716863] [2048.1] [pid=24083] **** END MACRO PROCESSING *************</div><div>[1267785907.716896] [016.1] [pid=24083] Check result output will be written to '/var/nagios/spool/checkresults/checkbmXvnd' (fd=8)</div>
<div>[1267785907.722094] [016.2] [pid=24083] Service check is executing in child process (pid=26053)</div><div>[1267785907.722949] [001.0] [pid=26053] process_macros()</div><div>[1267785907.722988] [001.0] [pid=26053] process_macros()</div>
<div>[1267785907.722996] [001.0] [pid=26053] process_macros()</div><div>[1267785907.723003] [001.0] [pid=26053] process_macros()</div><div>[1267785907.723010] [001.0] [pid=26053] process_macros()</div><div>[1267785907.723017] [001.0] [pid=26053] process_macros()</div>
<div>[1267785907.738768] [001.0] [pid=26053] process_macros()</div><div>[1267785907.738803] [001.0] [pid=26053] process_macros()</div><div>[1267785907.738817] [001.0] [pid=26053] process_macros()</div><div>[1267785907.738827] [001.0] [pid=26053] process_macros()</div>
<div>[1267785907.738855] [001.0] [pid=26053] process_macros()</div><div>[1267785907.738865] [001.0] [pid=26053] process_macros()</div><div>[1267785912.044580] [016.2] [pid=26056] Moving temp check result file '/var/nagios/spool/checkresults/checkbmXvnd' to queue file '/var/nagios/spool/checkresults/cvxmTpo'...</div>
<div>[1267785912.274671] [001.0] [pid=24083] handle_timed_event() end</div><div>[1267785912.274709] [008.1] [pid=24083] ** Event Check Loop</div></div></div><div><div>[1267785912.274743] [008.1] [pid=24083] Next High Priority Event Time: Fri Mar 5 11:45:08 2010</div>
<div>[1267785912.274753] [008.1] [pid=24083] Next Low Priority Event Time: Fri Mar 5 11:42:29 2010</div><div>[1267785912.274757] [008.1] [pid=24083] Current/Max Service Checks: 1/0</div><div>[1267785912.274763] [001.0] [pid=24083] handle_timed_event() start</div>
<div>[1267785912.274771] [008.0] [pid=24083] ** Timed Event ** Type: 5, Run Time: Fri Mar 5 11:45:08 2010</div><div>[1267785912.274776] [008.0] [pid=24083] ** Check Result Reaper</div><div>[1267785912.274780] [001.0] [pid=24083] reap_check_results() start</div>
<div>[1267785912.274783] [016.0] [pid=24083] Starting to reap check results.</div><div>[1267785912.274807] [016.1] [pid=24083] Starting to read check result queue '/var/nagios/spool/checkresults'...</div><div>[1267785912.274836] [016.1] [pid=24083] Processing check result file: '/var/nagios/spool/checkresults/cvxmTpo'</div>
<div>[1267785912.274961] [016.2] [pid=24083] Found a check result (#1) to handle...</div><div>[1267785912.274978] [016.1] [pid=24083] Handling check result for service 'Disk' on host 'testhost'...</div><div>
[1267785912.274983] [001.0] [pid=24083] handle_async_service_check_result()</div><div>[1267785912.274986] [016.0] [pid=24083] ** Handling check result for service 'Disk' on host 'testhost'...</div><div>[1267785912.274990] [016.1] [pid=24083] HOST: testhost, SERVICE: Disk, CHECK TYPE: Active, OPTIONS: 1, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 127, OUTPUT: (null)</div>
<div>[1267785912.275056] [016.2] [pid=24083] ST: SOFT CA: 3 MA: 3 CS: 2 LS: 2 LHS: 0</div><div>[1267785912.275062] [016.2] [pid=24083] Service had a HARD STATE CHANGE!!</div><div>[1267785912.275066] [016.1] [pid=24083] Service is in a non-OK state!</div>
<div>[1267785912.275070] [016.1] [pid=24083] Host is currently UP, so we'll recheck its state to make sure...</div><div>[1267785912.275074] [016.1] [pid=24083] * Using last known host state: 0</div><div>[1267785912.275078] [016.1] [pid=24083] Current/Max Attempt(s): 3/3</div>
<div>[1267785912.275082] [016.1] [pid=24083] Service has reached max number of rechecks, so we'll handle the error...</div><div>[1267785912.275093] [001.0] [pid=24083] process_macros()</div><div>[1267785912.275097] [2048.1] [pid=24083] **** BEGIN MACRO PROCESSING ***********</div>
<div>[1267785912.275101] [2048.1] [pid=24083] Processing: 'SERVICE ALERT: testhost;Disk;$SERVICESTATE$;$SERVICESTATETYPE$;$SERVICEATTEMPT$;(Return code of 127 is out of bounds - plugin may be missing)</div><div>'</div>
<div>[1267785912.275105] [2048.2] [pid=24083] Processing part: 'SERVICE ALERT: testhost;Disk;'</div><div>[1267785912.275109] [2048.2] [pid=24083] Not currently in macro. Running output (40): 'SERVICE ALERT: testhost;Disk;'</div>
<div>[1267785912.275114] [2048.2] [pid=24083] Processing part: 'SERVICESTATE'</div><div>[1267785912.275118] [2048.2] [pid=24083] macro_x[4] (SERVICESTATE) match.</div><div>[1267785912.275124] [2048.2] [pid=24083] Processed 'SERVICESTATE', Clean Options: 0, Free: 1</div>
<div>[1267785912.275128] [2048.2] [pid=24083] Processed 'SERVICESTATE', Clean Options: 0, Free: 1</div><div>[1267785912.275131] [2048.2] [pid=24083] Cleaning options: global=0, local=0, effective=0</div><div>
[1267785912.275136] [2048.2] [pid=24083] Uncleaned macro. Running output (48): 'SERVICE ALERT: testhost;Disk;CRITICAL'</div>
<div>[1267785912.275140] [2048.2] [pid=24083] Just finished macro. Running output (48): 'SERVICE ALERT: testhost;Disk;CRITICAL'</div><div>[1267785912.275144] [2048.2] [pid=24083] Processing part: ';'</div>
<div>[1267785912.275148] [2048.2] [pid=24083] Not currently in macro. Running output (49): 'SERVICE ALERT: testhost;Disk;CRITICAL;'</div><div>[1267785912.275152] [2048.2] [pid=24083] Processing part: 'SERVICESTATETYPE'</div>
<div>[1267785912.275156] [2048.2] [pid=24083] macro_x[42] (SERVICESTATETYPE) match.</div><div>[1267785912.275160] [2048.2] [pid=24083] Processed 'SERVICESTATETYPE', Clean Options: 0, Free: 1</div><div>[1267785912.275165] [2048.2] [pid=24083] Processed 'SERVICESTATETYPE', Clean Options: 0, Free: 1</div>
<div>[1267785912.275168] [2048.2] [pid=24083] Cleaning options: global=0, local=0, effective=0</div><div>[1267785912.275172] [2048.2] [pid=24083] Uncleaned macro. Running output (53): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD'</div>
<div>[1267785912.275176] [2048.2] [pid=24083] Just finished macro. Running output (53): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD'</div><div>[1267785912.275180] [2048.2] [pid=24083] Processing part: ';'</div>
<div>[1267785912.275183] [2048.2] [pid=24083] Not currently in macro. Running output (54): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;'</div><div>[1267785912.275187] [2048.2] [pid=24083] Processing part: 'SERVICEATTEMPT'</div>
<div>[1267785912.275191] [2048.2] [pid=24083] macro_x[6] (SERVICEATTEMPT) match.</div><div>[1267785912.275196] [2048.2] [pid=24083] Processed 'SERVICEATTEMPT', Clean Options: 0, Free: 1</div><div>[1267785912.275200] [2048.2] [pid=24083] Processed 'SERVICEATTEMPT', Clean Options: 0, Free: 1</div>
<div>[1267785912.275204] [2048.2] [pid=24083] Cleaning options: global=0, local=0, effective=0</div><div>[1267785912.275208] [2048.2] [pid=24083] Uncleaned macro. Running output (55): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;3'</div>
<div>[1267785912.275221] [2048.2] [pid=24083] Just finished macro. Running output (55): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;3'</div><div>[1267785912.275226] [2048.2] [pid=24083] Processing part: ';(Return code of 127 is out of bounds - plugin may be missing)</div>
<div>'</div><div>[1267785912.275230] [2048.2] [pid=24083] Not currently in macro. Running output (118): 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;3;(Return code of 127 is out of bounds - plugin may be missing)</div>
<div>'</div><div>[1267785912.275234] [2048.1] [pid=24083] Done. Final output: 'SERVICE ALERT: testhost;Disk;CRITICAL;HARD;3;(Return code of 127 is out of bounds - plugin may be missing)</div><div>'</div><div>
[1267785912.275237] [2048.1] [pid=24083] **** END MACRO PROCESSING *************</div></div></div></div></div><div><br></div><div><br></div><div class="gmail_quote">We also ran an strace on the Nagios process while forcing a check:<br>
[pid 16879] clone(Process 16880 attached<br>
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,<br>
child_tidptr=0x7fd49a633780) = 16880<br>
[pid 16880] rt_sigaction(SIGPIPE, {SIG_DFL, [PIPE],<br>
SA_RESTORER|SA_RESTART, 0x7fd499ef03c0}, <unfinished ...><br>
[pid 16880] <... rt_sigaction resumed> NULL, 8) = 0<br>
[pid 16880] close(5 <unfinished ...><br>
[pid 16880] <... close resumed> ) = 0<br>
[pid 16880] dup2(8, 1 <unfinished ...><br>
[pid 16880] <... dup2 resumed> ) = 1<br>
[pid 16880] close(8) = 0<br>
[pid 16880] execve("/bin/sh", ["sh", "-c",<br>
"/usr/lib/nagios/plugins/check_nrpe -t 30 -H testhost.server -c<br>
check_disk"], [/* 197 vars */]) = -1 E2BIG (Argument list<br>
too long)<br>
[pid 16880] exit_group(127) = ?<br>
Process 16880 detached<br>
[pid 16879] wait4(16880, [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0,<br>
NULL) = 16880<br><br></div><div><br></div><div>The problem seems to be related/similar to this one:</div><div><div><a href="https://sourceforge.net/mailarchive/message.php?msg_id=1234329173.3569.70.camel@localhost.localdomain">https://sourceforge.net/mailarchive/message.php?msg_id=1234329173.3569.70.camel@localhost.localdomain</a></div>
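<div>The E2BIG failure in the strace output above can probably be reproduced outside Nagios by starting a child process with one oversized environment variable. The sketch below does that. The variable name NAGIOS_SERVICEGROUPMEMBERS assumes the macro is exported to the check's environment under that name, the value is dummy data, and the function name is hypothetical.</div>

```python
import errno
import subprocess

# Reproduce E2BIG by passing one oversized environment variable to a child.
# NAGIOS_SERVICEGROUPMEMBERS is used as the variable name on the assumption
# that the macro is exported to the check's environment; the value is dummy data.


def exec_errno(env_value_size):
    """Run 'sh -c true' with a large env var; return the errno on failure, else None."""
    env = {
        "PATH": "/bin:/usr/bin",
        "NAGIOS_SERVICEGROUPMEMBERS": "x" * env_value_size,
    }
    try:
        subprocess.run(["/bin/sh", "-c", "true"], env=env, check=True)
    except OSError as exc:
        return exc.errno
    return None


print("small env:", exec_errno(1_000))
result = exec_errno(400_000)
print("large env:", errno.errorcode.get(result, result))
```

<div>On Linux the second call should fail before the shell even starts, which would match the execve() result in the strace.</div>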
</div><div><br></div><div>My guess is that the "Argument list too long" error is caused by the $SERVICEGROUPMEMBERS$ macro, which becomes very large with 23.660 entries in the group.</div><div><br></div><div><br></div><div>Cheers,</div><div><br></div><div>Jeffrey Lensen</div><div>Hyves │ System Engineering<br>
<a href="mailto:jeffrey@hyves.nl">jeffrey@hyves.nl</a> │ <a href="http://skyler.hyves.nl">skyler.hyves.nl</a> │ <a href="http://www.hyves.nl">www.hyves.nl</a></div>