Nagios stop hangs in FUTEX_WAIT
Ethan Galstad
nagios at nagios.org
Thu Feb 22 22:09:13 CET 2007
Ethan Galstad wrote:
> Herbert Straub wrote:
>> If I try to stop Nagios with /etc/init.d/nagios stop on Fedora Core 4/6
>> with Nagios 2.4 and 2.7, I get the message:
>>
>> Warning - running nagios did not exit in time
>>
>> The nagios process hangs in a futex wait. Example:
>>
>> [root@xen1 ~]# strace -p 11620
>> Process 11620 attached - interrupt to quit
>> futex(0x2aaaabf15980, FUTEX_WAIT, 2, NULL
>>
>> This does not happen on every stop, but on about 60% of stop attempts. I
>> built nagios with debugging info, attached to the hanging process with gdb,
>> and saw three threads with the following stack traces:
>>
>> thread 1:
>>
>> #0 0x0000003663ad9298 in __lll_mutex_lock_wait () from /lib64/libc.so.6
>> #1 0x0000003663a730e8 in _L_lock_14830 () from /lib64/libc.so.6
>> #2 0x0000003663a723ab in realloc () from /lib64/libc.so.6
>> #3 0x0000003663a66224 in _IO_mem_finish () from /lib64/libc.so.6
>> #4 0x0000003663a5e2ef in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
>> #5 0x0000003663ac9bf1 in __vsyslog_chk () from /lib64/libc.so.6
>> #6 0x0000003663aca120 in syslog () from /lib64/libc.so.6
>> #7 0x0000000000424227 in write_to_syslog (buffer=0x7fffa9aaaeb0 "Caught SIGTERM, shutting down...\n", data_type=64) at logging.c:229
>> #8 0x00000000004248c9 in write_to_all_logs (buffer=0x7fffa9aaaeb0 "Caught SIGTERM, shutting down...\n", data_type=64) at logging.c:123
>> #9 0x000000000042b09e in sighandler (sig=<value optimized out>) at utils.c:3410
>> #10 <signal handler called>
>> #11 0x0000003663a94809 in fork () from /lib64/libc.so.6
>> #12 0x000000000042f8b2 in my_system (cmd=0x7fffa9aac6b0 "/usr/local/share/nagios2/eventhandlers/process_perfdata.pl", timeout=5, early_timeout=0x7fffa9aacebc, exectime=0x7fffa9aaceb0, output=0x0, output_length=0) at utils.c:2699
>> #13 0x00000000004536a3 in xpddefault_run_service_performance_data_command (svc=0x14672c0) at ../xdata/xpddefault.c:469
>> #14 0x0000000000453729 in xpddefault_update_service_performance_data (svc=0x1200011) at ../xdata/xpddefault.c:400
>> #15 0x0000000000453305 in update_service_performance_data (svc=0x1200011) at perfdata.c:91
>> #16 0x0000000000413855 in reap_service_checks () at checks.c:1396
>> #17 0x0000000000421ad2 in handle_timed_event (event=0x778c30) at events.c:1254
>> #18 0x0000000000421e73 in event_execution_loop () at events.c:965
>> #19 0x000000000040efa7 in main (argc=<value optimized out>, argv=<value optimized out>, env=0x7fffa9aae280) at nagios.c:710
>>
>>
>> thread 2:
>>
>> #0 0x0000003663ac4a36 in poll () from /lib64/libc.so.6
>> #1 0x0000000000429ace in service_result_worker_thread (arg=<value optimized out>) at utils.c:4775
>> #2 0x0000003664606305 in start_thread () from /lib64/libpthread.so.0
>> #3 0x0000003663acd50d in clone () from /lib64/libc.so.6
>>
>> thread 3:
>> #0 0x0000003663ac6ac2 in select () from /lib64/libc.so.6
>> #1 0x000000000042996e in command_file_worker_thread (arg=<value optimized out>) at utils.c:4943
>> #2 0x0000003664606305 in start_thread () from /lib64/libpthread.so.0
>> #3 0x0000003663acd50d in clone () from /lib64/libc.so.6
>>
>> Source part of thread 1:
>> else if(sig<16){
>>
>> sigshutdown=TRUE;
>>
>> sprintf(temp_buffer,"Caught SIG%s, shutting down...\n",sigs[sig]);
>> ---> write_to_all_logs(temp_buffer,NSLOG_PROCESS_INFO);
>>
>> Source part of thread 2:
>> while(1){
>>
>> /* should we shutdown? */
>> pthread_testcancel();
>>
>> /* wait for data to arrive */
>> /* select seems to not work, so we have to use poll instead */
>> pfd.fd=ipc_pipe[0];
>> pfd.events=POLLIN;
>> ---> pollval=poll(&pfd,1,500);
>>
>> Source part of thread 3:
>> while(1){
>>
>> /* should we shutdown? */
>> pthread_testcancel();
>>
>> /**** POLL() AND SELECT() DON'T SEEM TO WORK ****/
>> /* wait a bit */
>> tv.tv_sec=0;
>> tv.tv_usec=500000;
>> ---> select(0,NULL,NULL,NULL,&tv);
>>
>> /* should we shutdown? */
>>
>>
>> Next I removed the call to write_to_all_logs in the signal handler routine:
>>
>> --- base/utils.c.orig 2007-02-05 21:16:13.000000000 +0100
>> +++ base/utils.c 2007-02-05 21:11:02.000000000 +0100
>> @@ -3406,8 +3406,10 @@
>>
>> sigshutdown=TRUE;
>>
>> + /* Straub
>> sprintf(temp_buffer,"Caught SIG%s, shutting down...\n",sigs[sig]);
>> write_to_all_logs(temp_buffer,NSLOG_PROCESS_INFO);
>> + */
>>
>> #ifdef DEBUG2
>> printf("%s\n",temp_buffer);
>>
>>
>> Now the Nagios stop works every time. My question: is this a known problem, a new one, or something specific to my system?
>>
>> Regards
>> Herbert Straub
>>
>
> Strange. I haven't heard reports of this happening before and I've
> never encountered this myself. I run FC4 on my development box, but it's
> a 32-bit machine and it looks like you've got 64-bit hw. Correct? I'll
> try installing FC6 this weekend and see if I can replicate it.
>
> Has this always happened for you, or was there a recent update of some
> kind that caused this? Also, how much time passed between using the
> init script to stop Nagios and the error message appearing?
>
Just checked Google and found the following page:
http://www.meulie.net/portal_plugins/forum/forum_viewtopic.php?7706
Are you using the RPM install of Nagios? If so, can you try compiling
directly from the Nagios source code and seeing if the problem persists?
Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org
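
A note on the backtrace quoted above: thread 1 shows the SIGTERM handler
interrupting fork() and then blocking in realloc() underneath syslog().
Neither syslog() nor the malloc family is async-signal-safe, so a deadlock
on a libc-internal lock is a plausible reading of that trace. A common way
to avoid this class of problem is to have the handler only set a flag and
to do the logging later from the main loop. What follows is a minimal
sketch of that pattern, not the actual Nagios fix; the names
pending_signal, shutdown_handler, install_shutdown_handler and
check_for_shutdown are hypothetical and do not come from the Nagios source.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <string.h>
#include <syslog.h>

/* Hypothetical names throughout; none of this is taken from Nagios. */

/* Written by the handler, read by the main loop. volatile sig_atomic_t
   is the object type that may safely be written from a signal handler. */
static volatile sig_atomic_t pending_signal = 0;

/* The handler only records the signal number. It calls no stdio,
   syslog() or malloc()-family function, so it cannot block on a libc
   lock that the interrupted code already holds. */
static void shutdown_handler(int sig)
{
    pending_signal = sig;
}

/* Install the handler for SIGTERM; returns 0 on success, -1 on error. */
int install_shutdown_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = shutdown_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Called from the main event loop, outside signal context. If a signal
   is pending, log it and return its number; otherwise return 0. */
int check_for_shutdown(void)
{
    int sig = pending_signal;

    if (sig == 0)
        return 0;

    pending_signal = 0;
    /* Safe here: we are in normal program flow, not inside the handler. */
    syslog(LOG_INFO, "Caught signal %d, shutting down...", sig);
    return sig;
}
```

In this sketch the event loop would call check_for_shutdown() once per
iteration and begin its normal shutdown path when the return value is
non-zero. The log message is written a little later than it would be from
inside the handler, but never while the interrupted code may be holding a
libc lock.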