SIGXFSZ causes nagios to exit silently with nagios 2.9
Ethan Galstad
nagios at nagios.org
Tue Aug 21 04:02:43 CEST 2007
John Rouillard wrote:
> Hi all:
>
> I am seeing the top-level nagios daemon exiting shortly after startup
> (after its first few scheduled service checks are started). When it
> exits it doesn't log anything, nor does it clear out the status files to
> indicate to the web interface that it has exited.
>
> When run under gdb I see:
>
> Program received signal SIGXFSZ, File size limit exceeded.
> (gdb) where
> #0 0x0060a7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1 0x006dc11b in __write_nocancel () from /lib/tls/libc.so.6
> #2 0x0068109f in _IO_new_file_write () from /lib/tls/libc.so.6
> #3 0x0067fafb in _IO_new_do_write () from /lib/tls/libc.so.6
> #4 0x006807a2 in _IO_new_file_sync () from /lib/tls/libc.so.6
> #5 0x00675af2 in fflush () from /lib/tls/libc.so.6
> #6 0x0808f8d9 in xpddefault_update_service_performance_data_file (
> svc=0x9da19d0) at ../xdata/xpddefault.c:677
> #7 0x0808f8fc in xpddefault_update_service_performance_data (svc=0x9da19d0)
> at ../xdata/xpddefault.c:403
> #8 0x0808e8a1 in update_service_performance_data (svc=0x9da19d0)
> at perfdata.c:91
> #9 0x08057b78 in reap_service_checks () at checks.c:1415
> #10 0x08063790 in handle_timed_event (event=0x9a41ca0) at events.c:1255
> #11 0x08063e51 in event_execution_loop () at events.c:966
> #12 0x08053ad5 in main (argc=2, argv=0xbfeead04) at nagios.c:715
>
> Now I am hitting the 2GB limit on the service perfdata file:
>
> [rouilj at ops01 ~]$ ls -lh /var/spool/nagios/tmp/service-perfdata
> -rw-rw-r-- 1 nagios nagios 2.0G Jun 2 09:21 /var/spool/nagios/tmp/service-perfdata
>
> (exact size 2147483647 bytes). The file size ulimit on the process is
> unlimited.
> [rouilj at ops01 ~]$ ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 1024
> max locked memory (kbytes, -l) 32
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) 10240
> cpu time (seconds, -t) unlimited
> max user processes (-u) 73728
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> It's a 32 bit kernel i686. uname -a reports:
>
> Linux ops01.renesys.com 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 10:11:19
> EST 2007 i686 i686 i386 GNU/Linux
>
> I think nagios can handle this case better by:
>
> 1) Trapping the SIGXFSZ signal so it doesn't exit
> 2) Logging an error to nagios.log
> 3) (Scheduling a) close and reopen of host_perfdata_file and
> service_perfdata_file, allowing the user to rotate the file on command,
> or re-enable perfdata logging by moving the files aside and
> having nagios recreate the files.
>
> 3 is kind of a hack, but there is no signal currently that closes and
> reopens the output files (host_perfdata_file, service_perfdata_file)
> without resetting all of the nagios daemon's internal state. With 3
> implemented, it is possible to rotate these files without resetting
> nagios's internal state (current scheduled services queue for example)
> on user demand.
>
> Alternatively the log rotation mechanism currently available for the
> main log file (nagios.log) could be extended to automatically rotate
> and archive these files. I would be happy if all the files were
> rotated/archived on the same schedule as the main log file, but people
> will probably want the following options in nagios.cfg:
>
> host_perfdata_rotation_method, service_perfdata_rotation_method:
> no rotation, hourly, daily, weekly, monthly.
>
> host_perfdata_archive_path, service_perfdata_archive_path:
> move host_perfdata_file, service_perfdata_file to the archive
> directory with a timestamped extension similar to nagios log file.
>
> Now this does bring up an interesting question: does anybody have a
> status.dat or retention.dat (or less likely comments.dat or
> downtime.dat) file that is approaching 2GB? What will happen to nagios
> when this limit is hit?
>
> As an alternative nagios could take the performance hit and use the
> 64-bit file-access and file-locking system calls instead of the
> regular calls for the files where this is liable to be an issue. Hmm,
> can you mix 32 bit and 64 bit file i/o in a single program?
>
> Since nagios exited on the signal, I just moved the service perfdata
> file aside and restarted nagios to get it operating again.
>
> -- rouilj
> John Rouillard
> ===========================================================================
> My employers don't acknowledge my existence much less my opinions.
Hmm, I've never heard of anyone else with this issue yet, but I guess
they're rotating perfdata files more often than you are. Or perhaps you
are monitoring a *very* large system. :-)
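On the signal itself, points 1 and 2 of the proposal quoted above could be
sketched roughly as follows. This is only an illustration, not code from
the Nagios tree: ignore_sigxfsz() and write_perfdata() are made-up names,
the log message goes to stderr rather than nagios.log, and the sketch calls
write(2) directly where xpddefault.c goes through stdio and fflush(). The
idea is that once SIGXFSZ is ignored, the oversized write should fail with
EFBIG instead of killing the daemon, so the failure can at least be logged.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: ignore SIGXFSZ so that a write past the file size
 * limit returns an error instead of terminating the process.
 * Returns 0 on success, -1 on error (errno set by sigaction). */
int ignore_sigxfsz(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGXFSZ, &sa, NULL);
}

/* Hypothetical helper: write len bytes from buf to fd, retrying on EINTR
 * and on short writes. Returns 0 on success, -1 on error with errno set.
 * An EFBIG failure is reported on stderr (a stand-in for nagios.log). */
int write_perfdata(int fd, const char *buf, size_t len)
{
    assert(buf != NULL);

    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            int saved = errno;

            if (saved == EINTR)
                continue;
            if (saved == EFBIG)
                fprintf(stderr, "perfdata file hit its size limit; "
                                "rotate it to resume logging\n");
            errno = saved;  /* fprintf may have changed errno */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would run ignore_sigxfsz() once at startup and treat a -1 from
write_perfdata() as "stop writing perfdata until the file is rotated".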
You can use these two config file options to run a command at a
specified interval to rotate the perfdata logs or do whatever you want.
host_perfdata_file_processing_interval=60
host_perfdata_file_processing_command=somecommand
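Since the file in question is the service perfdata file, the
service_perfdata_file_processing_interval and
service_perfdata_file_processing_command counterparts of the options above
are presumably the ones to set. As a rough idea of what the command behind
"somecommand" might do, here is a minimal sketch in C that moves the file
aside under a timestamped name, similar to how the main log is archived.
The function names and the name format are invented for illustration, and
the sketch assumes Nagios recreates the file on its next write, which is
worth verifying on your version.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: build "<path>.<YYYYmmdd-HHMMSS>" into dst.
 * Returns 0 on success, -1 if the time cannot be formatted or the
 * resulting name does not fit in dstlen bytes. */
int build_rotated_name(const char *path, time_t now, char *dst, size_t dstlen)
{
    char stamp[32];
    struct tm *tm;
    int n;

    assert(path != NULL && dst != NULL);

    tm = localtime(&now);
    if (tm == NULL)
        return -1;
    if (strftime(stamp, sizeof(stamp), "%Y%m%d-%H%M%S", tm) == 0)
        return -1;

    n = snprintf(dst, dstlen, "%s.%s", path, stamp);
    return (n < 0 || (size_t)n >= dstlen) ? -1 : 0;
}

/* Hypothetical helper: rename the perfdata file to its timestamped name.
 * On success returns 0 and leaves the new name in dst; on failure
 * returns -1 with errno set. */
int rotate_perfdata(const char *path, time_t now, char *dst, size_t dstlen)
{
    if (build_rotated_name(path, now, dst, dstlen) != 0) {
        errno = ENAMETOOLONG;
        return -1;
    }
    return rename(path, dst);
}
```

A small main() that calls rotate_perfdata(argv[1], time(NULL), ...) and is
wired up as a Nagios command definition would be enough to keep the file
well below 2GB at a 60 second interval.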
Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org