SIGXFSZ causes nagios to exit silently with nagios 2.9

John Rouillard rouilj+nagiosdev at cs.umb.edu
Mon Jun 4 17:28:00 CEST 2007


Hi all:

I am seeing the top level nagios daemon exiting shortly after startup
(after its first few scheduled service checks are started). When it
exits it doesn't log anything, nor does it clear out the status files to
indicate to the web interface that it has exited.

When run under gdb I see:

  Program received signal SIGXFSZ, File size limit exceeded.
  (gdb) where
  #0  0x0060a7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  #1  0x006dc11b in __write_nocancel () from /lib/tls/libc.so.6
  #2  0x0068109f in _IO_new_file_write () from /lib/tls/libc.so.6
  #3  0x0067fafb in _IO_new_do_write () from /lib/tls/libc.so.6
  #4  0x006807a2 in _IO_new_file_sync () from /lib/tls/libc.so.6
  #5  0x00675af2 in fflush () from /lib/tls/libc.so.6
  #6  0x0808f8d9 in xpddefault_update_service_performance_data_file (
      svc=0x9da19d0) at ../xdata/xpddefault.c:677
  #7  0x0808f8fc in xpddefault_update_service_performance_data (svc=0x9da19d0)
      at ../xdata/xpddefault.c:403
  #8  0x0808e8a1 in update_service_performance_data (svc=0x9da19d0)
      at perfdata.c:91
  #9  0x08057b78 in reap_service_checks () at checks.c:1415
  #10 0x08063790 in handle_timed_event (event=0x9a41ca0) at events.c:1255
  #11 0x08063e51 in event_execution_loop () at events.c:966
  #12 0x08053ad5 in main (argc=2, argv=0xbfeead04) at nagios.c:715

Now I am hitting the 2GB limit on the service perfdata file:

  [rouilj@ops01 ~]$ ls -lh /var/spool/nagios/tmp/service-perfdata
  -rw-rw-r--  1 nagios nagios 2.0G Jun  2 09:21 /var/spool/nagios/tmp/service-perfdata

(exact size 2147483647 bytes). The file size ulimit on the process is
unlimited:

  [rouilj@ops01 ~]$ ulimit -a
  core file size          (blocks, -c) 0
  data seg size           (kbytes, -d) unlimited
  file size               (blocks, -f) unlimited
  pending signals                 (-i) 1024
  max locked memory       (kbytes, -l) 32
  max memory size         (kbytes, -m) unlimited
  open files                      (-n) 1024
  pipe size            (512 bytes, -p) 8
  POSIX message queues     (bytes, -q) 819200
  stack size              (kbytes, -s) 10240
  cpu time               (seconds, -t) unlimited
  max user processes              (-u) 73728
  virtual memory          (kbytes, -v) unlimited
  file locks                      (-x) unlimited

It's a 32-bit i686 kernel. uname -a reports:

  Linux ops01.renesys.com 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 10:11:19
  EST 2007 i686 i686 i386 GNU/Linux

I think nagios can handle this case better by:

  1) Trapping the SIGXFSZ signal so it doesn't exit
  2) Logging an error to nagios.log
  3) Closing and reopening (or scheduling a close and reopen of)
     host_perfdata_file and service_perfdata_file. This allows the user
     to rotate the files on command, or to re-enable perfdata logging by
     moving the files aside and having nagios recreate them.

Option 3 is kind of a hack, but there is currently no signal that closes
and reopens the output files (host_perfdata_file, service_perfdata_file)
without resetting all of the nagios daemon's internal state.  With 3
implemented, it is possible to rotate these files on user demand without
resetting nagios's internal state (the current scheduled services queue,
for example).

Alternatively, the log rotation mechanism currently available for the
main log file (nagios.log) could be extended to automatically rotate
and archive these files. I would be happy if all the files were
rotated/archived on the same schedule as the main log file, but people
will probably want the following options in nagios.cfg:

  host_perfdata_rotation_method, service_perfdata_rotation_method:
     no rotation, hourly, daily, weekly, monthly.

  host_perfdata_archive_path, service_perfdata_archive_path:
    move host_perfdata_file, service_perfdata_file to the archive
    directory with a timestamped extension similar to nagios log file.
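
A rough sketch of the archive step, assuming a MM-DD-YYYY-HH suffix
modeled on the archived nagios log names. The functions
build_perfdata_archive_name and rotate_perfdata_file are hypothetical,
as is the exact suffix format; none of this exists in nagios 2.9.

```c
#include <stdio.h>
#include <time.h>

/*
 * Build "<archive_dir>/<base>-MM-DD-YYYY-HH" into out.
 * Returns 0 on success, -1 if the name does not fit in outlen bytes.
 */
int build_perfdata_archive_name(char *out, size_t outlen,
                                const char *archive_dir,
                                const char *base,
                                const struct tm *t)
{
    int n = snprintf(out, outlen, "%s/%s-%02d-%02d-%d-%02d",
                     archive_dir, base,
                     t->tm_mon + 1, t->tm_mday,
                     t->tm_year + 1900, t->tm_hour);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Move current_path into archive_dir under a timestamped name.
 * rename() only works within one filesystem, so the archive
 * directory is assumed to live on the same one as the perfdata file.
 * Returns 0 on success, -1 on failure.
 */
int rotate_perfdata_file(const char *current_path,
                         const char *archive_dir,
                         const char *base, time_t now)
{
    char dest[1024];
    struct tm t;

    if (localtime_r(&now, &t) == NULL)
        return -1;
    if (build_perfdata_archive_name(dest, sizeof(dest),
                                    archive_dir, base, &t) != 0)
        return -1;
    return rename(current_path, dest);
}
```

After the rename, the daemon would still have to close and reopen its
stream so that new perfdata lands in a fresh file.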

Now this does bring up an interesting question: does anybody have a
status.dat or retention.dat (or less likely comments.dat or
downtime.dat) file that is approaching 2GB? What will happen to nagios
when this limit is hit?

As an alternative, nagios could take the performance hit and use the
64-bit file-access and file-locking system calls instead of the
regular calls for the files where this is liable to be an issue. Hmm,
can you mix 32-bit and 64-bit file I/O in a single program?
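
For reference, 2147483647 is 2^31 - 1, the largest offset a signed
32-bit off_t can hold. On glibc the usual way to get 64-bit file
offsets in a 32-bit build is to define _FILE_OFFSET_BITS as 64 before
any header is included, which switches off_t, fopen(), ftello() and
friends for the whole compilation unit. The Large File Support
interfaces also provide explicit variants (open64(), fopen64(),
lseek64()) under _LARGEFILE64_SOURCE, which can sit next to the regular
calls in one program. The sketch below uses the first approach;
large_files_enabled and append_perfdata_line are hypothetical helpers,
not nagios functions.

```c
/* Must precede every #include so glibc selects the 64-bit interfaces. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

/* Returns 1 when off_t can represent offsets beyond 2^31 - 1. */
int large_files_enabled(void)
{
    return sizeof(off_t) >= 8;
}

/*
 * Append one line to path and return the resulting end-of-file
 * offset, or -1 on error. With a 64-bit off_t the offset can grow
 * past 2GB without the write being refused.
 */
off_t append_perfdata_line(const char *path, const char *line)
{
    FILE *fp = fopen(path, "a");
    off_t end;

    if (fp == NULL)
        return -1;
    if (fputs(line, fp) == EOF || fflush(fp) == EOF) {
        fclose(fp);
        return -1;
    }
    end = ftello(fp);   /* position after the flushed append */
    if (fclose(fp) == EOF)
        return -1;
    return end;
}
```

Whether this is acceptable for nagios depends on every object file
that shares these file handles or off_t values being built with the
same setting.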

Since nagios exited on the signal, I just moved the service perfdata
file aside and restarted nagios to get it operating again.

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
