One cause of the 'Internal Server Errors' with nagios 2.3
Bill Ryder
bill.ryder.nz at gmail.com
Wed May 17 04:03:40 CEST 2006
Hi All,
(First of all - hope the moderator hasn't approved my previous post on
this topic before I resend this one!).
This may or may not apply to anyone having Internal Server Errors but
it certainly fixed my nagios.
Summary:
=======
Make sure your status.dat file (config variable status_file) is on the
same filesystem as your temporary file (config variable temp_file).
In fact I think nagios should enforce this.
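A quick way to check this is to compare the device IDs of the two directories. The following is only a sketch in Python, not anything nagios ships; the function name same_filesystem is made up for illustration, and you would pass in the status_file and temp_file values from your own nagios.cfg.

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """Return True if the directories holding both paths share a device ID."""
    # Compare the parent directories rather than the files themselves,
    # because the temp file may not exist at the moment of the check.
    dir_a = os.path.dirname(os.path.abspath(path_a))
    dir_b = os.path.dirname(os.path.abspath(path_b))
    return os.stat(dir_a).st_dev == os.stat(dir_b).st_dev


if __name__ == "__main__":
    # Illustrative paths only: both files live in one scratch directory,
    # so this prints True. Use your real status_file and temp_file instead.
    with tempfile.TemporaryDirectory() as workdir:
        status_file = os.path.join(workdir, "status.dat")
        temp_file = os.path.join(workdir, "nagios.tmp")
        print(same_filesystem(status_file, temp_file))
```

If this returns False for your two paths, rename() between them will fail with EXDEV and you are exposed to the problem described here.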
If the two files are on different filesystems and you have a lot of
services and/or hosts you will get this problem intermittently.
Essentially the status.dat file changes underneath the CGIs, many of
which mmap it.
If you are only monitoring a few hosts you'll probably get lucky
because the copies take a very short time and hence the window for
the fault to occur is very small.
Perhaps nagios's my_rename function should, when the filesystems are
different, copy the file to a temp name in the destination directory
and then rename that copy over the old file.
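To make that suggestion concrete, here is a rough sketch of the idea in Python. It is not nagios's my_rename (which is C) and not a patch; the function name rename_across_filesystems and the ".staged." prefix are invented for illustration, and the 0644 mode is simply the one visible in the open() calls in the long version of this post.

```python
import errno
import os
import shutil
import tempfile


def rename_across_filesystems(src, dst):
    """Move src over dst without ever exposing a half-written dst."""
    try:
        os.rename(src, dst)  # atomic when both paths share a filesystem
        return
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    # Cross-device case: stage a full copy next to the destination first.
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fd, staged = tempfile.mkstemp(prefix=".staged.", dir=dst_dir)
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())
        # mkstemp creates the file 0600; assume readers need 0644.
        os.chmod(staged, 0o644)
        # Same filesystem now, so this rename replaces dst atomically.
        os.rename(staged, dst)
    except BaseException:
        os.unlink(staged)
        raise
    os.unlink(src)
```

The point of the design is that dst is only ever replaced by a rename within one filesystem, so a process that already has the old file open or mapped keeps seeing the old, complete file.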
Long version:
=========
At Weta Digital I have just started using Nagios to monitor our
renderwall (currently around 1,500 machines - 9,000 services). We've
been using nagios for our production servers for years.
I was getting the 'Internal Server Error' quite often.
I could easily reproduce the problem by running status.cgi from a
debug script which looks like:
#!/bin/sh
REQUEST_METHOD="GET"
QUERY_STRING='host=all&servicestatustypes=28'
export QUERY_STRING REQUEST_METHOD
gdb ./status.cgi
I only had to run it about 10-20 times to get a crash like this:
Program received signal SIGBUS, Bus error.
mmap_fgets (temp_mmapfile=0x8600c68) at cgiutils.c:1195
1195 if(*(char *)(temp_mmapfile->mmap_buf+x)=='\n')
(gdb) p *temp_mmapfile
$3 = {path = 0x8074050 "/var/tmp/nagios_ramdisk/status.dat", mode =
1668573559, fd = 7, file_size = 11087473, current_position = 1570370,
current_line = 69104, mmap_buf = 0xb737c000}
(gdb)
At this point the file had changed size - in other words the file
changed under the mmap - which is a recipe for SIGBUS (touching a
mapped page that now lies past the end of the file raises it).
I then spent some time trying some different mmap options and thinking
about clever solutions to this and then decided I needed to figure out
exactly what the nagios core does with the status.dat file. (Which I
should have done to start with of course :-).
This is what I found:
{106} # strace -e trace=file -p 28007 |& grep status.dat
rename("/var/cache/nagios2/nagios.tmp18dGu2", "/var/tmp/nagios_ramdisk/status.dat") = -1 EXDEV (Invalid cross-device link)
open("/var/tmp/nagios_ramdisk/status.dat", O_WRONLY|O_APPEND|O_CREAT|O_TRUNC|O_LARGEFILE, 0644) = 10
rename("/var/cache/nagios2/nagios.tmplQrHE5", "/var/tmp/nagios_ramdisk/status.dat") = -1 EXDEV (Invalid cross-device link)
open("/var/tmp/nagios_ramdisk/status.dat", O_WRONLY|O_APPEND|O_CREAT|O_TRUNC|O_LARGEFILE, 0644) = 10
rename("/var/cache/nagios2/nagios.tmpACHJEk", "/var/tmp/nagios_ramdisk/status.dat") = -1 EXDEV (Invalid cross-device link)
open("/var/tmp/nagios_ramdisk/status.dat", O_WRONLY|O_APPEND|O_CREAT|O_TRUNC|O_LARGEFILE, 0644) = 10
At this point it was obvious.
I had put status.dat on a ramdisk for performance reasons but didn't
move the temp_file. So nagios was creating the new status.dat (called
nagios.tmpXXXXXX) in a different filesystem. The rename failed with
EXDEV and nagios fell back to copying the file between filesystems,
truncating the existing status.dat (note the O_TRUNC) and rewriting it
in place. This was causing the SIGBUS because the file shrank
underneath the CGIs that had it mmap'ed.
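For anyone wondering why a same-filesystem rename is safe when the copy is not, the small Python sketch below shows the difference on a POSIX system. It is an illustration only, with made-up names (mapped_view_after_replace, and scratch files in a temporary directory standing in for status.dat and the temp file): a reader that mapped the old file keeps seeing the old contents after the rename, because the rename swaps the directory entry and leaves the old file's data untouched.

```python
import mmap
import os
import tempfile


def mapped_view_after_replace(old_data, new_data):
    """Map a file, replace it by rename, return (mapped bytes, bytes on disk)."""
    with tempfile.TemporaryDirectory() as workdir:
        status = os.path.join(workdir, "status.dat")
        staged = os.path.join(workdir, "nagios.tmp")
        with open(status, "wb") as f:
            f.write(old_data)
        # Reader side: map the current file, as the CGIs do.
        with open(status, "rb") as f:
            view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Writer side: build the new file alongside, then rename it over.
        with open(staged, "wb") as f:
            f.write(new_data)
        os.rename(staged, status)
        try:
            seen_by_reader = view[:]  # still the old file's contents
        finally:
            view.close()
        with open(status, "rb") as f:
            on_disk = f.read()  # a fresh open sees the new file
    return seen_by_reader, on_disk
```

Calling mapped_view_after_replace(b"old status\n", b"new\n") returns the old bytes for the mapped view and the new bytes for a fresh read. With the truncate-and-copy fallback there is no such separation, since the writer modifies the very file the reader has mapped.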
I don't have these faults any more now that both files are on the same
filesystem.
Hope this helps someone!
Bill Ryder
System Engineer
Weta Digital