Alleviating Nagios i/o contention problem

Frost, Mark {PBC} mark.frost1 at pepsico.com
Mon Sep 27 14:45:39 CEST 2010



> -----Original Message-----
> From: Marc Powell [mailto:lists at xodus.org]
> Sent: Sunday, September 26, 2010 11:27 AM
> To: Nagios Users List
> Subject: Re: [Nagios-users] Alleviating Nagios i/o contention problem
> 
> 
> On Sep 25, 2010, at 10:53 AM, Max wrote:
> 
> > I like the suggestions Matthias makes; those suggestions have worked
> > well for us.
> >
> > RRD updates are very expensive - I am pretty sure without knowing
> > anything more about your system that the RRD writes are causing most
> > of the I/O load.
> 
> I no longer have access to this system but my experience has been
> otherwise. We were running a nagios install with nearly 10,000 services
> received by external pollers every 5 minutes, and a cricket install on
> the same machine polling/updating 100,000+ rrd files during the same
> interval. This was on a Poweredge 6850, 5 disk RAID-5.
> RRDtool itself writes very little data to disk. I think it's 8 Bytes
> per DS per RRA updated. Linux, though, wants to write 4KB chunks at a
> time so it performs a read-modify-write of 4KB just to update those 8
> Bytes.
> 
> The OP can reduce his IO load particularly for RRD updates and help
> Linux better organize its writes to disk by ensuring that he has
> enough RAM to keep key information for each RRD file in the filesystem
> cache. The OP will need at least 8K * number of rrd files available to
> be used as filesystem buffer cache.
> 
> --
> Marc
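
To put the figures Marc quotes in perspective, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the 8-byte and 4KB numbers are his, while the assumption that each update dirties exactly one block per RRD file (one DS, one RRA) is mine and will not hold for every RRD layout.

```python
# Back-of-the-envelope sketch of the read-modify-write overhead described above.
# Assumptions (not measured): every update dirties exactly one 4KB block
# per RRD file, and each file has a single DS with a single RRA updated.

BYTES_PER_VALUE = 8      # per DS per RRA updated, as quoted above
BLOCK_SIZE = 4096        # the 4KB chunk Linux wants to write
RRD_FILES = 100_000      # size of the cricket install quoted above

# How many times larger the block write is than the data RRDtool changed.
amplification = BLOCK_SIZE // BYTES_PER_VALUE

# Totals for one 5-minute polling interval.
payload_total = RRD_FILES * BYTES_PER_VALUE
block_total = RRD_FILES * BLOCK_SIZE

print(f"amplification per update: {amplification}x")
print(f"payload per interval:     {payload_total / 1e6:.1f} MB")
print(f"block I/O per interval:   {block_total / 1e6:.1f} MB")
```

Under those assumptions the disk sees roughly 512 times more traffic than the data RRDtool actually changes, which is why keeping the relevant blocks in the filesystem cache matters.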



Thanks very much to all who replied (Breandan, Marc, Max and Matthias, this means you! :-) ).

- I can't say exactly how many checks create perfdata (we have a very heterogeneous set of check types).  I can see 9K files in the graph data filesystem which, at two files per check, works out to about 4,500 checks.

- I'm not running updates through syslog.  I don't have root on these machines so that would not be helpful to me.  I will have to double-check, but I don't believe that I have pnp4nagios logging turned on, except maybe at the lowest level.  I don't recall it logging much of anything at that level, but as I say, I'll check.

- According to our performance analysis team, these servers have way more RAM than they're actually using, so I wouldn't think I'm limited by the Linux disk cache here.  Perhaps it's just the hardware we have (the i/o rates on a 3-year-old Dell 2950 with a single RAID 5 set) that makes this particularly bad for us.  Perhaps on faster hardware we'd not even notice.

- I would assume that rrdcached was built for a reason (i.e. this i/o issue was observed at least somewhere), so it's definitely an avenue I want to try out.

- The ramdisk idea is also interesting.  I'm curious, though, about why one would want to rsync status.dat back to the local disk periodically.  It's just a run-time status file, right?  Unless I misread the docs, it goes away when Nagios is shut down.  How would having a local disk copy of status.dat benefit me?  Also, nagios.log isn't written to that often in our case (we don't log passive check results, for example), so I'm not sure we'd see a benefit from putting it on ramdisk.  Although... we do have Splunk watching that file, so that would be some additional read overhead, I guess.
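
For what it's worth, Marc's rule of thumb (8K of buffer cache per RRD file) can be applied to the file counts above with a short Python sketch. The two counts are assumptions: 4,500 if only half of the 9K files are RRDs, 9,000 if every file is one. I haven't verified either figure.

```python
# Rough estimate of the filesystem cache needed to keep key RRD data in RAM,
# using the "8K * number of rrd files" rule of thumb quoted above.
# The file counts are assumptions taken from the 9K files seen on disk.

KIB = 1024
CACHE_PER_RRD = 8 * KIB  # bytes of buffer cache per RRD file (rule of thumb)


def cache_needed_mib(rrd_files: int) -> float:
    """Return the estimated buffer cache requirement in MiB."""
    return rrd_files * CACHE_PER_RRD / (KIB * KIB)


low = cache_needed_mib(4_500)   # assuming half of the 9K files are RRDs
high = cache_needed_mib(9_000)  # assuming every file is an RRD

print(f"estimated cache needed: {low:.1f} to {high:.1f} MiB")
```

Either way the estimate stays well under 100 MiB, which fits with the performance team's view that RAM is not the limiting factor on these servers.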


Thanks!

Mark





