Nagios World Conference
Andreas Ericsson
ae at op5.se
Wed Oct 5 16:46:31 CEST 2011
Hi all. I attended the Nagios World Conference North America last week
and thought I'd dish out some kudos where they are due, and also condense
the information for any newcomers who might get lucky when looking
for solutions to their particular problems.
Overall, the standard of the conference was very, very high. It was the
first Nagios conference I've gone to where I learned something new. A
rare occasion indeed, so many thanks to Ethan, Mary and Nagios Enterprises
for arranging such a high-quality event. I won't mention their talks,
since I don't want to inflate their egos too much, but check out the one
on visualizations by Mike Guthrie. Pretty cool stuff :)
Much of the focus was on scaling up Nagios. mod_gearman and livestatus
seem to be the best-known and most-used projects for achieving that goal.
Reading status files is just too slow when viewing the UI, and a single
server just doesn't scale to enough checks (yet). DNX also seemed very
well investigated and used in some places, although a documentation
mishap seems to have led many potential users away from it. For those
wondering, DNX can indeed distribute checks to workers based on
hostgroups, just as mod_gearman can. It's just not well documented.
LivestatusSlave also got a lot of interest, although it didn't seem to
be as widely used as any of the other three.
Kudos to Sven Nierlein (mod_gearman author/maintainer), Mathias Kettner
(mk_livestatus author/maintainer) and Lars Michelsen (LivestatusSlave
author). Your stuff is being used in production for positively *huge*
installs, so well done guys :) I sure hope you go to the conference
next year so you can talk about future development and gather even more
interest for your projects.
Merlin wasn't much discussed, although the DNX maintainers (and I)
recommend it as the only sane way to get redundancy and automagic
load balancing. That's probably because of the misconception that you're
required to run a separate UI and a fork of Nagios when using it. At
least that's what my slightly hurt ego wants to believe ;)
The general tip for running large installations is to offload the various
spool directories to ramdisk, along with status.dat and objects.cache
(since they're read quite frequently). Work is under way to make that
unnecessary by simply getting rid of disk I/O as much as possible. There
was pretty much universal head-nodding when these tips were repeated in
one talk after another, so it seems the attending part of the Nagios
community has reached consensus that that's the best way to do it.
Mounting all disks with the noatime option is also a very good tip
that'll get your disk write operations (the slow ones) down to a
fraction of what they were before you turned that option on.
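For anyone who wants to check where they stand on the ramdisk and
noatime tips, here is a minimal Python sketch. It assumes a Linux box
where /proc/mounts is available, and the Nagios paths in it are
hypothetical placeholders (take the real ones from your own nagios.cfg).
It only reports what it finds; it doesn't change any mounts.

```python
import os

RAM_FILESYSTEMS = ("tmpfs", "ramfs")


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type, options)."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mounts.append((fields[1], fields[2], set(fields[3].split(","))))
    return mounts


def mount_for(path, mounts):
    """Return the entry whose mount point is the longest prefix of path."""
    best = None
    for entry in mounts:
        mount_point = entry[0]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = entry
    return best


def audit(paths, mounts):
    """Report, per path, whether it sits on a ramdisk and has noatime set."""
    report = {}
    for path in paths:
        entry = mount_for(path, mounts)
        if entry is None:
            report[path] = None  # no matching mount found
            continue
        _, fs_type, options = entry
        report[path] = {
            "ramdisk": fs_type in RAM_FILESYSTEMS,
            "noatime": "noatime" in options,
        }
    return report


if __name__ == "__main__":
    # Hypothetical locations; adjust to match your installation.
    nagios_paths = [
        "/var/nagios/spool/checkresults",
        "/var/nagios/status.dat",
        "/var/nagios/objects.cache",
    ]
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as handle:
            result = audit(nagios_paths, parse_mounts(handle.read()))
        for path, info in result.items():
            print(path, info)
```

Paths are matched against mount points by longest prefix, so a tmpfs
mounted on the spool directory is picked up even when the parent
directory lives on an ordinary disk.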
Many people have big headaches getting various graphing solutions to
scale properly. Some resorted to using Fusion-io cards with exabyte
performance (quite expensive...), since using ramdisk to store the
tens or hundreds of gigabytes of rrd-files generated in large installs
isn't really an option. It would be nice to hear Joerg Linge's (author of
PNP4Nagios) take on other paths to increase performance next year. It
seems his project is the most widely used for graphing, so getting it to
perform exceptionally well would be time well spent.
Apart from that, there were plenty of other good presentations and very
awesome drinking^H^H^H^H^H^H^H mingle sessions. I highly recommend you
attend next year if you're managing a Nagios install at $dayjob, or
if you're working on a Nagios addon project and want to get immediate
feedback on what users are looking for.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.