18000 services to check and Nagios just sits and waits.
Surya Rahmadiansyah
surya at divre1.telkom.co.id
Thu Dec 11 09:12:39 CET 2003
Hi, Martin
I would like to know: are you running Nagios to monitor 900 hosts in one
map? Is it possible to make several maps (status.cgi) with one Nagios
process?
Thanks
-- Surya Rahmadiasnyah
At 12:00 PM 12/10/2003 -0600, Ric Moseley wrote:
>-----Original Message-----
>From: nagios-users-admin at lists.sourceforge.net
>[mailto:nagios-users-admin at lists.sourceforge.net] On Behalf Of Marc Powell
>Sent: Wednesday, December 10, 2003 9:35 AM
>To: martin at idefix.net; nagios-users at lists.sourceforge.net
>Subject: RE: [Nagios-users] 18000 services to check and Nagios just sits and
>waits.
>
>Congratulations. You have the largest single-host installation of nagios
>that I have heard of. You haven't given any hard information about how
>you have nagios configured so we're going to have to make a lot of
>guesses until you provide specific information. Your host and service
>check definitions as well as many options in nagios.cfg can have a
>profound effect on the performance of the program. Here are some
>suggestions from my personal experience which may or may not be
>redundant for you:
>
>Nagios.cfg -
> - command_check_interval=-1 (may have no effect in your setup)
> - max_concurrent_checks=xxx (You should run '/path/to/nagios -s
>/path/to/nagios.cfg' for a lower estimate on this number. Increasing it
>will not hurt, up to a point.)
> - service_reaper_frequency=2 (or 1 if you want, I'd start with
>2)
> - use_aggressive_host_checking=0
> - aggregate_status_updates=1
> - status_update_interval=xxx (I suggest at least 60). This one
>may actually be getting you. With 18000 services, it's going to take
>some time to update the status in the db, even if it is in ram. Nagios
>does a delete of the status tables, then an insert of the new
>information. If you have the interval set at 30 seconds and the process
>takes 29 seconds, that's all that nagios will be doing or it will only
>have 30 seconds to process several hundred or thousand results.
> - inter_check_delay_method. When using the smart option nagios
>will try to spread out your checks so that they all fit in your average
>check interval. If you don't have max_concurrent_checks set high enough
>or the service_reaper_frequency set low enough to allow this to happen
>the initial checks can get spread over a significant period of time.
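
To put rough numbers on the status_update_interval point above, here is a minimal Python sketch of how much of each interval goes to rewriting the status tables. The 29 second rewrite and 30 second interval are the figures from the example above; the longer intervals are picked only for illustration, and the function name is made up for this sketch.

```python
def status_update_duty_cycle(update_seconds: float, interval_seconds: float) -> float:
    """Return the fraction of each interval spent rewriting status data.

    The result is capped at 1.0: if the rewrite takes longer than the
    interval, the process is effectively always busy updating status.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return min(update_seconds / interval_seconds, 1.0)


# Figures from the example in the text: a 29 s rewrite every 30 s.
busy = status_update_duty_cycle(29, 30)
print(f"interval=30s: {busy:.0%} of the time spent on status updates")

# The same 29 s rewrite with longer intervals (illustrative values only).
for interval in (60, 120, 300):
    busy = status_update_duty_cycle(29, interval)
    print(f"interval={interval}s: {busy:.0%} of the time spent on status updates")
```

The rewrite time itself would have to be measured on the real system; the sketch only shows why a longer interval leaves more room for processing check results.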
>
>Host and service checks -
> - Use very simple host checks, single pings for example, with no
>retry or disable host checks entirely. If any service returns a state
>other than OK, nagios will aggressively check the status of the host and
>stop doing everything else until max_retries has been reached on the
>host check.
> - Use a sane check_interval. Don't expect nagios to be able to
>complete 18,000 checks at 1 minute intervals.
> - Your custom plugin should be written in C or if it's perl you
>should use the ePN. If it's written in perl it can be very expensive
>without ePN as you need to launch a copy of perl every check as well.
> - If you're utilizing parenting or service dependencies these
>may be problematic with large numbers of hosts/services (just guessing).
> - I'm not a programmer, but I don't believe that just because
>Linux understands hyperthreading a program will take advantage of
>it.
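
The check_interval advice above can be made concrete with some back-of-the-envelope arithmetic. This is only a sketch: the 18,000 services and the 1 minute interval come from this thread, while the 0.5 second average plugin runtime and the 5 minute interval are assumptions chosen for illustration. The function names are hypothetical and not part of Nagios.

```python
import math


def required_checks_per_second(num_services: int, check_interval_s: float) -> float:
    """Average number of checks that must start each second."""
    return num_services / check_interval_s


def required_concurrency(num_services: int, check_interval_s: float,
                         avg_check_runtime_s: float) -> int:
    """Estimate how many checks are in flight at once.

    Uses Little's law: in-flight work = arrival rate * time per item.
    """
    rate = required_checks_per_second(num_services, check_interval_s)
    return math.ceil(rate * avg_check_runtime_s)


SERVICES = 18_000          # from the original post
AVG_RUNTIME_S = 0.5        # assumption: measure your own plugin

for interval_s in (60, 300):  # 1 minute (from the text) and 5 minutes (assumed)
    rate = required_checks_per_second(SERVICES, interval_s)
    in_flight = required_concurrency(SERVICES, interval_s, AVG_RUNTIME_S)
    print(f"interval={interval_s}s: {rate:.0f} checks/s, "
          f"about {in_flight} concurrent checks")
```

An estimate like this gives a starting point for comparing against max_concurrent_checks, but it ignores host checks, retries, and the time spent reaping results.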
>
>Ulimits - by default, Redhat linux 7.3 only allows a user to have 1024
>open files, a stack size of 8192 kbytes and 7168 concurrent processes.
>You may need to adjust these once you get things going.
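
One way to see the limits that actually apply is to query them from a process running as the Nagios user. The following is a small sketch using Python's standard resource module (Unix only). It reports whatever limits the current process has, so it needs to be run under the same account and environment as Nagios to be meaningful; the helper names are made up for this example.

```python
import resource

# Limits mentioned in the text, mapped to their resource constants.
LIMITS = {
    "open files": resource.RLIMIT_NOFILE,
    "stack size (bytes)": resource.RLIMIT_STACK,
    "max user processes": resource.RLIMIT_NPROC,
}


def current_limits() -> dict:
    """Return {label: (soft, hard)} for the limits of this process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


def fmt(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
```

Note that getrlimit reports the stack size in bytes, whereas the shell's ulimit shows it in kbytes.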
>
>Presuming that you don't make any significant changes based on the
>suggestions above, is there anything in nagios.log that might indicate a
>problem? Have you tried running strace on any of the nagios processes to
>find out exactly what they are doing?
>
>Finally, since you've created your own front end I presume you've
>realized that nagios pre-2.0 had a hard time with large numbers of hosts
>and services, particularly for the cgis. 2.0 will incorporate several
>changes that reportedly make working with large numbers of hosts and
>services better. YMMV though and as far as I know the enhancements
>mostly benefit the cgis.
>
>--
>Marc
>
> > -----Original Message-----
> > From: martin at idefix.net [mailto:martin at idefix.net]
> > Sent: Wednesday, December 10, 2003 3:55 A
> > To: nagios-users at lists.sourceforge.net
 > > Subject: [Nagios-users] 18000 services to check and Nagios just sits and
 > > waits.
> >
> > Hi all,
> >
> > I'm trying to convince Nagios it should perform very aggressively
> > but somehow it won't work.
> > When reading the documentation it states everywhere that Nagios
> > will consume all CPU power you throw at it if you don't take care.
> > Well, with me it doesn't and I really want it to.
> >
> > The situation:
 > > - All our machines send some email to the Nagios server, which
 > > we put in files, and we wrote a plugin to check those files.
> >
> > - There are a lot of machines (almost 900) and we want to do a lot
> > of checks (18000).
> >
 > > - To make it worse, we forced Nagios to use MySQL for the
 > > service_status and host_status data (as we created our own frontend
 > > and use MySQL as the interface).
> >
 > > To make sure Nagios will be able to abuse the hardware as much as it
 > > can, we threw in a dual Xeon 3 GHz machine with 2GB memory and some 15k RPM
 > > SCSI disks. To make it better, Linux understands hyperthreading and
 > > makes it a total of 4 CPUs.
 > > To prevent MySQL from abusing the array controller too much, we made the
 > > service_status and host_status tables HEAP so they only use memory.
> >
> > I would assume that Nagios would at least try to fork something like
> > 40 to 100 processes and would consume at least one CPU but it doesn't.
> > It won't abuse the memory either as there is about 1GB of memory left.
> >
 > > It only seems to be sitting there with 4 to 6 processes and allowing
 > > the latency to go up and up like there's no tomorrow. Or at least there
 > > won't be any checks tomorrow.
> >
 > > We've tried both the smart Nagios options and the dumb options, and
 > > even tried to think for ourselves and calculate the right config values,
 > > but nothing seems to work.
>
>
>
>-------------------------------------------------------
>This SF.net email is sponsored by: SF.net Giveback Program.
>Does SourceForge.net help you be more productive? Does it
>help you create better code? SHARE THE LOVE, and help us help
>YOU! Click Here: http://sourceforge.net/donate/
>_______________________________________________
>Nagios-users mailing list
>Nagios-users at lists.sourceforge.net
>https://lists.sourceforge.net/lists/listinfo/nagios-users
>::: Please include Nagios version, plugin version (-v) and OS when reporting
>any issue.
>::: Messages without supporting info will risk being sent to /dev/null