Problem with OCP_daemon in distributed environment
Craig Stewart
Craig.Stewart at corp.xplornet.com
Fri Aug 19 14:22:38 CEST 2011
Michel,
The external command buffers might be involved here. My setup has
four remote probes submitting over 12000 service checks, plus an
SNMP trap server that also submits to Nagios with the send_nsca command.
I plot the performance of the central and probes as per this article:
http://nagios.sourceforge.net/docs/3_0/mrtggraphs.html
In the last 24 hours I used a maximum of 160 external command slots.
I'm not sure what your setup uses, but you could be starving the buffer,
although I didn't think that would affect the execution of local checks.
I would say, if you aren't already, it would be worth graphing your
system's performance as per the above article. Might not solve your
problem, but it would help point you in the right direction.
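
As a rough illustration of what to watch on those graphs, here is a minimal Python sketch that turns external command buffer counts into a utilization figure. It assumes the three numbers come from something like `nagiostats --mrtg --data=TOTCMDBUF,USEDCMDBUF,HIGHCMDBUF`, printed one per line in that order. The function names and the 80% warning threshold are hypothetical, chosen only for illustration. The sample values combine the 4096 available slots and the 160-slot maximum mentioned in this thread.

```python
def parse_cmdbuf_stats(mrtg_output):
    """Parse total, used and high-water slot counts from three whitespace-separated integers."""
    values = [int(token) for token in mrtg_output.split()]
    if len(values) != 3:
        raise ValueError("expected exactly three values: total, used, high")
    total, used, high = values
    return {"total": total, "used": used, "high": high}


def utilization_percent(slots, total):
    """Return the share of buffer slots in use, as a percentage."""
    if total <= 0:
        raise ValueError("total slot count must be positive")
    return 100.0 * slots / total


def is_starved(stats, threshold_percent=80.0):
    """Flag the buffer when its high-water mark reaches the (assumed) threshold."""
    return utilization_percent(stats["high"], stats["total"]) >= threshold_percent


if __name__ == "__main__":
    # Illustrative sample only: 4096 slots available, 160 used at peak.
    sample = "4096\n160\n160\n"
    stats = parse_cmdbuf_stats(sample)
    print("peak utilization: %.2f%%" % utilization_percent(stats["high"], stats["total"]))
    print("starved:", is_starved(stats))
```

A peak well below the total suggests the buffer itself is not the bottleneck; a peak close to the total would be a reason to look at the buffer size or at how often the command file is read.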
Good luck, and let the rest of us know what happens. I'm sure you're
not the only person that's seen this behaviour.
Craig
--
Craig Stewart
Systems Integration Analyst
Craig.Stewart at corp.xplornet.com
Xplornet - Broadband, Everywhere
On 08/17/2011 03:29 PM, Michel van der Voort wrote:
> Hello again Craig,
>
> Thanks once more and yes you're making sense.
> But this is also the reason why I can't pinpoint what's going wrong. It
> should simply work, especially because nothing changes on the central
> server except perhaps the number of check results from the OCP_daemon
> machine.
> We do indeed have a number of remote machines, all running their checks
> locally and sending them to our central server via nsca/nscad, and so did
> (and now does again) the one I tried to configure with OCP_daemon.
> I know of course that OCP_daemon uses the underlying (and unchanged)
> send_nsca config and binary, and that all works well.
> The latencies I had before on this OCP_daemon machine made me experiment
> with OCP, because a lag of about 10 - 15 minutes before messages and
> performance data appeared on the central server was unacceptable.
> This also caused WARNINGs on certain checks on the OCP machine, because
> these have to run frequently enough to avoid counter overflows on 64-bit
> counters for some network devices we monitor.
>
> Also, debugging nsca on the central receiving end shows everything working
> fine: data is coming in from the OCP_daemon machine as well as from the
> other remote machines still using nsca/nscad.
> The only thing that stops working is the central server's own ACTIVE
> checks, which no other Nagios machine even knows about.
>
> I guess I have to do even more research and debugging on the central server.
> I would expect that machine to show high CPU, memory or I/O usage if
> something like data coming in too fast were the issue, but there are no
> such indications.
> It is just that after 2 hours all local checks have a last execution
> timestamp of 2 hours ago, and the check_ processes really don't get fired
> anymore.
> I've switched back to standard obsessing with send_nsca on the
> OCP_daemon machine and restarted nagios on the central server, and
> everything's working again, but unfortunately with high check intervals and
> latency on the OCP_daemon machine.
> The only thing I noticed was the higher number of buffer slots used on the
> external command file where process_perfdata reads from and nscad writes to.
>
> For now, thanks a lot.
> I'm not really familiar with closing a topic on the Nagios Users List but I
> will try to.
> Also, if and when I find out more I will inform you.
>
> Best regards,
>
> Michel
>
> -----Original Message-----
> From: Craig Stewart [mailto:Craig.Stewart at corp.xplornet.com]
> Sent: Wednesday, 17 August 2011 14:10
> To: Nagios Users List
> CC: michel.vdv at wxs.nl
> Subject: Re: [Nagios-users] Problem with OCP_daemon in
> distributed environment
>
> Michel,
>
> Okay, I understand now.
>
> So, if I get this correctly, when you were using the obsessing method,
> everything was working fine from the central server's point of view, but
> when you moved one remote unit from the obsessing to the OCP_daemon, the
> central server stopped doing all active checks?
>
> The way I have it set up here for my central/probe configuration is that
> the central server accepts passive checks through the nscad process. On
> my remote servers they send in either via the OCP_daemon (which calls
> send_nsca) or a custom obsess script. There are no changes to my
> central server.
>
> So, unless you are doing something strange, you should be able to get it
> going and executing active checks as well as accepting passive checks on
> the central. The method the probe uses, as long as it's consistent with
> the way the central server picks up checks (send_nsca/nscad in my case),
> is independent of the central server. If you get this working,
> switching the probe from the obsess method to the OCP_daemon method
> should not affect the central server, or even require a restart.
>
> Am I making any sense here or have I confused the issue?
>
> Craig
> --
> Craig Stewart
> Systems Integration Analyst
> Craig.Stewart at corp.xplornet.com
> Xplornet - Broadband, Everywhere
>
> On 08/16/2011 05:02 PM, michel.vdv at wxs.nl wrote:
>> Hello Craig,
>>
>> First of all thanks for the fast response.
>> Maybe I need to clarify a bit more why ACTIVE checks are
>> happening on the central server.
>> We have a distributed setup with a central machine in the DMZ, reachable
>> by all the remote nagios machines we have out there.
>> One of those is the LAN machine I mentioned, where OCP_daemon was set up
>> today.
>> The central Nagios machine in the DMZ should/must perform active checks of
>> all our equipment in the same DMZ; the other hosts only send passive data.
>> The DMZ machine cannot perform ACTIVE checks on the services monitored
>> by 1 or more of the remote machines.
>> So, this is why there is a problem when the central server does not
>> perform its own checks.
>>
>> I've been experimenting with reaper frequencies on the central server
>> because I saw reaper frequency exceeded messages in the nagios.debug
>> (-1) output.
>> These no longer appear, but the result is still the same.
>> I also lowered the frequency of all template-related check_intervals on
>> the OCP_daemon remote machine, but that does not help either.
>>
>> If you have any more suggestions, please let me know.
>>
>> Regards,
>>
>> Michel
>> ------------------------------------------------------------------------
>> *From:* Craig Stewart [mailto:Craig.Stewart at corp.xplornet.com]
>> *Sent:* Tue 16-8-2011 21:47
>> *To:* Nagios Users List
>> *CC:* michel.vdv at wxs.nl
>> *Subject:* Re: [Nagios-users] Problem with OCP_daemon in distributed
>> environment
>>
>> Michel,
>>
>> I just did the same thing for my set up and I didn't see this issue.
>> That being said, I don't *want* the central master to execute service
>> checks at all unless it's stale.
>>
>> What may be happening is that the remote passive check may be getting
>> inserted while the system is waiting to execute the next check. This is
>> probably resetting the clock as it were and the count down starts over.
>>
>> For example:
>>
>> - NOW is an arbitrary point in time.
>> - Nagios schedules the check to be executed at NOW + 5 min. (recheck
>> interval)
>> - The passive check comes in at NOW + 3 min. Nagios resets the clock to
>> NOW + 3 min + check interval.
>>
>> If the remote is submitting checks at an interval shorter than the
>> central's recheck interval, I can see this happening. The clock never
>> runs out, unless the remote system doesn't submit a check.
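
The hypothesis above can be sketched as a small Python simulation. It models only the behaviour described in the example (a passive result pushes the next active check out to arrival time plus the check interval) and has not been verified against the actual Nagios scheduler. The function name and the minute-based time steps are assumptions made for illustration.

```python
def simulate_active_checks(check_interval, passive_period, horizon):
    """Return the minutes at which an active check would run under the hypothesis.

    check_interval: minutes between active checks on the central server.
    passive_period: minutes between passive results from the probe,
                    or None if the probe submits nothing.
    horizon:        number of minutes to simulate, starting at NOW = 0.
    """
    next_active = check_interval
    next_passive = passive_period
    executed = []
    for minute in range(1, horizon + 1):
        if next_passive is not None and minute == next_passive:
            # A passive result arrives: the countdown starts over.
            next_active = minute + check_interval
            next_passive += passive_period
        elif minute == next_active:
            # The countdown ran out: the active check executes.
            executed.append(minute)
            next_active = minute + check_interval
    return executed


if __name__ == "__main__":
    # Probe submits every 3 minutes, central rechecks every 5 minutes.
    print(simulate_active_checks(5, 3, 60))
    # Probe shut down: the central server runs its own checks.
    print(simulate_active_checks(5, None, 20))
```

In this model, a probe that submits more often than the central's check interval keeps the active check from ever firing, which is the situation the two suggested tests below are meant to confirm or rule out.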
>>
>> A couple of things to check are the check intervals on both the central and
>> the probe, and if you can tolerate the hit, shut down the probe and see
>> if the central server starts executing checks on its own.
>>
>> I may be out in left field as well.
>>
>> Cheers!
>>
>> Craig
>> --
>> Craig Stewart
>> Systems Integration Analyst
>> Craig.Stewart at corp.xplornet.com
>> Xplornet - Broadband, Everywhere
>>
>> On 08/16/2011 04:22 PM, michel.vdv at wxs.nl wrote:
>>> Dear readers,
>>>
>>> I have a strange problem related to the use of OCP_daemon.
>>> I've implemented this today on a "remote" nagios machine responsible for
>>> monitoring our LAN hosts.
>>> Until now all messages and performance data were sent from that machine
>>> to our Central Nagios machine via obsess_over_hosts and
>>> obsess_over_services.
>>> But because a lot of services on the remote host, combined with relatively
>>> short check_interval periods, caused high service and host check
>>> latencies, I started looking for an alternative and read about OCP_daemon.
>>> I followed the install instructions, and sending data via OCP_daemon
>>> works fine and very fast; the remote nagios machine's latencies also
>>> stay low.
>>> However, the central server keeps processing all passive service and
>>> host check results (also from other send_nsca based hosts) but no longer
>>> executes its own ACTIVE checks.
>>> As soon as I stop nagios on the remote monitor and restart nagios on the
>>> central server, it starts executing ACTIVE checks again.
>>> The load on both servers has remained about the same since OCP_daemon, and
>>> the only thing I noticed is that the number of buffers/slots used for
>>> the external command file (nagios.cmd) on the central server
>>> reaches rather higher values than before, but no more than 30 - 40% of
>>> the available 4096 slots.
>>>
>>> Please advise.
>>>
>>> Michel
>>>
>>>
>>> --
>>> This message has been scanned for viruses and
>>> dangerous content by *MailScanner* <http://www.mailscanner.info/>, and is
>>> believed to be clean.