DNS down and false alerts...
Andrew Davis
nccomp at gmail.com
Tue Jun 9 19:26:01 CEST 2009
Hey... I'm the OP. We're using a mix of client tools. For Windows
systems (which aren't affected by this) we use nsclient++. For our Linux
servers, NRPE... for UNIX (Solaris) and OS X we're using check_by_ssh.
Both the NRPE and check_by_ssh clients are affected by this.
I'm willing to give the caching nameserver on the server a try, but as
others have noted, I don't think it will make a difference, as it's the
local test on the client that's failing to resolve. I surely cannot set
up a caching nameserver on all the clients...
A. Davis
Email: nccomp at gmail.com
"There is no limit to what a man can accomplish
if he doesn't care who gets the credit." - Ronald Reagan
Martin Melin wrote:
> I don't know if I'm misreading the OP, but if the plugins start timing
> out on only the boxes whose primary DNS is being rebooted, would
> adding a caching DNS server to the Nagios box really make a difference?
>
> I think the root cause of these timeouts is that the Nagios plugin
> timeout is happening before the connection to the primary DNS on the
> target machine has a chance to time out and then connect to the
> secondary DNS.
>
> The correct course of action to resolve this would be to either make
> sure that the DNS connection on the target machines fails faster, or
> that Nagios/the plugin waits longer for a result from the check. The
> DNS failover is working as designed here, but you're not giving it
> enough time to kick in.
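
To put rough numbers on this, here is a minimal sketch in Python. It
assumes a simplified model in which every lookup on the client first
waits out the full resolver timeout on the dead primary before trying
the secondary. The 5-second resolver timeout is the usual glibc default
(Solaris and OS X may differ); the 10-second plugin timeout, the number
of lookups per check, and the function names are illustrative
assumptions, not values taken from this setup.

```python
def failover_delay(lookups, resolver_timeout_s):
    """Seconds lost waiting on a dead first nameserver.

    Simplified model (an assumption): each lookup waits the full
    resolver timeout once before the second nameserver answers.
    """
    return lookups * resolver_timeout_s


def check_survives(lookups, resolver_timeout_s, plugin_timeout_s, work_s=1.0):
    """True if the check should finish before the plugin timeout.

    work_s is a hypothetical allowance for the check's real work.
    """
    total = failover_delay(lookups, resolver_timeout_s) + work_s
    return total < plugin_timeout_s


# Two lookups per check, 5 s resolver timeout, 10 s plugin timeout (assumed).
print(check_survives(lookups=2, resolver_timeout_s=5, plugin_timeout_s=10))
# Same check with the resolver timeout cut to 1 s.
print(check_survives(lookups=2, resolver_timeout_s=1, plugin_timeout_s=10))
```

Under these assumptions the check only fits inside the plugin timeout
once the resolver gives up on the dead server quickly, or once the
plugin timeout is raised well above the combined lookup delay.
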
>
> On Tue, Jun 9, 2009 at 5:37 PM, Russell Adams
> <RLAdams at adamsinfoserv.com> wrote:
>
> Really, the best choice is to use a caching DNS server on the Nagios
> server. I'd recommend dnsmasq; it just does caching locally without
> needing to do big zone transfers. It has low overhead and simple
> configuration as a result.
>
> Enjoy.
>
> On Tue, Jun 09, 2009 at 11:19:20AM -0400, Andrew Davis wrote:
> > I've observed an interesting issue with Nagios. Our environment is a mix
> > of UNIX, Linux, Apple, and Windows. The core of the network is Active
> > Directory, including two AD servers that are both our primary, internal
> > DNS servers. All non-Windows systems have a resolv.conf that looks like:
> >
> > nameserver 10.1.1.13
> > nameserver 10.1.1.14
> > domain int.our.domain
> > search int.our.domain
> >
> > About half of the servers have the nameserver entries inverted (i.e. .14
> > first, .13 second).
> >
> > The issue is that anytime one of the nameservers is rebooted (at least
> > once a month if staying current on patches, thanks to Black Tuesdays),
> > whichever hosts have that nameserver listed first in their resolv.conf
> > start throwing the following error:
> >
> > CRITICAL - Plugin timed out while executing system call.
> >
> > This occurs for multiple tests for each host. Obviously, there's a name
> > resolution correlation here. If the nameserver with .13 is rebooted, all
> > hosts (about half of them) that list this IP first in their resolv.conf
> > then time out for multiple tests. If the .14 server is rebooted, all the
> > other hosts time out. Interestingly, none of the Windows clients issue
> > errors... only UNIX, Linux, and Macs... only those with an
> > /etc/resolv.conf. The end result is a host of "false positives", but
> > more importantly it looks bad on availability reports and causes
> > phones/pagers to go ballistic with unneeded emails.
> >
> > I'm trying to find a solution and I can't find one that I like:
> >
> > Solution 1) is to cluster the DNS servers. We have lots of clusters
> > here. This isn't good, though, as you don't normally cluster DNS
> > servers... they're meant to be redundant for a reason... one fails and
> > it uses the next one.
> >
> > Solution 2) is to set up a service/host dependency. My thought would be
> > either a host dependency that says if either .13 or .14 is down, then
> > don't alert for any other host that uses them. Or a service-to-host
> > dependency... if the DNS service is down, then don't alert on any of
> > these dependent hosts. Honestly, I'm not sure if you can mix host and
> > service dependencies like this... plus... if the DNS server is actually
> > down, then the DNS service is down, so better to use a host dependency.
> > The problem is that now we're not alerting on any dependent hosts, which
> > themselves could have a legitimate issue we want to know about. Plus,
> > what happens if the DNS server actually dies and takes a few hours/days
> > to rebuild/restore? At that point, the dependent hosts aren't watched
> > for a very long time.
> >
> > Solution 3) is to set up a UNIX/Linux DNS server that slaves all zones
> > from the AD servers and have all UNIX/Linux/Apple clients query
> > this server. This would work except that A) I need two of them to keep
> > redundancy and B) I've now added an extra layer of complication for
> > the sake of an application (Nagios)... not exactly good practice.
> >
> > Solution 4) is to set the timeout value of a host querying a DNS server.
> > Perhaps adjust the client to time out on the first listed nameserver
> > after only 10 seconds, then try the next one? Since most Nagios tests
> > have a minimum timeout value of 30 seconds, if the first DNS query timed
> > out after 10 seconds, it would go to the next one with, hopefully,
> > enough time to respond. The downside is having to adjust every single
> > server.
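
As a rough sketch of what Solution 4 could look like on a Linux client,
the Python snippet below renders a resolv.conf that uses the resolver's
`options timeout:` and `attempts:` settings from resolv.conf(5). The
values 1 and 2 are illustrative assumptions, the helper name is
hypothetical, and Solaris and OS X may handle these options differently,
so treat this as a starting point to test rather than a known fix. The
`domain` line is left out because `domain` and `search` are mutually
exclusive and only the last one listed takes effect.

```python
def render_resolv_conf(nameservers, search_domain, timeout_s=1, attempts=2):
    """Build resolv.conf text with a short per-server timeout.

    timeout_s and attempts are illustrative values (assumptions),
    not settings taken from the original environment.
    """
    lines = [f"nameserver {ip}" for ip in nameservers]
    lines.append(f"search {search_domain}")
    # Wait timeout_s seconds per server before moving to the next one.
    lines.append(f"options timeout:{timeout_s} attempts:{attempts}")
    return "\n".join(lines) + "\n"


# Addresses and domain as given in the message above.
print(render_resolv_conf(["10.1.1.13", "10.1.1.14"], "int.our.domain"), end="")
```

Generating the file from one script at least keeps the per-server
adjustment repeatable, which softens the "every single server" downside
a little.
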
> >
> > Has anyone else seen this? Anyone else using Windows AD servers to
> > provide DNS for *nix servers?
> >
> > --
> >
> >
> > A. Davis
> > Email: nccomp at gmail.com
> >
> > "There is no limit to what a man can accomplish
> > if he doesn't care who gets the credit." - Ronald Reagan
> >
>
> >
> ------------------------------------------------------------------------------
> > Crystal Reports - New Free Runtime and 30 Day Trial
> > Check out the new simplified licensing option that enables unlimited
> > royalty-free distribution of the report engine for externally facing
> > server and web deployment.
> > http://p.sf.net/sfu/businessobjects
> > _______________________________________________
> > Nagios-users mailing list
> > Nagios-users at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nagios-users
> > ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> > ::: Messages without supporting info will risk being sent to /dev/null
>
>
> ------------------------------------------------------------------
> Russell Adams RLAdams at AdamsInfoServ.com
>
> PGP Key ID: 0x1160DCB3 http://www.adamsinfoserv.com/
>
> Fingerprint: 1723 D8CA 4280 1EC9 557F 66E8 1154 E018 1160 DCB3
>