Log monitoring with Nagios - recommendations?
Risto Vaarandi
risto.vaarandi at seb.ee
Wed Aug 29 13:18:49 CEST 2007
hi all,
a few weeks ago I posted a question to this list about passive service
checks. I was actually experimenting with Nagios as an event log
monitoring GUI. I am tracking event logs with SEC and also sending out
alerts with it, but I would still like to see correlated log messages in
the Nagios web interface as well.
During the experimentation, I created a volatile service definition for
a host group of Linux servers, similar to the example in the
Nagios documentation:
http://nagios.sourceforge.net/docs/2_0/int-snmptrap.html
I also have host checks enabled for the Linux host group, since I'd like
to exploit the Nagios capability of suppressing service alerts when the
host is down (I also have a number of active service checks enabled for
these hosts, such as web server monitoring).
However, when many correlated log messages are written to the Nagios
command pipe with CRITICAL severity in a short time period, a host
check is run for each such message. This creates a lag between reading
and displaying a message (the lag can be several minutes long for the
last message).
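
For readers who want to reproduce the setup, here is a minimal sketch of how a passive result can be written to the command pipe. It only assumes the standard PROCESS_SERVICE_CHECK_RESULT external command format; the command file path below is the usual default and may differ on your system, and the function names are made up for illustration.

```python
import time

# Assumed default location of the Nagios external command file; adjust it
# to match the command_file setting in your nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), "
                         "2 (CRITICAL) or 3 (UNKNOWN)")
    if timestamp is None:
        timestamp = int(time.time())
    # Each external command is a single line, so collapse embedded newlines.
    output = " ".join(output.splitlines())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def submit(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios command pipe."""
    # Opening a named pipe for writing blocks until Nagios has it open.
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

Calling submit(format_passive_result("node03", "NodeState", 2, "node03 DOWN")) would hand one CRITICAL result to Nagios; the host and service names here are only placeholders.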
I could use several tricks to avoid this lag:
1) disable host checks altogether (i.e., remove 'check_command' from
host definitions)
2) create a dummy host without 'check_command' that would have a special
service (e.g. LogMessages) for displaying log messages from all servers
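
As a rough sketch of trick 2, the script that feeds the command pipe could address every message to the dummy host and carry the real host name in the plugin output. The dummy host name "loghost" and the function name are hypothetical; only the LogMessages service name comes from the description above.

```python
def route_to_log_host(origin_host, message, return_code, timestamp,
                      log_host="loghost", service="LogMessages"):
    """Format a passive result for the dummy host's LogMessages service.

    "loghost" is a hypothetical dummy host defined without check_command,
    so a non-OK result for it does not trigger a host check.
    """
    # Keep the originating server visible by prefixing it to the output,
    # and collapse newlines because a command must fit on one line.
    output = "%s: %s" % (origin_host, " ".join(message.splitlines()))
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, log_host, service, return_code, output)
```

The obvious cost is that all servers share a single service, so the web interface shows the service state from the most recent message only, and the message is no longer tied to the host it came from.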
Still, is there a way to have the LogMessages service associated with
each host, and also have host checks enabled? In other words, can I
prevent Nagios from running a host check when a certain service goes to
a non-OK state?
If someone has other clever ideas for setting up log monitoring in
Nagios, please be so kind and comment :)
br,
risto
Marc Powell wrote:
>
>> -----Original Message-----
>> From: nagios-users-bounces at lists.sourceforge.net [mailto:nagios-users-
>> bounces at lists.sourceforge.net] On Behalf Of Risto Vaarandi
>> Sent: Friday, August 10, 2007 6:43 AM
>> To: nagios-users at lists.sourceforge.net
>> Subject: [Nagios-users] passive service checks with 1 second interval
>>
>
>
>> However, then the service goes to a critical state:
>>
>> [1186719373] EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719373
>> and starting from this moment, external checks are read from command
>> file with 9-10 second intervals, with a "service alert" and notification
>> at the end of each activity burst:
>
> This is probably a result of your host check. When a service initially
> returns a non-ok state, nagios stops everything to perform the host
> check, up to max_check_attempts for that host. Once that is complete,
> nagios will start performing other tasks again. You'll most likely want to
> remove your host's check_command entirely.
>
>> Then the service goes up, and after a while I am seeing the
>> following log entries:
>>
>> [1186719447] EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;node03 up at 1186719447
>> [1186719447] Warning: The results of service 'NodeState' on host
>> 'node03' are stale by 11 seconds (threshold=60 seconds). I'm forcing an
>> immediate check of the service.
>
> I don't know about this one.
>
>> Is there a way to speed up the processing of CRITICAL service checks?
>> I'd like to get a notification within the same second.
>
> I won't say it's not possible but it feels very aggressive to me based
> on my experience. I know there are/were others on the list trying to
> monitor at or close to that resolution but I don't know how successful
> they've been. Perhaps they'll chime in if they're still around.
>
> --
> Marc
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems? Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>