Log monitoring with Nagios - recommendations?

Risto Vaarandi risto.vaarandi at seb.ee
Wed Aug 29 13:18:49 CEST 2007
hi all,

A few weeks ago I posted a question to this list about passive service 
checks - I was actually experimenting with Nagios as an event log 
monitoring GUI. I am tracking event logs with SEC and also sending out 
alerts with it, but I would still like to see correlated log messages in 
the Nagios web interface as well.

During the experimentation, I created a volatile service definition for 
a host group of Linux servers which looks similar to the example in 
the Nagios documentation: 
http://nagios.sourceforge.net/docs/2_0/int-snmptrap.html
I also have host checks enabled for the Linux host group, since I'd like 
to exploit the Nagios capability of suppressing service alerts when the 
host is down (I also have a number of active service checks enabled for 
these hosts, like web server monitoring).

However, when a lot of correlated log messages are written to the Nagios 
command pipe with a CRITICAL severity in a short time period, a host 
check is run for each such message, which creates a lag between reading 
and displaying a message (the lag could be several minutes long for the 
last message).
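
For reference, here is a minimal sketch of how one such log message might
be submitted as a passive check result. The line format is the standard
PROCESS_SERVICE_CHECK_RESULT external command; the function names, the
command pipe path and the use of a LogMessages service are assumptions
made for illustration only, not my actual setup.

```python
import time

# Assumed location of the Nagios command pipe. The real path is set by
# 'command_file' in nagios.cfg and differs between installations.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

CRITICAL = 2  # plugin return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN


def format_passive_result(host, service, code, output, now=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    # An external command must fit on a single line.
    text = output.replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{text}\n")


def submit(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios command pipe."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

Calling submit(format_passive_result("node03", "LogMessages", CRITICAL,
"some log message")) once per correlated message is the kind of burst that
triggers the host checks described above.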

I could use several tricks to avoid this:
1) disable host checks altogether (i.e., remove 'check_command' from 
host definitions)
2) create a dummy host without 'check_command' that would have a special 
service (e.g. LogMessages) for displaying log messages from all servers
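
Option 2 could be sketched roughly as follows. The dummy host name
"loghost" is hypothetical, as is the function name; the idea is only that
every message is submitted against the one host that has no
'check_command', with the originating server named in the message text.

```python
import time

DUMMY_HOST = "loghost"   # hypothetical host defined without 'check_command'
SERVICE = "LogMessages"  # the special service from option 2


def to_dummy_host(origin_host, code, message, now=None):
    """Format a passive result for the dummy host, tagged with its origin."""
    timestamp = int(time.time() if now is None else now)
    # Prefix the real server name so it remains visible in the web interface.
    text = f"{origin_host}: {message}".replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{DUMMY_HOST};{SERVICE};{code};{text}\n")
```

The drawback of this layout is that all servers share a single service, so
the messages are no longer associated with the host they came from.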

Still, is there a way to have the LogMessages service associated with 
each host, and also have host checks enabled? In other words, can I 
prevent Nagios from running a host check when a certain service goes to 
non-OK state?

If someone has other clever ideas for setting up log monitoring in 
Nagios, please be so kind as to comment :)

br,
risto


Marc Powell wrote:
> 
>> -----Original Message-----
>> From: nagios-users-bounces at lists.sourceforge.net [mailto:nagios-users-
>> bounces at lists.sourceforge.net] On Behalf Of Risto Vaarandi
>> Sent: Friday, August 10, 2007 6:43 AM
>> To: nagios-users at lists.sourceforge.net
>> Subject: [Nagios-users] passive service checks with 1 second interval
>>
> 
> 
>> However, then the service goes to a critical state:
>>
>> [1186719373] EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;2;node03 DOWN at 1186719373
>> and starting from this moment, external checks are read from command
>> file with 9-10 second intervals, with a "service alert" and notification
>> at the end of each activity burst:
> 
> This is probably a result of your host check. When a service initially
> returns a non-ok state, nagios stops everything to perform the host
> check, up to max_check_attempts for that host. Once that is complete,
> nagios will start performing other tasks again. You'll most likely want to
> remove your host's check_command entirely.
> 
>> Then the service goes up, and after a while I am seeing the
>> following log entries:
>>
>> [1186719447] EXTERNAL COMMAND:
>> PROCESS_SERVICE_CHECK_RESULT;node03;NodeState;node03 up at 1186719447
>> [1186719447] Warning: The results of service 'NodeState' on host
>> 'node03' are stale by 11 seconds (threshold=60 seconds).  I'm forcing an
>> immediate check of the service.
> 
> I don't know about this one.
> 
>> Is there a way to speed up the processing of CRITICAL service checks?
>> I'd like to get a notification within the same second.
> 
> I won't say it's not possible but it feels very aggressive to me based
> on my experience. I know there are/were others on the list trying to
> monitor at or close to that resolution but I don't know how successful
> they've been. Perhaps they'll chime in if they're still around.
> 
> --
> Marc
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >>  http://get.splunk.com/
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
> 

