<div class="gmail_quote">Hi all,<br><br>At work we have a Nagios setup with about 40 hosts and 150 services.<br><br>Of those 40 hosts, about 15 are workstations that operators use depending on their shift. While an operator is working on a given machine performing particular tasks, specific software needs to be running (third-party client software etc. that feeds data into the systems we use). I monitor things like how many instances are running, whether the software is generating the expected output, whether expected services are running, whether there's enough free disk space, CPU utilisation, and so on.<br>
<br>If an operator accidentally starts multiple copies of some of the software, or a phantom copy is running in the background (occasionally GUIs crash and leave background processes running, causing all sorts of gremlins), it's handy to know that things are running outside of normal bounds, and it helps me diagnose any problems. The same goes for when a machine is about to run out of disk space due to some rogue logging process.<br>
<br>On days when a given operator is not working, their system may be switched off, or if it is on, certain services may not need to be running.<br><br>To overcome firewall issues (the systems are spread across several states) they all tend to push passive test results back to the central Nagios server.<br>
<br>This means that on any given day, a particular host is likely to be either switched off or not running all the services it would on an active day, because its operator is not rostered on. I then get a sea of red in Nagios, which leads to Chernobyl issues (the important alarms not standing out above the ones that are "ok to be critical").<br>
<br>Now, service check time periods only apply to active service checks, not passive service checks.<br><br>How does one get around this situation of variable periods of relevance for passively monitored services?<br><br>
My thoughts were that perhaps I need to create an additional web interface where operators say when they are using a particular machine and what for. Behind the scenes this would send the relevant external commands to Nagios to do things like setting an OK state and disabling further passive checks across the host, or doing this for individual services. But I wondered if there is a cleaner way to do this?<br>
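<br>A rough sketch of what that backend might do, in Python. The command names (PROCESS_SERVICE_CHECK_RESULT, DISABLE_PASSIVE_SVC_CHECKS, ENABLE_PASSIVE_SVC_CHECKS) are standard Nagios external commands, but the command file path, the helper function names and the "operator not rostered" text are placeholders of mine. I have not verified that the OK result is always applied before the disable takes effect, so treat the ordering as an assumption.<br>

```python
import time

# Assumed path; the real one is whatever command_file points to in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_command(name, *args, now=None):
    """Build one external command line: [epoch] NAME;arg1;arg2..."""
    ts = int(time.time() if now is None else now)
    return "[{}] {}\n".format(ts, ";".join([name] + [str(a) for a in args]))


def off_shift_commands(host, services, now=None):
    """Mark each service OK, then stop accepting passive results for it."""
    lines = []
    for svc in services:
        lines.append(format_command(
            "PROCESS_SERVICE_CHECK_RESULT", host, svc, 0,
            "OK - operator not rostered", now=now))
        lines.append(format_command(
            "DISABLE_PASSIVE_SVC_CHECKS", host, svc, now=now))
    return lines


def on_shift_commands(host, services, now=None):
    """Start accepting passive results again when the operator signs on."""
    return [format_command("ENABLE_PASSIVE_SVC_CHECKS", host, svc, now=now)
            for svc in services]


def send(lines, command_file=COMMAND_FILE):
    """Write the command lines to the Nagios external command file."""
    with open(command_file, "w") as fh:
        fh.writelines(lines)
```

The web form would then call something like send(off_shift_commands("ws01", ["Client Software", "Disk Space"])) when an operator signs off, with the host and service names here being made-up examples.<br>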
<br>Alternatively, perhaps I could create a user-controlled service that indicates whether the operator is active or not, and then, depending on the state of this service, not care about the state of its "dependent services".<br>
<br>I know Nagios is generally geared towards monitoring the traditional concept of a server and service (always on 24x7, or otherwise at fixed, inflexible intervals), but unfortunately the environment I work in is presently a lot more dynamic than that.<br>
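<br>For the dependency idea, the closest built-in thing I can see is a servicedependency object. Below is a small Python sketch that renders one; the function name and the host and service names are hypothetical. As far as I understand it, notification_failure_criteria only suppresses notifications for the dependent service while the master service is in one of the listed states, so it would not by itself remove the red from the web interface.<br>

```python
def service_dependency(host, master_service, dependent_service):
    """Render one Nagios servicedependency object definition as text."""
    directives = [
        ("host_name", host),
        ("service_description", master_service),
        ("dependent_host_name", host),
        ("dependent_service_description", dependent_service),
        # Suppress notifications while the master is WARNING, UNKNOWN or CRITICAL.
        ("notification_failure_criteria", "w,u,c"),
    ]
    body = "".join("    {}{}\n".format(key.ljust(34), value)
                   for key, value in directives)
    return "define servicedependency {\n" + body + "}\n"


if __name__ == "__main__":
    # "Operator Active" would be the user-controlled service (made-up name).
    print(service_dependency("ws01", "Operator Active", "Client Software"))
```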
</div>