Hi all,<br><br>This is a long post, but (I think) it's an interesting problem... <br><br>I'm struggling a little with a check procedure I'm trying to create. I'm a long-time Nagios user and I *know* there's a way to do this, but I'm having trouble wrapping my head around the best way to achieve the desired effect.<br>
<br>I have a data center that I monitor with Nagios. I have a number of database servers (about 40) hosting databases for a number of customers (about 500).<br><br>I have a number of application checks that I need to run against these databases - things like a data import queue, number of active threads, etc. There are about 10 checks to perform for each customer database. In other words, that's about 10 checks x 500 customer databases, or roughly 5,000 checks in total.<br>
<br>The first problem is that I need to move a customer's database from one server to another from time to time for various reasons (capacity / performance, etc). This means all the checks for that customer database have to move from one database host to another.<br>
<br>What I've done so far is to create dynamic service checks for each of these application counters. The checks do the following:<br><br>1. Execute a query against the current host's master database and retrieve a list of customer database instances on this host.<br>
2. For each database, query the relevant application counter.<br>3. If there are any problems (warn or crit thresholds surpassed), return warn or crit and list only the databases that are in trouble.<br><br>e.g.: WARN: at least one database is in trouble. \n Customer1: import queue is > 500.<br>
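<br>For anyone who wants to picture the aggregated check, here is a minimal sketch of the summarizing step only. It assumes the per-database values have already been queried and arrive on stdin as "database value" pairs; the function name summarize_counter and the message wording are made up for illustration, not my actual script.<br>

```shell
#!/bin/sh
# Hypothetical sketch: summarize one application counter across all customer
# databases on a host, in Nagios plugin style.
# stdin:  one "database value" pair per line (querying the values is not shown)
# usage:  summarize_counter WARN CRIT
# return: 0 = OK, 1 = WARN, 2 = CRIT
summarize_counter() {
    awk -v warn="$1" -v crit="$2" '
        BEGIN { state = 0 }
        {
            # one perfdata entry per database: label=value;warn;crit
            perf = perf " " $1 "=" $2 ";" warn ";" crit
            if ($2 > crit) {
                state = 2
                bad = bad "\n" $1 ": value " $2 " > " crit
            } else if ($2 > warn) {
                if (state < 1) state = 1
                bad = bad "\n" $1 ": value " $2 " > " warn
            }
        }
        END {
            label = (state == 2) ? "CRIT" : ((state == 1) ? "WARN" : "OK")
            msg = (state == 0) ? "all databases within thresholds" : "at least one database is in trouble"
            # first line: status text and perfdata; following lines: only the databases in trouble
            printf "%s: %s |%s%s\n", label, msg, perf, bad
            exit state
        }'
}
```

<br>Something like `printf 'customer1 600\ncustomer2 10\n' | summarize_counter 500 1000` would report WARN and list only customer1, while still emitting perfdata for both databases in one string.<br>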
<br>OK, so far so good. When I move a customer database, the check on the old server just doesn't get that customer's database in the list anymore and the check on the new server begins checking it. Great. First problem solved.<br>
<br>Now the tricky part. I'm using PNP to graph performance data. The check script described above returns a LONG perfdata string with perfdata for each database. The way PNP works, it creates one big RRD file for each check - in other words, one RRD file with a data source for each customer database on that server. When a customer database moves to a new database server, the RRD file is not - cannot be - updated to match, so the perfdata just stops for that customer database. It is not easy to move a data source from one RRD file to another, so I have a conundrum.<br>
<br>One way to fix this is to simply create a check in Nagios for each customer database. If a database moves, just delete the check on the old server and create it on the new server, move the relevant RRD file to its new home under the new server's PNP directory, and we're done. But that means maintaining 5,000 or so checks on the Nagios server, with all the associated overhead of running so many checks. The way I have it now, there are only 10 x 40 (400) checks, which is much more manageable.<br>
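<br>To make the "move the RRD file" step concrete, here is a rough sketch. It assumes PNP keeps one directory per host under its perfdata directory, with a SERVICE.rrd (and a matching SERVICE.xml) per service; that layout and the function name move_perfdata are assumptions on my part, so check your own install before trusting it.<br>

```shell
#!/bin/sh
# Hypothetical sketch: move one service's PNP files from the old host's
# directory to the new host's directory.
# Assumed layout: BASE/HOST/SERVICE.rrd and BASE/HOST/SERVICE.xml
# usage: move_perfdata BASE OLD_HOST NEW_HOST SERVICE
move_perfdata() {
    base=$1 old=$2 new=$3 svc=$4
    mkdir -p "$base/$new" || return 1
    for ext in rrd xml; do
        src="$base/$old/$svc.$ext"
        # only move files that actually exist; a missing .xml is not an error
        if [ -e "$src" ]; then
            mv "$src" "$base/$new/" || return 1
        fi
    done
    return 0
}
```

<br>This only works if the service description stays the same on the new host, since the file name is derived from it.<br>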
<br>I looked at using check_multi, but it suffers from the same problem - the perfdata is returned for all child checks in one perfdata string.<br><br>What I need is a way to dynamically build service checks for each database server. I'm thinking about a check command that does:<br>
<br>for db in $(cat db-server-host-name.txt); do<br> check_nrpe -H db-server-host-name -c check_app -a "$db"<br>done<br><br>but I'm not sure how this would work in terms of Nagios service check definitions. One possibility is to use another script outside of Nagios that does something like:<br>
<br>in nagios.cfg:<br>cfg_file=/etc/nagios/database-checks.cfg<br><br>for each server in `cat db-servers.txt`; do<br>
' query master database, get a list of customer databases<br>
' dump list to db-server-host-name.txt<br>
done<br><br>for each db in `cat db-server-host-name.txt`; do<br> ' modify database-checks.cfg<br> ' create service definitions for each database on each database server<br>done<br>service nagios reload<br><br>or something like that.<br>
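<br>The "create service definitions" step could look roughly like the sketch below. It reads database names from stdin and prints one service definition per database for a given host. The generic-service template, the check_app_db command name, and the service_description scheme are all placeholders I made up; they would have to match whatever is actually defined on the Nagios server.<br>

```shell
#!/bin/sh
# Hypothetical sketch: print one Nagios service definition per customer database.
# stdin: one database name per line
# usage: emit_services HOST < db-list.txt
# Assumptions: a "generic-service" template and a "check_app_db" command
# (taking the database name as its argument) already exist in the config.
emit_services() {
    host=$1
    while IFS= read -r db; do
        [ -n "$db" ] || continue    # skip blank lines
        cat <<EOF
define service {
    use                 generic-service
    host_name           $host
    service_description app_counter_$db
    check_command       check_app_db!$db
}

EOF
    done
}
```

<br>The output would be collected for every server into a temporary file, checked with `nagios -v /etc/nagios/nagios.cfg`, and only then swapped in as database-checks.cfg before the reload, so a bad list never takes Nagios down.<br>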
<br>I'm open to suggestions if anyone has a better way to do this... maybe I'm over-complicating this - I have been buried in this conundrum for a few days and may not be seeing the trees anymore... :/<br><br>Thank you all!<br>
<br>J<br>