<div class="gmail_quote">On Tue, Nov 3, 2009 at 9:35 PM, Frost, Mark {PBG} <<a href="mailto:mark.frost1@pepsi.com">mark.frost1@pepsi.com</a>> wrote: <blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> <div><div></div><div class="h5"> >-----Original Message----- >From: Andreas Ericsson [mailto:<a href="mailto:ae@op5.se">ae@op5.se</a>] >Sent: Monday, November 02, 2009 7:02 AM >To: Frost, Mark {PBG} >Cc: <a href="mailto:nagios-users@lists.sourceforge.net">nagios-users@lists.sourceforge.net</a> >Subject: Re: [Nagios-users] Restarts resetting soft critical states > >On 10/29/2009 08:50 PM, Frost, Mark {PBG} wrote: >> You think you know an application and every once in a while you get a >surprise... >> >> Both the reporting server and the distributed node share the same >attributes for retention and soft states: >> >> soft_state_dependencies=0 >> passive_host_checks_are_soft=1 >> retain_state_information=1 >> use_retained_program_state=1 >> use_retained_scheduling_info=1 >> retained_host_attribute_mask=0 >> retained_service_attribute_mask=0 >> retained_process_host_attribute_mask=0 >> retained_process_service_attribute_mask=0 >> retained_contact_host_attribute_mask=0 >> retained_contact_service_attribute_mask=0 >> >> While I would assume the restarts would disrupt Nagios a bit what with >> having to do start-time tasks again, I would not have expected that it >> would "start over" with the status of some checks. >> >> What am I missing here? >> > >It seems you haven't grasped how bitmasks work. When you set the mask to >0, >you essentially tell it to not let anything through. Set them to -1, or >leave them at the default values and you'll get the kind of state >retention >you want. > </div></div>Thanks, Andreas. Unfortunately, I'm still puzzled. The mask values you refer to are already set to the defaults (they're all 0's). I've never touched those or paid much attention to them until now. 
I'm actually confused by 2 aspects of this. It seems to me that the things I'm trying to retain across a restart are soft check states (those are what are being reset). Looking at the MODATTR arguments in include/common.h (3.0.6) I don't see which of those attributes would govern this. There are the *ENABLED attributes, which really aren't changing here (and are retained). All the other MODATTRs are (it seems to me) not changing in this case either. The second thing that confuses me here is the verbiage used to describe the mask functionality: # RETAINED ATTRIBUTE MASKS (ADVANCED FEATURE) # The following variables are used to specify specific host and # service attributes that should *not* be retained by Nagios during # program restarts. So if MODATTR is set to none, based on the comment doesn't this mean that "NONE" of the attributes are NOT retained? I.e. all are retained (double-negative)? The on-line doc for these masks says "By default, all host and service attributes are retained." </blockquote><div> I don't know how the source code behaves, but I agree with this: a default nagios.cfg has all of the masks set to zero, presumably so that nothing is masked, i.e. what's retained is unaffected. </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> I do get masks, I just didn't see how these applied here. Your help is greatly appreciated. </blockquote><div> I just did a quick experiment with the default values for the *retain* variables in nagios.cfg, which are exactly what you quote: [1257281477] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;1;FILE_AGE CRITICAL: File not found - /tmp/nagios [1257281597] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;2;FILE_AGE CRITICAL: File not found - /tmp/nagios [1257281604] Caught SIGTERM, shutting down... [1257281604] Successfully shutdown... (PID=9617) [1257281605] Nagios 3.0.6 starting... 
(PID=9721) [1257281605] Local time is Tue Nov 03 21:53:25 CET 2009 [1257281605] LOG VERSION: 2.0 [1257281605] Finished daemonizing... (New PID=9722) [1257281715] SERVICE ALERT: localhost;File age;CRITICAL;HARD;3;FILE_AGE CRITICAL: File not found - /tmp/nagios Everything works as expected: the soft attempt count carried over the restart (SOFT;2 before, HARD;3 after). I'm guessing you have some other issue that's affecting Nagios' ability to save retention data. What are the values of state_retention_file and retention_update_interval in your config? Have you checked that state_retention_file is actually updated while Nagios runs, and that nothing basic is going wrong, such as the disk being close to full? Open up the file, grab the definition for the service in question, and see what values are being saved. HTH, Regards, Martin Melin </div></div>
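In case it helps with the last step (pulling the saved values for one service out of the retention file), here is a rough Python sketch. It assumes the file is made of "type {" ... "}" blocks containing key=value lines, and that the service block carries host_name, service_description, state_type and current_attempt fields; the sample text and both function names are made up for illustration, so check them against your own file before trusting the output.

```python
def parse_blocks(text):
    """Yield (block_type, fields) for each 'type {' ... '}' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if block_type is None:
            # Outside a block: look for an opening line such as "service {".
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            # Split on the first '=' only; values may contain '=' themselves.
            key, _, value = line.partition("=")
            fields[key] = value


def find_service(text, host, description):
    """Return the saved fields for one service, or None if it is absent."""
    for block_type, fields in parse_blocks(text):
        if (block_type == "service"
                and fields.get("host_name") == host
                and fields.get("service_description") == description):
            return fields
    return None


# Hypothetical excerpt, not copied from a real retention file.
SAMPLE = """\
host {
host_name=localhost
current_state=0
}
service {
host_name=localhost
service_description=File age
current_state=2
state_type=0
current_attempt=2
}
"""

if __name__ == "__main__":
    svc = find_service(SAMPLE, "localhost", "File age")
    print(svc["state_type"], svc["current_attempt"])
```

To run it against a real file, read the path configured as state_retention_file and pass its contents in place of SAMPLE. If the soft state is being saved, I would expect state_type to be 0 (soft) with a current_attempt greater than 1 just before the restart.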