<div class="gmail_quote">On Tue, Nov 3, 2009 at 9:35 PM, Frost, Mark {PBG} <span dir="ltr">&lt;<a href="mailto:mark.frost1@pepsi.com">mark.frost1@pepsi.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<div><div></div><div class="h5"><br>
<br>
>-----Original Message-----<br>
>From: Andreas Ericsson [mailto:<a href="mailto:ae@op5.se">ae@op5.se</a>]<br>
>Sent: Monday, November 02, 2009 7:02 AM<br>
>To: Frost, Mark {PBG}<br>
>Cc: <a href="mailto:nagios-users@lists.sourceforge.net">nagios-users@lists.sourceforge.net</a><br>
>Subject: Re: [Nagios-users] Restarts resetting soft critical states<br>
><br>
>On 10/29/2009 08:50 PM, Frost, Mark {PBG} wrote:<br>
>> You think you know an application and every once in a while you get a surprise...<br>
>><br>
>> Both the reporting server and the distributed node share the same attributes for retention and soft states:<br>
>><br>
>> soft_state_dependencies=0<br>
>> passive_host_checks_are_soft=1<br>
>> retain_state_information=1<br>
>> use_retained_program_state=1<br>
>> use_retained_scheduling_info=1<br>
>> retained_host_attribute_mask=0<br>
>> retained_service_attribute_mask=0<br>
>> retained_process_host_attribute_mask=0<br>
>> retained_process_service_attribute_mask=0<br>
>> retained_contact_host_attribute_mask=0<br>
>> retained_contact_service_attribute_mask=0<br>
>><br>
>> While I would assume the restarts would disrupt Nagios a bit what with<br>
>> having to do start-time tasks again, I would not have expected that it<br>
>> would "start over" with the status of some checks.<br>
>><br>
>> What am I missing here?<br>
>><br>
><br>
>It seems you haven't grasped how bitmasks work. When you set the mask to 0,<br>
>you essentially tell it to not let anything through. Set them to -1, or<br>
>leave them at the default values and you'll get the kind of state retention<br>
>you want.<br>
><br>
<br>
</div></div>Thanks, Andreas. Unfortunately, I'm still puzzled. The mask values you refer to are already set to the defaults (they're all 0's). I've never touched those or paid much attention to them until now.<br>
<br>
I'm actually confused by two aspects of this. First, what I'm trying to retain across a restart is the soft check states (those are what get reset). Looking at the MODATTR values in include/common.h (3.0.6), I don't see which of those attributes would govern this. There are the *ENABLED attributes, which aren't changing here (and are retained). None of the other MODATTRs seem to be changing in this case either.<br>
<br>
The second thing that confuses me is the wording used to describe the mask functionality:<br>
<br>
# RETAINED ATTRIBUTE MASKS (ADVANCED FEATURE)<br>
# The following variables are used to specify specific host and<br>
# service attributes that should *not* be retained by Nagios during<br>
# program restarts.<br>
<br>
So if the mask is set to none (0), doesn't the comment mean that none of the attributes are excluded from retention, i.e. all of them are retained (a double negative)? The online documentation for these masks says "By default, all host and service attributes are retained."<br>
</blockquote><div><br>I don't know how the source code behaves, but I agree with this reading. A default nagios.cfg has all of the masks set to zero, presumably so that nothing is masked, i.e. so that the masks don't affect what's retained.<br>
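<br>To make that reading concrete, here is a minimal sketch, assuming the masks behave the way the nagios.cfg comment describes (a set bit means "do not retain this attribute"). The flag names follow the MODATTR_* style from include/common.h, but the values and the retained_attributes helper are only illustrative and are not taken from the Nagios source:<br>

```python
# Illustrative bit flags. The names follow the MODATTR_* style in
# include/common.h, but the values are assumptions made for this sketch.
MODATTR_NONE = 0
MODATTR_NOTIFICATIONS_ENABLED = 1
MODATTR_ACTIVE_CHECKS_ENABLED = 2
MODATTR_PASSIVE_CHECKS_ENABLED = 4


def retained_attributes(modified, mask):
    """Return the modified-attribute bits that survive a restart.

    Assumption: a bit set in the mask means "do NOT retain this
    attribute", which is how the nagios.cfg comment reads.
    """
    return modified & ~mask


modified = MODATTR_NOTIFICATIONS_ENABLED | MODATTR_ACTIVE_CHECKS_ENABLED

# Mask of 0 (the shipped default): nothing is filtered out.
print(retained_attributes(modified, MODATTR_NONE))  # prints 3
# Mask of -1 (every bit set): everything is filtered out.
print(retained_attributes(modified, -1))  # prints 0
# Mask naming one attribute: only that attribute is dropped.
print(retained_attributes(modified, MODATTR_ACTIVE_CHECKS_ENABLED))  # prints 1
```

<br>Under this reading a mask of 0 keeps every attribute, which matches the "By default, all host and service attributes are retained" line from the documentation.<br>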
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
<br>
I do get masks, I just didn't see how these applied here.<br>
<br>
Your help is greatly appreciated.<br>
<br></blockquote><div><br>I just did a quick experiment with the default values for *retain* variables in nagios.cfg - which are exactly what you quote:<br><br>[1257281477] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;1;FILE_AGE CRITICAL: File not found - /tmp/nagios<br>
[1257281597] SERVICE ALERT: localhost;File age;CRITICAL;SOFT;2;FILE_AGE CRITICAL: File not found - /tmp/nagios<br>[1257281604] Caught SIGTERM, shutting down...<br>[1257281604] Successfully shutdown... (PID=9617)<br>[1257281605] Nagios 3.0.6 starting... (PID=9721)<br>
[1257281605] Local time is Tue Nov 03 21:53:25 CET 2009<br>[1257281605] LOG VERSION: 2.0<br>[1257281605] Finished daemonizing... (New PID=9722)<br>[1257281715] SERVICE ALERT: localhost;File age;CRITICAL;HARD;3;FILE_AGE CRITICAL: File not found - /tmp/nagios<br>
<br>Everything works as expected: the attempt counter carries over the restart, going from SOFT 2 to HARD 3.<br><br>I'm guessing you have some other issue that's affecting Nagios' ability to save retention data.<br><br>What are the values of state_retention_file and retention_update_interval in your configuration?<br>
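<br>If it helps, here is a rough sketch for pulling those two directives out of a config file. It assumes plain name=value lines with # comments; the read_directives helper and the sample path and interval are made up for illustration, so substitute the contents of your own nagios.cfg:<br>

```python
def read_directives(cfg_text, wanted):
    """Collect the requested name=value directives from nagios.cfg-style text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() in wanted:
            found[name.strip()] = value.strip()
    return found


# Hypothetical excerpt; the path and interval are placeholders, not your values.
sample = """
# STATE RETENTION FILE
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=60
retain_state_information=1
"""

wanted = {"state_retention_file", "retention_update_interval"}
print(read_directives(sample, wanted))
```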
<br>Have you checked that state_retention_file is actually updated while Nagios runs, that the disk isn't close to full, and that nothing else that basic is going wrong?<br><br>Open up the file, find the definition for the service in question, and see what values are being saved.<br>
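<br>Something along these lines would do the lookup. It is only a sketch: it assumes the retention file is made of "service {" blocks containing key=value lines and closed by "}", and the sample block, its field names and its values are invented for illustration rather than copied from a real retention file:<br>

```python
def find_service(retention_text, host, description):
    """Return the key/value pairs of the matching 'service { ... }' block, or None."""
    block = None
    for raw in retention_text.splitlines():
        line = raw.strip()
        if line == "service {":
            # Start collecting a new service block.
            block = {}
        elif line == "}":
            if (block is not None
                    and block.get("host_name") == host
                    and block.get("service_description") == description):
                return block
            block = None
        elif block is not None and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value
    return None


# Invented sample data; field names and values are assumptions for this sketch.
sample = """
host {
host_name=localhost
current_state=0
}
service {
host_name=localhost
service_description=File age
current_state=2
state_type=0
current_attempt=2
}
"""

service = find_service(sample, "localhost", "File age")
print(service)
```

<br>If the saved attempt count and state type don't match what the log showed just before the restart, the problem is probably on the saving side rather than the loading side.<br>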
<br>HTH,<br><br>Regards,<br>Martin Melin<br><br></div></div>