Generating % error when generating reports?
Greg Vickers
g.vickers at qut.edu.au
Wed May 18 01:54:13 CEST 2005
Hi all,
When thinking about SLAs and reporting, I had a thought:
When generating reports, say for an active service, the point in time
at which the service actually changed state is not the same as the point
at which Nagios detects that state change. Therefore there is a margin of
error in the reports that Nagios generates (fairly small when the state
duration is long relative to the regular check_interval of that service).
If there were a patch to allow a % margin of error to be calculated for
a given report, would the pseudocode look something like this (at a high
level, only accounting for HARD state changes):
(time0) found HARD service state change (e.g. 0 sec)
... get the regular check_interval for that service (e.g. 5 min or 300 sec)
(time1) found HARD service state change (e.g. 100 min or 6000 sec after time0)
calculate % of error in report: 2*300/(6000 - 0)*100 = 10% margin of error (ow)
The above calculation assumes the worst possible timing (300 secs)
between a state change and Nagios actually detecting that change (2
times 300s because there may be 300 sec time for detection of the first
state change and 300 sec later for the detection of the second state
change) and does not account for a manually re-scheduled service check.
(The responsible contact may fix the service then schedule a check for
now - there would be a small time window.)
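
As a rough illustration of the calculation above, here is a minimal
Python sketch. The function name and parameters are hypothetical and not
part of Nagios; it only reproduces the worst-case arithmetic from the
pseudocode, assuming each HARD state change can be detected up to one
full check_interval late.

```python
def worst_case_margin_percent(check_interval_s, state_duration_s, transitions=2):
    """Worst-case report error as a percentage of the observed state duration.

    Hypothetical helper, not a Nagios API. Assumes every state change may
    be detected up to one full check_interval after it really happened,
    and that a state's duration is bounded by `transitions` such changes.
    """
    if state_duration_s <= 0:
        raise ValueError("state_duration_s must be positive")
    return 100.0 * transitions * check_interval_s / state_duration_s


# Values from the example: 300 sec check_interval, 6000 sec between
# the two detected HARD state changes.
print(worst_case_margin_percent(300, 6000))  # 10.0
```

Like the pseudocode, this ignores manually re-scheduled checks, so it is
an upper bound under the stated assumption rather than a measured error.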
Obviously you could reduce this % of error by reducing the check_interval
for critical services or by using passive checks. (The former will
increase the load on the monitoring server and the monitored hosts; the
latter may not be suitable.)
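
To get a feel for that trade-off, the same worst-case formula can be
turned around to ask how short the check_interval would have to be for a
given target margin. This is only a sketch: the function name is
hypothetical, and it assumes the worst-case model where each of two
state changes is detected up to one full check_interval late.

```python
def max_check_interval_s(target_margin_percent, state_duration_s, transitions=2):
    """Largest check_interval (seconds) that keeps the worst-case margin
    at or below target_margin_percent for a state of the given duration.

    Hypothetical helper, not a Nagios API; it inverts
    margin% = 100 * transitions * check_interval / state_duration.
    """
    if target_margin_percent <= 0 or state_duration_s <= 0:
        raise ValueError("arguments must be positive")
    return target_margin_percent * state_duration_s / (100.0 * transitions)


# For a 6000 sec state duration:
print(max_check_interval_s(10, 6000))  # 300.0
print(max_check_interval_s(2, 6000))   # 60.0
```

Under that assumption, a 2% worst-case margin on a 100 minute outage
would need a check every 60 sec instead of every 300 sec, which is where
the extra load comes from.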
This % value is a worst case and not terribly realistic, as the check
will probably happen less than 300 seconds after the service changes state.
However, if this % value were available, Nagios administrators could
give the PHBs more certainty about the report values (some PHBs
actually RTFM the Nagios doco, damnit), rather than have a PHB say
"Service blah went down at time x, but your report shows it as down at
time y."
Anyway, just an idea I had that I wanted to share with -devel, get your
machine guns out...
--
Greg Vickers
Computer Systems Officer
Teaching and Learning Support Services, Systems and Architecture
Queensland University of Technology
Kelvin Grove Campus, E409
Phone: (07) 3864 8276
Mobile: 0416 001 674, Speed Dial #6 6147
Email: g.vickers at qut.edu.au
TALSS web site: http://www.talss.qut.edu.au/
CRICOS No. 00213J