Bug report: downtimes beyond 2038 cause event queue errors

Andreas Ericsson ae at op5.se
Thu Apr 4 23:55:20 CEST 2013


On 04/04/2013 06:32 PM, Ton Voon wrote:
> Hi!
>
> While upgrading from Nagios 3 to Nagios 4 we've come across a problem, and we can't work out where the fix should go. It occurs when an event is scheduled for a time beyond 2038.
>

Why on earth would you want to schedule something to end beyond 2038?
It sounds like you're using a patch on a workaround for something that
was the wrong solution in the first place.

> Recreation steps:
>    * Set a downtime on a service to end next day
>    * Stop Nagios
>    * Edit the retention.dat so that the end_date=4514791088 (some other values seem to work)
>    * Start Nagios
>
> When Nagios starts, it will not run any scheduled events in the events queue.
>

Ouch. That's pretty bad.

> This fails on CentOS 5 64-bit, though it appears to work on Debian Squeeze 32-bit, so it may be a 64-bit-only issue.
>
> We think this is an issue when the event is scheduled via squeue_add(). We've managed to get the test-squeue to fail by changing the time value to be greater than 2038 with the following:
>
> Index: test-squeue.c
> ===================================================================
> --- test-squeue.c	(revision 2716)
> +++ test-squeue.c	(working copy)
> @@ -116,7 +116,7 @@
>   	sq_test_random(sq);
>   	t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");
>
> -	t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
> +	t((a.evt = squeue_add(sq, time(NULL)*2, &a)) != NULL);
>   	t(squeue_size(sq) == 1);
>   	t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
>   	t(squeue_size(sq) == 2);
>
> This gives the test result of:
>
> ### squeue tests
>    FAIL max <= *d @test-squeue.c:86
>    FAIL x == &b @test-squeue.c:133
>    FAIL x->id == b.id @test-squeue.c:134
>    FAIL x == &c @test-squeue.c:141
> about to fail pretty fucking hard...
> ea: 0xbfe065e0; &b: 0xbfe065d8; &c: 0xbfe065d0; ed: 0xbfe065c8; x: 0xbfde9b80
>    FAIL x == &b @test-squeue.c:152
>    FAIL x->id == b.id @test-squeue.c:153
>    FAIL x == &b @test-squeue.c:160
>    FAIL x->id == b.id @test-squeue.c:161
>    FAIL x == &c @test-squeue.c:166
>    FAIL x->id == c.id @test-squeue.c:167
> Test results: 390637 passed, 10 failed
>
> Changing to a factor of 1.1 instead of 2 passes:
>

I'm not surprised. With a factor of 1.1 the value still fits in a signed
32-bit time_t, i.e. it's still before January 2038.
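
To make that arithmetic concrete, here's a rough sketch. The helper
names fits_in_int32() and scale_time() are made up for illustration and
are not part of squeue or Nagios, and it assumes the interesting limit
is a signed 32-bit seconds value.

```c
#include <stdint.h>

/* Hypothetical helper: 1 if secs fits in a signed 32-bit value, else 0. */
static int fits_in_int32(int64_t secs)
{
	return secs >= INT32_MIN && secs <= INT32_MAX;
}

/* Hypothetical helper: scale a timestamp by num/den in 64-bit integer
 * math, so the result itself can't wrap while we inspect it. */
static int64_t scale_time(int64_t now, int64_t num, int64_t den)
{
	return now * num / den;
}
```

With now = 1365112520 (roughly when this mail was sent), a factor of
11/10 gives 1501623772, which fits, while a factor of 2 gives
2730225040, which doesn't. The end_date=4514791088 from the recreation
steps doesn't fit either.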

What are the sizes of time_t, long and struct timeval on systems where
it fails?
What are the sizes on systems where it succeeds?
Does time_t differ in signedness between them?
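
Something along these lines, run on both a failing and a working box,
would answer those questions. It's only a sketch: the function names
are mine, and it assumes a POSIX system that provides <sys/time.h>.

```c
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: 1 if time_t is a signed type, 0 if unsigned. */
static int time_t_is_signed(void)
{
	return (time_t)-1 < (time_t)0;
}

/* Hypothetical helper: print the sizes (in bytes) asked about above. */
static void print_time_sizes(void)
{
	struct timeval tv;

	printf("sizeof(time_t)         = %zu\n", sizeof(time_t));
	printf("sizeof(long)           = %zu\n", sizeof(long));
	printf("sizeof(struct timeval) = %zu\n", sizeof(struct timeval));
	printf("sizeof(tv.tv_sec)      = %zu\n", sizeof(tv.tv_sec));
	printf("sizeof(tv.tv_usec)     = %zu\n", sizeof(tv.tv_usec));
	printf("time_t is %s\n", time_t_is_signed() ? "signed" : "unsigned");
}
```

Calling print_time_sizes() from a small main() on each platform and
comparing the output should be enough.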

I think a runtime check based on those sizes should work just fine, and
also be optimized away so we don't actually have to pay for it, but I'm
curious to see where it actually goes wrong. If it's before we get to
see the number in squeue.c we're pretty much fscked, as the only option
then is a macro which does voodoo-casting so the squeue api sees the
right number.
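
Roughly what I have in mind for the check is below. It's a sketch only:
SKETCH_SEC_BITS and sketch_when_is_representable() are hypothetical
names, and the 32-bit budget for seconds is an assumption made for the
sake of the example, not what squeue.c necessarily uses.

```c
#include <stdint.h>
#include <time.h>

/* ASSUMPTION: number of bits the queue can spare for the seconds value. */
#define SKETCH_SEC_BITS 32

/* Hypothetical check: 1 if 'when' can be stored in SKETCH_SEC_BITS bits. */
static int sketch_when_is_representable(time_t when)
{
	if (when < 0)
		return 0;

	/* sizeof() is a compile-time constant, so where time_t is too
	 * narrow to overflow the budget the compiler can fold this
	 * branch and drop the range test below entirely. */
	if (sizeof(time_t) * 8 <= SKETCH_SEC_BITS)
		return 1;

	return ((uint64_t)when >> SKETCH_SEC_BITS) == 0;
}
```

The caller would then refuse (and log) an event whose time fails the
check instead of handing it to the queue.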

> ### squeue tests
> Test results: 390647 passed, 0 failed
>
> This worked in Nagios 3, so we're guessing that the change to use the squeue library for events is probably where this limitation has come in.
>
> Any thoughts?
>

Well, modifying the evt_compute_pri() algorithm to discard
everything but the 21 least significant bits of the tv->tv_usec
would allow us to use 43 bits for the seconds value. That would
land us somewhere in the year 141234 before we run out of seconds.
It's not a real fix though, since we could live with discarding
events that are patently absurd, but blocking the entire scheduler
because we get a bogus date is just plain wrong.

Besides, with 43 bits for the seconds we could still get a number
too large for us to handle, and we'd be right back at square one.
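
For illustration, the 43/21 split combined with rejecting out-of-range
dates could look something like this. It is not the actual
evt_compute_pri() code: sketch_compute_pri() and the SKETCH_* macros are
invented for the example, and it assumes a 64-bit priority value.

```c
#include <stdint.h>
#include <sys/time.h>

/* ASSUMPTION: 64-bit priority, split as 43 bits of seconds and
 * 21 bits of microseconds, as described above. */
#define SKETCH_USEC_BITS 21
#define SKETCH_SEC_BITS  (64 - SKETCH_USEC_BITS)
#define SKETCH_SEC_MAX   ((UINT64_C(1) << SKETCH_SEC_BITS) - 1)
#define SKETCH_USEC_MASK ((UINT64_C(1) << SKETCH_USEC_BITS) - 1)

/* Hypothetical replacement: store the priority in *pri and return 0,
 * or return -1 without touching *pri if the seconds value is negative
 * or doesn't fit in SKETCH_SEC_BITS bits. */
static int sketch_compute_pri(const struct timeval *tv, uint64_t *pri)
{
	if (tv->tv_sec < 0 || (uint64_t)tv->tv_sec > SKETCH_SEC_MAX)
		return -1;

	/* seconds in the high bits, microseconds in the low 21 bits */
	*pri = ((uint64_t)tv->tv_sec << SKETCH_USEC_BITS)
	     | ((uint64_t)tv->tv_usec & SKETCH_USEC_MASK);
	return 0;
}
```

The point of the -1 return is that a bogus date gets discarded by the
caller rather than poisoning the ordering of everything else in the
queue.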

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
