Bug report: downtimes beyond 2038 cause event queue errors
Ton Voon
ton.voon at opsview.com
Mon Apr 8 14:12:20 CEST 2013
On 4 Apr 2013, at 22:55, Andreas Ericsson wrote:
>> This fails on CentOS 5 64bit, though it appears to work on Debian Squeeze 32bit, so it may be a 64 bit only issue.
>>
>> We think this is an issue when the event is scheduled via squeue_add(). We've managed to get the test-squeue to fail by changing the time value to be greater than 2038 with the following:
>>
>> Index: test-squeue.c
>> ===================================================================
>> --- test-squeue.c (revision 2716)
>> +++ test-squeue.c (working copy)
>> @@ -116,7 +116,7 @@
>> sq_test_random(sq);
>> t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");
>>
>> - t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
>> + t((a.evt = squeue_add(sq, time(NULL)*2, &a)) != NULL);
>> t(squeue_size(sq) == 1);
>> t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
>> t(squeue_size(sq) == 2);
>>
>> This gives the test result of:
>>
>> ### squeue tests
>> FAIL max <= *d @test-squeue.c:86
>> FAIL x == &b @test-squeue.c:133
>> FAIL x->id == b.id @test-squeue.c:134
>> FAIL x == &c @test-squeue.c:141
>> about to fail pretty fucking hard...
>> ea: 0xbfe065e0; &b: 0xbfe065d8; &c: 0xbfe065d0; ed: 0xbfe065c8; x: 0xbfde9b80
>> FAIL x == &b @test-squeue.c:152
>> FAIL x->id == b.id @test-squeue.c:153
>> FAIL x == &b @test-squeue.c:160
>> FAIL x->id == b.id @test-squeue.c:161
>> FAIL x == &c @test-squeue.c:166
>> FAIL x->id == c.id @test-squeue.c:167
>> Test results: 390637 passed, 10 failed
>>
>> Changing to a factor of 1.1 instead of 2 passes:
>>
>
> I'm not surprised. 1.1 would mean it's still within the unix timeframe.
>
> What's the size of time_t, long and struct timeval on systems where it
> fails?
> What are the sizes on systems where it succeeds?
With the recreation steps, Nagios 4 works fine on RHEL5 32 bit, but fails on RHEL5 64 bit.
sizes.c:
#include <string.h>
#include <stdio.h>
#include <assert.h>
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
#include "pqueue.h"

int main(int argc, char **argv)
{
	struct timeval tv;

	/* sizeof yields a size_t, so print it with %zu rather than %d */
	printf("long = %zu\n", sizeof(long));
	printf("time_t = %zu\n", sizeof(time_t));
	printf("tv = %zu\n", sizeof(tv));
	printf("pqueue_pri_t = %zu\n", sizeof(pqueue_pri_t));
	return 0;
}
RHEL5 32 bit:
long = 4
time_t = 4
tv = 8
pqueue_pri_t = 8
RHEL5 64 bit:
long = 8
time_t = 8
tv = 16
pqueue_pri_t = 8
> Does time_t differ in signedness on them?
Not sure how to check this.
> I think a runtime check based on those sizes should work just fine, and
> also be optimized away so we don't actually have to pay for it, but I'm
> curious to see where it actually goes wrong. If it's before we get to
> see the number in squeue.c we're pretty much fscked, as the only option
> then is a macro which does voodoo-casting so the squeue api sees the
> right number.
>
>> ### squeue tests
>> Test results: 390647 passed, 0 failed
>>
>> This worked in Nagios 3, so we're guessing that the change to use the squeue library for events is probably where this limitation has come in.
>>
>> Any thoughts?
>>
>
> Well, modifying the evt_compute_pri() algorithm to discard
> everything but the 21 least significant bits of the tv->tv_usec
> would allow us to use 43 bits for the seconds value. That would
> land us somewhere in the year 141234 before we run out of seconds.
> It's not a real fix though, since we could live with discarding
> events that are patently absurd, but blocking the entire scheduler
> because we get a bogus date is just plain wrong.
I've changed the code so it now looks like this:
static pqueue_pri_t evt_compute_pri(struct timeval *tv)
{
	pqueue_pri_t ret;

	/* keep weird compilers on 32-bit systems from doing wrong */
	if(sizeof(pqueue_pri_t) < 8) {
		ret = tv->tv_sec;
		ret += !!tv->tv_usec;
	} else {
		ret = (pqueue_pri_t) tv->tv_sec;
		ret <<= 43;
		ret |= (tv->tv_usec & 0x1FFFFF);
	}

	return ret;
}
For the same recreation steps, the event queue is now working properly.
The changes I made to test-squeue.c to change the multiplication factor now work up to a factor of 1,000,000 on a 64 bit system. These tests fail on 32 bit, but that's to be expected since time_t is only 32 bits wide there.
So 43 bits for seconds + 21 bits for usec seem fine.
> Besides, with 43 bits for the seconds we could still get too
> large a number for us to handle and we'd still be back at square 1.
I notice that in pqueue.h, pqueue_pri_t was changed from a double to an unsigned long long:
/*
* Altered for Nagios by Andreas Ericsson <ae at op5.se> with the excplicit
* consent of Volkan Yazici <volkan.yazici at gmail.com>. Many thanks.
* Changed as follows:
*
* - pqueue_pri_t is an unsigned long long instead of a double
* ull comparisons are 107 times faster than double comparisons
* on my 64-bit laptop
*/
Would it be better to leave it as a double, so that all values will work properly, and take the performance hit?
Ton