Specifying the retention period

Rahul Amaram rahul.amaram at vizury.com
Thu Sep 11 07:15:46 CEST 2014

Also, let us say, that the current time is 2.30 and that I want the 
average of all the values between 2.00 and 3.00 the previous day, I'd 
probably have to use


rather than


Am I right?


On Thursday 11 September 2014 10:39 AM, Rahul Amaram wrote:
> Ok. So would 
> $$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[24] refer to 
> the average of all the values ONLY in the 24th hour before the 
> current time?
> On Thursday 11 September 2014 10:30 AM, Anders Håål wrote:
>> Hi Amaram,
>> I think you just need to remove the minus sign when using the 
>> aggregated values. A minus sign means time, as in back in time, while 
>> a plain integer without a minus sign or a time indicator is an index. Check out 
>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_configuration_guide.html#toc-Chapter-4. 
>> You can also use redis-cli to explore the data in the cache. The key 
>> in the redis is the same as the service definition.
>> Anders
>> On 09/11/2014 06:38 AM, Rahul Amaram wrote:
>>> Ok. I am facing another issue. I have been running bischeck with the 
>>> aggregate function for more than a day. I am using the below 
>>> threshold function.
>>> <threshold>avg($$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-24],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-168],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-336])</threshold> 
>>> and it doesn't seem to work. I am expecting that the first aggregate 
>>> value should be available.
>>> Instead if I use the below threshold function (I know this is not 
>>> related to aggregate)
>>> the threshold is calculated fine, which is just the first value as 
>>> the remaining two values are not in cache.
>>> How can I debug why aggregate is not working?
>>> Thanks,
>>> Rahul.
>>> On Wednesday 10 September 2014 04:53 PM, Anders Håål wrote:
>>>> Thanks - got the ticket.
>>>> I will update progress on the bug ticket, but it's good that the 
>>>> workaround works.
>>>> Anders
>>>> On 09/10/2014 01:20 PM, Rahul Amaram wrote:
>>>>> That indeed seems to be the problem. Using count rather than period
>>>>> seems to address the issue. Raised a ticket -
>>>>> http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=259 
>>>>> .
>>>>> Thanks,
>>>>> Rahul.
>>>>> On Wednesday 10 September 2014 04:02 PM, Anders Håål wrote:
>>>>>> This looks like a bug. Could you please report it on
>>>>>> http://gforge.ingby.com/gf/project/bischeck/tracker/ in the Bugs
>>>>>> tracker. You need an account, but it's just a sign-up and you get an
>>>>>> email confirmation.
>>>>>> Can you try to use maxcount for purging instead as a workaround? Just
>>>>>> calculate your maxcount based on the scheduling interval you use.
>>>>>> Anders
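
The maxcount workaround comes down to simple arithmetic: the number of samples collected per day times the number of days to keep. A minimal Python sketch, assuming a fixed scheduling interval; the function name maxcount_for and its parameters are made up for illustration and are not part of bischeck:

```python
def maxcount_for(interval_seconds: int, retention_days: int) -> int:
    """Return how many samples a fixed-interval schedule produces in retention_days."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    samples_per_day = (24 * 60 * 60) // interval_seconds
    return samples_per_day * retention_days


# One sample every 120 seconds is 30 per hour, so 30 days needs 21600 entries.
print(maxcount_for(120, 30))  # 21600
```

At the same 30 values per hour, a list capped at 480 entries covers only 16 hours, which matches the purge behaviour reported elsewhere in this thread.
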
>>>>>> On 09/10/2014 12:17 PM, Rahul Amaram wrote:
>>>>>>> Following up on the earlier topic, I am seeing the below errors
>>>>>>> related to cache purge. Any idea on what might be causing this? I
>>>>>>> don't see any other errors in the log related to metrics.
>>>>>>> 2014-09-10 12:12:00.001 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>>>>>>> purging 180
>>>>>>> 2014-09-10 12:12:00.003 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>>>>>>> executed in 1 ms
>>>>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>>>>>> org.quartz.core.JobRunShell ; Job DailyMaintenance.CachePurge 
>>>>>>> threw an
>>>>>>> unhandled Exception: java.lang.NullPointerException: null
>>>>>>>          at
>>>>>>> com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250) 
>>>>>>>          at
>>>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140) 
>>>>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>>>>>> org.quartz.core.ErrorLogger ; Job (DailyMaintenance.CachePurge 
>>>>>>> threw an
>>>>>>> exception.org.quartz.SchedulerException: Job threw an unhandled
>>>>>>> exception.
>>>>>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
>>>>>>>          at
>>>>>>> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) 
>>>>>>> Caused by: java.lang.NullPointerException: null
>>>>>>>          at
>>>>>>> com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250) 
>>>>>>>          at
>>>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140) 
>>>>>>> Here is my cache configuration:
>>>>>>>      <cache>
>>>>>>>        <aggregate>
>>>>>>>          <method>avg</method>
>>>>>>>          <useweekend>true</useweekend>
>>>>>>>          <retention>
>>>>>>>            <period>H</period>
>>>>>>>            <offset>720</offset>
>>>>>>>          </retention>
>>>>>>>          <retention>
>>>>>>>            <period>D</period>
>>>>>>>            <offset>30</offset>
>>>>>>>          </retention>
>>>>>>>        </aggregate>
>>>>>>>        <purge>
>>>>>>>          <offset>30</offset>
>>>>>>>          <period>D</period>
>>>>>>>        </purge>
>>>>>>>      </cache>
>>>>>>> Regards,
>>>>>>> Rahul.
>>>>>>> On Monday 08 September 2014 08:39 PM, Anders Håål wrote:
>>>>>>>> Great if you can make a debian package, and I understand that you
>>>>>>>> cannot commit. The best thing would be to integrate it into our
>>>>>>>> build process, where we use ant.
>>>>>>>> If the purging is based on time then it could happen that data is
>>>>>>>> removed from the cache, since the logic is based on time relative
>>>>>>>> to now. To avoid it you should increase the purge time before you
>>>>>>>> start bischeck. And just a comment on your last sentence: Redis TTL
>>>>>>>> is never used :)
>>>>>>>> Anders
>>>>>>>> On 09/08/2014 02:09 PM, Rahul Amaram wrote:
>>>>>>>>> I would be more than happy to give you guys a testimonial.
>>>>>>>>> However, we have just taken this live and would like to see its
>>>>>>>>> performance before I give a testimonial.
>>>>>>>>> Also, if time permits, I'll try to bundle this for Debian (I'm a
>>>>>>>>> Debian maintainer). I can't commit to a timeline right away
>>>>>>>>> though :).
>>>>>>>>> Also, just to make things explicitly clear: I understand that the
>>>>>>>>> below service item ttl has nothing to do with Redis TTL. But if I
>>>>>>>>> stop my bischeck server for a day or two, would any of my metrics
>>>>>>>>> get lost? Or would I have to increase the Redis TTL for this?
>>>>>>>>> Regards,
>>>>>>>>> Rahul.
>>>>>>>>> On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
>>>>>>>>>> Glad that it clarified how to configure the cache section. I will
>>>>>>>>>> make a blog post on this in the meantime, until we have updated
>>>>>>>>>> documentation. I agree with you that the structure of the
>>>>>>>>>> configuration is a bit "heavy", so ideas and input are
>>>>>>>>>> appreciated.
>>>>>>>>>> Regarding redis ttl, this is a redis feature we do not use.
>>>>>>>>>> The ttl mentioned in my mail is managed by bischeck. Redis ttl
>>>>>>>>>> does not work on individual nodes in a redis linked list.
>>>>>>>>>> Currently the bischeck installer should work for ubuntu,
>>>>>>>>>> redhat/centos and debian. There are currently no plans to make
>>>>>>>>>> distribution packages like rpm or deb. I know that op5
>>>>>>>>>> (www.op5.com), which bundles Bischeck, makes a bischeck rpm. It
>>>>>>>>>> would be super if there is anyone who would like to do this for
>>>>>>>>>> the project.
>>>>>>>>>> When it comes to packaging we have done a bit of work to create
>>>>>>>>>> docker containers, but it's still experimental.
>>>>>>>>>> I also encourage you, if you think bischeck supports your
>>>>>>>>>> monitoring effort, to write a small testimonial that we can put
>>>>>>>>>> on the site.
>>>>>>>>>> Regards
>>>>>>>>>> Anders
>>>>>>>>>> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>>>>>>>>>>> Thanks Anders. This explains precisely why my data was getting
>>>>>>>>>>> purged after 16 hours (30 values per hour * 16 hours = 480). It
>>>>>>>>>>> would be great if you could update the documentation with this
>>>>>>>>>>> info. The entire setup and configuration itself takes time to
>>>>>>>>>>> get a hold on and detailed documentation would be very helpful.
>>>>>>>>>>> Also, another quick question. Right now, I believe the Redis TTL
>>>>>>>>>>> is set to 2000 seconds. Does this mean that if I don't receive
>>>>>>>>>>> data for a particular serviceitem (or service or host) for 2000
>>>>>>>>>>> seconds, the data related to it is lost?
>>>>>>>>>>> Also, any plans for bundling this with distributions such as 
>>>>>>>>>>> Debian?
>>>>>>>>>>> Regards,
>>>>>>>>>>> Rahul.
>>>>>>>>>>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>>>>>>>>>>> Hi Rahul,
>>>>>>>>>>>> Thanks for the question and feedback on the documentation. 
>>>>>>>>>>>> Great to
>>>>>>>>>>>> hear that you think Bischeck is awesome. If you do not
>>>>>>>>>>>> understand how
>>>>>>>>>>>> it works by reading the documentation you are probably not
>>>>>>>>>>>> alone, and
>>>>>>>>>>>> we should consider it a documentation bug.
>>>>>>>>>>>> In 1.0.0 we introduced the concept that you are asking about, and
>>>>>>>>>>>> it is really two different, independent features.
>>>>>>>>>>>> Let's start with cache purging.
>>>>>>>>>>>> Collected monitoring data, metrics, are kept in the cache (redis
>>>>>>>>>>>> from 1.0.0) as linked lists. There is one linked list per service
>>>>>>>>>>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all
>>>>>>>>>>>> the linked lists had the same size, which was defined with the
>>>>>>>>>>>> property lastStatusCacheSize. But in 1.0.0 we made that
>>>>>>>>>>>> configurable so it could be defined per service definition.
>>>>>>>>>>>> To enable individual cache configurations we added a section
>>>>>>>>>>>> called <cache> in the serviceitem section of the bischeck.xml.
>>>>>>>>>>>> Like many other configuration options in 1.0.0, the cache section
>>>>>>>>>>>> can have specific values or point to a template that can be
>>>>>>>>>>>> shared.
>>>>>>>>>>>> To manage the size of the cache, or to be more specific the
>>>>>>>>>>>> linked list size, we defined the <purge> section. The purge
>>>>>>>>>>>> section can have two different configurations. The first is
>>>>>>>>>>>> defining the max size of the cache linked list.
>>>>>>>>>>>> <cache>
>>>>>>>>>>>>   <purge>
>>>>>>>>>>>>    <maxcount>1000</maxcount>
>>>>>>>>>>>>   </purge>
>>>>>>>>>>>> </cache>
>>>>>>>>>>>> The second option is to define the “time to live” for the
>>>>>>>>>>>> metrics in the cache.
>>>>>>>>>>>> <cache>
>>>>>>>>>>>>   <purge>
>>>>>>>>>>>>    <offset>10</offset>
>>>>>>>>>>>>    <period>D</period>
>>>>>>>>>>>>   </purge>
>>>>>>>>>>>> </cache>
>>>>>>>>>>>> In the above example we set the time to live to 10 days, so any
>>>>>>>>>>>> metrics older than this period will be removed. The period can
>>>>>>>>>>>> have the following values:
>>>>>>>>>>>> H - hours
>>>>>>>>>>>> D - days
>>>>>>>>>>>> W - weeks
>>>>>>>>>>>> Y - year
>>>>>>>>>>>> The two options are mutually exclusive. You have to choose one
>>>>>>>>>>>> for each serviceitem or cache template.
>>>>>>>>>>>> If no cache directive is defined for a serviceitem, the property
>>>>>>>>>>>> lastStatusCacheSize will be used. Its default value is 500.
>>>>>>>>>>>> Hopefully this explains the cache purging.
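
To make the time-to-live variant concrete, here is a rough Python model of "remove anything older than offset * period". It is only an illustration of the rule as described, not bischeck's implementation; the names PERIOD_HOURS, purge_cutoff and keep_recent are hypothetical, and treating a year as 365 days is an assumption:

```python
from datetime import datetime, timedelta

# Hours per purge period code; treating a year as 365 days is an assumption.
PERIOD_HOURS = {"H": 1, "D": 24, "W": 24 * 7, "Y": 24 * 365}


def purge_cutoff(now: datetime, offset: int, period: str) -> datetime:
    """Return the timestamp before which metrics would count as expired."""
    return now - timedelta(hours=offset * PERIOD_HOURS[period])


def keep_recent(samples, now, offset, period):
    """Keep only (timestamp, value) pairs that are not older than the cutoff."""
    cutoff = purge_cutoff(now, offset, period)
    return [(ts, value) for ts, value in samples if ts >= cutoff]


now = datetime(2014, 9, 11, 12, 0)
samples = [
    (datetime(2014, 8, 31, 12, 0), 1.0),  # 11 days old, dropped
    (datetime(2014, 9, 2, 12, 0), 2.0),   # 9 days old, kept
]
print(keep_recent(samples, now, 10, "D"))
```

Because the cutoff is relative to "now", data can age out while the collector is stopped, whereas a maxcount limit depends only on how many entries are in the list.
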
>>>>>>>>>>>> The next question was related to aggregations, which have
>>>>>>>>>>>> nothing to do with purging but are configured in the same
>>>>>>>>>>>> <cache> section. The idea with aggregations was to create an
>>>>>>>>>>>> automatic way to aggregate metrics on the level of an hour, day,
>>>>>>>>>>>> week and month. The aggregation functions currently supported
>>>>>>>>>>>> are average, max and min.
>>>>>>>>>>>> Let's say you have a service definition of the format
>>>>>>>>>>>> host1-service1-serviceitem1. When you enable an average (avg)
>>>>>>>>>>>> aggregation you will automatically get the following new
>>>>>>>>>>>> service definitions:
>>>>>>>>>>>> host1-service1/H/avg-serviceitem1
>>>>>>>>>>>> host1-service1/D/avg-serviceitem1
>>>>>>>>>>>> host1-service1/W/avg-serviceitem1
>>>>>>>>>>>> host1-service1/M/avg-serviceitem1
>>>>>>>>>>>> The configuration you need to achieve the above average
>>>>>>>>>>>> aggregations is:
>>>>>>>>>>>> <cache>
>>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>> </cache>
>>>>>>>>>>>> If you would like to combine it with the purging described
>>>>>>>>>>>> above, your configuration would look like:
>>>>>>>>>>>> <cache>
>>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>>   <purge>
>>>>>>>>>>>>    <offset>10</offset>
>>>>>>>>>>>>    <period>D</period>
>>>>>>>>>>>>   </purge>
>>>>>>>>>>>> </cache>
>>>>>>>>>>>> The new aggregated service definitions,
>>>>>>>>>>>> host1-service1/H/avg-serviceitem1, etc, will have their own 
>>>>>>>>>>>> cache
>>>>>>>>>>>> entries and can be used in threshold configurations and 
>>>>>>>>>>>> virtual
>>>>>>>>>>>> services like any other service definitions. For example in a
>>>>>>>>>>>> threshold hours section we could define
>>>>>>>>>>>> <hours hoursID="2">
>>>>>>>>>>>>   <hourinterval>
>>>>>>>>>>>>     <from>09:00</from>
>>>>>>>>>>>>     <to>12:00</to>
>>>>>>>>>>>>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>>>>>>>>>>   </hourinterval>
>>>>>>>>>>>>   ...
>>>>>>>>>>>> This would mean that we use the average value for
>>>>>>>>>>>> host1-service1-serviceitem1 for the period of the last hour.
>>>>>>>>>>>> Aggregations are calculated hourly, daily, weekly and monthly.
>>>>>>>>>>>> By default, weekend metrics are not included in the aggregation
>>>>>>>>>>>> calculation. They can be included by setting
>>>>>>>>>>>> <useweekend>true</useweekend>:
>>>>>>>>>>>> <cache>
>>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>>   ….
>>>>>>>>>>>> </cache>
>>>>>>>>>>>> This will create aggregated service definitions with the
>>>>>>>>>>>> following naming standard:
>>>>>>>>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>>>>>>>>> host1-service1/D/avg/weekend-serviceitem1
>>>>>>>>>>>> host1-service1/W/avg/weekend-serviceitem1
>>>>>>>>>>>> host1-service1/M/avg/weekend-serviceitem1
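
The two naming patterns listed in this message can be generated mechanically. A small Python sketch, assuming the names are built exactly as in the lists here; the helper aggregate_keys is hypothetical and not a bischeck API:

```python
def aggregate_keys(host, service, serviceitem, method, useweekend=False):
    """Build the hourly, daily, weekly and monthly aggregate names for one serviceitem."""
    # With useweekend enabled the listed names carry an extra "/weekend" segment.
    suffix = f"{method}/weekend" if useweekend else method
    return [f"{host}-{service}/{p}/{suffix}-{serviceitem}" for p in ("H", "D", "W", "M")]


for key in aggregate_keys("host1", "service1", "serviceitem1", "avg", useweekend=True):
    print(key)
```

Going by these lists, a configuration with <useweekend>true</useweekend> produces names with the /weekend segment, so a threshold expression would need to reference that form rather than the plain /H/avg- one.
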
>>>>>>>>>>>> You can also have multiple entries like:
>>>>>>>>>>>> <cache>
>>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>>     <method>max</method>
>>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>>   ….
>>>>>>>>>>>> </cache>
>>>>>>>>>>>> So how long will the aggregated values be kept in the cache? By
>>>>>>>>>>>> default we save
>>>>>>>>>>>> Hour aggregation for 25 hours
>>>>>>>>>>>> Daily aggregations for 7 days
>>>>>>>>>>>> Weekly aggregations for 5 weeks
>>>>>>>>>>>> Monthly aggregations for 1 month
>>>>>>>>>>>> These values can be overridden, but they cannot be lower than the
>>>>>>>>>>>> default. Below you have an example where we save the aggregations
>>>>>>>>>>>> for 168 hours, 60 days and 53 weeks.
>>>>>>>>>>>> <cache>
>>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>>>     <retention>
>>>>>>>>>>>>       <period>H</period>
>>>>>>>>>>>>       <offset>168</offset>
>>>>>>>>>>>>     </retention>
>>>>>>>>>>>>     <retention>
>>>>>>>>>>>>       <period>D</period>
>>>>>>>>>>>>       <offset>60</offset>
>>>>>>>>>>>>     </retention>
>>>>>>>>>>>>     <retention>
>>>>>>>>>>>>       <period>W</period>
>>>>>>>>>>>>       <offset>53</offset>
>>>>>>>>>>>>     </retention>
>>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>>   ….
>>>>>>>>>>>> </cache>
>>>>>>>>>>>> I hope this makes it a bit less confusing. What is clear to 
>>>>>>>>>>>> me is
>>>>>>>>>>>> that
>>>>>>>>>>>> we need to improve the documentation in this area.
>>>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>>>> Anders
>>>>>>>>>>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> I am trying to set up the bischeck plugin for our organization. I
>>>>>>>>>>>>> have configured most of it except for the cache retention
>>>>>>>>>>>>> period. Here
>>>>>>>>>>>>> is what I want - I want to store every value which has been
>>>>>>>>>>>>> generated
>>>>>>>>>>>>> during the past 1 month. The reason being my threshold is
>>>>>>>>>>>>> currently
>>>>>>>>>>>>> calculated as the average of the metric value during the 
>>>>>>>>>>>>> past 4
>>>>>>>>>>>>> weeks at
>>>>>>>>>>>>> the same time of the day.
>>>>>>>>>>>>> So, how do I define the cache template for this? If I don't
>>>>>>>>>>>>> define any
>>>>>>>>>>>>> cache template, for how many days is the data kept?
>>>>>>>>>>>>> Also, how does the aggregate function work and what does the
>>>>>>>>>>>>> purge Maxitems signify?
>>>>>>>>>>>>> I've gone through the documentation but it wasn't clear. 
>>>>>>>>>>>>> Looking
>>>>>>>>>>>>> forward
>>>>>>>>>>>>> to a response.
>>>>>>>>>>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Rahul.

