Specifying the retention period
Rahul Amaram
rahul.amaram at vizury.com
Thu Sep 11 06:38:58 CEST 2014
Ok. I am facing another issue. I have been running bischeck with the
aggregate function for more than a day. I am using the below threshold
function.
<threshold>avg($$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-24],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-168],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-336])</threshold>
and it doesn't seem to work. I am expecting that the first aggregate
value should be available.
Instead if I use the below threshold function (I know this is not
related to aggregate)
avg($$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-168H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-336H])
the threshold is calculated fine, which is just the first value as the
remaining two values are not in cache.
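
A rough way to sanity-check that expectation is the Python sketch below. It assumes one hourly aggregate is written per completed hour and that index n of the /H/avg list refers to the value n hours back; neither is confirmed in this thread. The function name and the 26-hour runtime are illustrative only.

```python
def available_offsets(hours_running, wanted_offsets, retention_hours):
    """Return the wanted offsets that could already hold an hourly aggregate.

    Assumption (not confirmed by bischeck): one aggregate per completed hour,
    index 0 is the most recent, and at most retention_hours entries are kept.
    """
    stored = min(hours_running, retention_hours)
    return [offset for offset in wanted_offsets if offset < stored]


# 26 hours of runtime, retention of 720 hours as in the configuration quoted below
print(available_offsets(26, [24, 168, 336], 720))
```

Under these assumptions only the first of the three offsets can exist after a bit more than a day; the other two need one and two weeks of runtime.
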
How can I debug why aggregate is not working?
Thanks,
Rahul.
On Wednesday 10 September 2014 04:53 PM, Anders Håål wrote:
> Thanks - got the ticket.
> I will update progress on the bug ticket, but it's good that the
> workaround works.
> Anders
>
> On 09/10/2014 01:20 PM, Rahul Amaram wrote:
>> That indeed seems to be the problem. Using count rather than period
>> seems to address the issue. Raised a ticket -
>> http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=259
>>
>> .
>>
>> Thanks,
>> Rahul.
>>
>> On Wednesday 10 September 2014 04:02 PM, Anders Håål wrote:
>>> This looks like a bug. Could you please report it on
>>> http://gforge.ingby.com/gf/project/bischeck/tracker/ in the Bugs
>>> tracker. You need an account, but it's just a sign-up and you get an
>>> email confirmation.
>>> Can you try to use maxcount for purging instead as a workaround? Just
>>> calculate your maxcount based on the scheduling interval you use.
>>> Anders
>>>
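
As a sketch of that calculation in Python: maxcount is the number of samples collected over the period you want to keep. The 120-second interval below corresponds to the 30 values per hour mentioned later in this thread; the function name is made up for illustration.

```python
import math


def maxcount_for(retention_days, interval_seconds):
    """Number of samples collected in retention_days at one sample per interval."""
    seconds = retention_days * 24 * 60 * 60
    return math.ceil(seconds / interval_seconds)


# 30 days of data at one sample every 120 seconds (30 values per hour)
print(maxcount_for(30, 120))
```

The result would go into <purge><maxcount>...</maxcount></purge> in place of the offset/period pair.
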
>>> On 09/10/2014 12:17 PM, Rahul Amaram wrote:
>>>> Following up on the earlier topic, I am seeing the below errors related
>>>> to cache purge. Any idea on what might be causing this? I don't see any
>>>> other errors in log related to metrics.
>>>>
>>>> 2014-09-10 12:12:00.001 ; INFO ; DefaultQuartzScheduler_Worker-5 ; com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge purging 180
>>>> 2014-09-10 12:12:00.003 ; INFO ; DefaultQuartzScheduler_Worker-5 ; com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge executed in 1 ms
>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ; org.quartz.core.JobRunShell ; Job DailyMaintenance.CachePurge threw an unhandled Exception: java.lang.NullPointerException: null
>>>>     at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>>     at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ; org.quartz.core.ErrorLogger ; Job (DailyMaintenance.CachePurge threw an exception.
>>>> org.quartz.SchedulerException: Job threw an unhandled exception.
>>>>     at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
>>>>     at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>>>> Caused by: java.lang.NullPointerException: null
>>>>     at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>>     at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>>
>>>> Here is my cache configuration:
>>>>
>>>> <cache>
>>>> <aggregate>
>>>> <method>avg</method>
>>>> <useweekend>true</useweekend>
>>>> <retention>
>>>> <period>H</period>
>>>> <offset>720</offset>
>>>> </retention>
>>>> <retention>
>>>> <period>D</period>
>>>> <offset>30</offset>
>>>> </retention>
>>>> </aggregate>
>>>>
>>>> <purge>
>>>> <offset>30</offset>
>>>> <period>D</period>
>>>> </purge>
>>>> </cache>
>>>>
>>>> Regards,
>>>> Rahul.
>>>> On Monday 08 September 2014 08:39 PM, Anders Håål wrote:
>>>>> Great if you can make a debian package, and I understand that you can
>>>>> not commit. The best thing would be integrated to our build process
>>>>> where we use ant.
>>>>>
>>>>> If the purging is based on time then it could happen that data is
>>>>> removed from the cache, since the logic is based on time relative to
>>>>> now. To avoid it you should increase the purge time before you start
>>>>> bischeck. And just a comment on your last sentence: Redis TTL is never
>>>>> used :)
>>>>> Anders
>>>>>
>>>>> On 09/08/2014 02:09 PM, Rahul Amaram wrote:
>>>>>> I would be more than happy to give you guys a testimonial.
>>>>>> However, we
>>>>>> have just taken this live and would like to see its performance
>>>>>> before I
>>>>>> give a testimonial.
>>>>>>
>>>>>> Also, if time permits, I'll try to bundle this for Debian (I'm a
>>>>>> Debian
>>>>>> maintainer). I can't commit on a timeline right away though :).
>>>>>>
>>>>>> Also, just to make things explicitly clear. I understand that the below
>>>>>> service item ttl has nothing to do with Redis TTL. But if I stop my
>>>>>> bischeck server for a day or two, then would any of my metrics get lost?
>>>>>> Or would I have to increase the Redis TTL for this?
>>>>>>
>>>>>> Regards,
>>>>>> Rahul.
>>>>>>
>>>>>> On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
>>>>>>> Glad that it clarified how to configure the cache section. I will make
>>>>>>> a blog post on this in the meantime, until we have updated
>>>>>>> documentation. I agree with you that the structure of the
>>>>>>> configuration is a bit "heavy", so ideas and input are appreciated.
>>>>>>>
>>>>>>> Regarding redis ttl, this is a redis feature we do not use. The ttl
>>>>>>> mentioned in my mail is managed by bischeck. Redis ttl does not work
>>>>>>> on individual nodes in a redis linked list.
>>>>>>>
>>>>>>> Currently the bischeck installer should work for ubuntu, redhat/centos
>>>>>>> and debian. There are currently no plans to make distribution packages
>>>>>>> like rpm or deb. I know that op5 (www.op5.com), which bundles Bischeck,
>>>>>>> makes a bischeck rpm. It would be super if there is anyone who would
>>>>>>> like to do this for the project.
>>>>>>> When it comes to packaging we have done a bit of work to create docker
>>>>>>> containers, but it's still experimental.
>>>>>>>
>>>>>>> I also encourage you, if you think bischeck supports your monitoring
>>>>>>> effort, to write a small testimonial that we can put on the site.
>>>>>>> Regards
>>>>>>> Anders
>>>>>>>
>>>>>>> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>>>>>>>> Thanks Anders. This explains precisely why my data was getting purged
>>>>>>>> after 16 hours (30 values per hour * 16 hours = 480). It would be great
>>>>>>>> if you could update the documentation with this info. The entire setup
>>>>>>>> and configuration itself takes time to get a hold on and detailed
>>>>>>>> documentation would be very helpful.
>>>>>>>>
>>>>>>>> Also, another quick question. Right now, I believe the Redis TTL is set
>>>>>>>> to 2000 seconds. Does this mean that if I don't receive data for a
>>>>>>>> particular serviceitem (or service or host) for 2000 seconds, the data
>>>>>>>> related to it is lost?
>>>>>>>>
>>>>>>>> Also, any plans for bundling this with distributions such as Debian?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Rahul.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>>>>>>>> Hi Rahul,
>>>>>>>>> Thanks for the question and feedback on the documentation. Great to
>>>>>>>>> hear that you think Bischeck is awesome. If you do not understand how
>>>>>>>>> it works by reading the documentation you are probably not alone, and
>>>>>>>>> we should consider it a documentation bug.
>>>>>>>>>
>>>>>>>>> In 1.0.0 we introduced the concepts that you are asking about, and
>>>>>>>>> they are really two different independent features.
>>>>>>>>>
>>>>>>>>> Let's start with cache purging.
>>>>>>>>> Collected monitoring data, metrics, are kept in the cache (redis from
>>>>>>>>> 1.0.0) as linked lists. There is one linked list per service
>>>>>>>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all the
>>>>>>>>> linked lists had the same size, which was defined with the property
>>>>>>>>> lastStatusCacheSize. But in 1.0.0 we made that configurable so it
>>>>>>>>> could be defined per service definition.
>>>>>>>>> To enable individual cache configurations we added a section called
>>>>>>>>> <cache> in the serviceitem section of the bischeck.xml. Like many
>>>>>>>>> other configuration options in 1.0.0, the cache section can have
>>>>>>>>> specific values or point to a template that can be shared.
>>>>>>>>> To manage the size of the cache, or to be more specific the linked
>>>>>>>>> list size, we defined the <purge> section. The purge section can have
>>>>>>>>> two different configurations. The first is defining the max size of
>>>>>>>>> the cache linked list.
>>>>>>>>> <cache>
>>>>>>>>> <purge>
>>>>>>>>> <maxcount>1000</maxcount>
>>>>>>>>> </purge>
>>>>>>>>> </cache>
>>>>>>>>>
>>>>>>>>> The second option is to define the “time to live” for the metrics in
>>>>>>>>> the cache.
>>>>>>>>> <cache>
>>>>>>>>> <purge>
>>>>>>>>> <offset>10</offset>
>>>>>>>>> <period>D</period>
>>>>>>>>> </purge>
>>>>>>>>> </cache>
>>>>>>>>> In the above example we set the time to live to 10 days. So any
>>>>>>>>> metrics older than this period will be removed. The period can have
>>>>>>>>> the following values:
>>>>>>>>> H - hours
>>>>>>>>> D - days
>>>>>>>>> W - weeks
>>>>>>>>> Y - years
>>>>>>>>>
>>>>>>>>> The two options are mutually exclusive. You have to choose one for
>>>>>>>>> each serviceitem or cache template.
>>>>>>>>>
>>>>>>>>> If no cache directive is defined for a serviceitem, the property
>>>>>>>>> lastStatusCacheSize will be used. Its default value is 500.
>>>>>>>>>
>>>>>>>>> Hopefully this explains the cache purging.
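
To illustrate the difference between the two purge modes, here is a small Python model. It is not bischeck's implementation; the function names and the sample data are invented, and the Python list simply stands in for the Redis linked list.

```python
from datetime import datetime, timedelta


def purge_by_maxcount(entries, maxcount):
    """Keep only the newest maxcount entries; entries are ordered newest first."""
    return entries[:maxcount]


def purge_by_age(entries, now, max_age):
    """Keep only entries whose timestamp is within max_age of now."""
    return [(ts, value) for ts, value in entries if now - ts <= max_age]


now = datetime(2014, 9, 8, 12, 0)
# One metric every 2 minutes, newest first, covering 20 hours.
entries = [(now - timedelta(minutes=2 * i), float(i)) for i in range(600)]

print(len(purge_by_maxcount(entries, 500)))
print(len(purge_by_age(entries, now, timedelta(hours=10))))
```

The count-based mode does not depend on the clock, while the age-based mode is evaluated relative to the current time, so entries can age out even when no new data arrives.
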
>>>>>>>>>
>>>>>>>>> The next question was related to aggregations, which have nothing to do
>>>>>>>>> with purging, but are configured in the same <cache> section. The
>>>>>>>>> idea with aggregations was to create an automatic way to aggregate
>>>>>>>>> metrics on the level of an hour, day, week and month. The aggregation
>>>>>>>>> functions currently supported are average, max and min.
>>>>>>>>> Let's say you have a service definition of the format
>>>>>>>>> host1-service1-serviceitem1. When you enable an average (avg)
>>>>>>>>> aggregation you will automatically get the following new service
>>>>>>>>> definitions:
>>>>>>>>> host1-service1/H/avg-serviceitem1
>>>>>>>>> host1-service1/D/avg-serviceitem1
>>>>>>>>> host1-service1/W/avg-serviceitem1
>>>>>>>>> host1-service1/M/avg-serviceitem1
>>>>>>>>>
>>>>>>>>> The configuration you need to achieve the above average
>>>>>>>>> aggregations is:
>>>>>>>>> <cache>
>>>>>>>>> <aggregate>
>>>>>>>>> <method>avg</method>
>>>>>>>>> </aggregate>
>>>>>>>>> </cache>
>>>>>>>>>
>>>>>>>>> If you would like to combine it with the purging described above,
>>>>>>>>> your configuration would look like:
>>>>>>>>> <cache>
>>>>>>>>> <aggregate>
>>>>>>>>> <method>avg</method>
>>>>>>>>> </aggregate>
>>>>>>>>>
>>>>>>>>> <purge>
>>>>>>>>> <offset>10</offset>
>>>>>>>>> <period>D</period>
>>>>>>>>> </purge>
>>>>>>>>> </cache>
>>>>>>>>>
>>>>>>>>> The new aggregated service definitions,
>>>>>>>>> host1-service1/H/avg-serviceitem1, etc, will have their own cache
>>>>>>>>> entries and can be used in threshold configurations and virtual
>>>>>>>>> services like any other service definitions. For example in a
>>>>>>>>> threshold hours section we could define
>>>>>>>>>
>>>>>>>>> <hours hoursID="2">
>>>>>>>>>
>>>>>>>>> <hourinterval>
>>>>>>>>> <from>09:00</from>
>>>>>>>>> <to>12:00</to>
>>>>>>>>> <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>>>>>>> </hourinterval>
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> This would mean that we use the average value for
>>>>>>>>> host1-service1-serviceitem1 for the period of the last hour.
>>>>>>>>> Aggregations are calculated hourly, daily, weekly and monthly.
>>>>>>>>>
>>>>>>>>> By default, weekend metrics are not included in the aggregation
>>>>>>>>> calculation. This can be enabled by setting
>>>>>>>>> <useweekend>true</useweekend>:
>>>>>>>>>
>>>>>>>>> <cache>
>>>>>>>>> <aggregate>
>>>>>>>>> <method>avg</method>
>>>>>>>>> <useweekend>true</useweekend>
>>>>>>>>> </aggregate>
>>>>>>>>> ….
>>>>>>>>> </cache>
>>>>>>>>>
>>>>>>>>> This will create aggregated service definitions with the following
>>>>>>>>> naming standard:
>>>>>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>>>>>> host1-service1/D/avg/weekend-serviceitem1
>>>>>>>>> host1-service1/W/avg/weekend-serviceitem1
>>>>>>>>> host1-service1/M/avg/weekend-serviceitem1
>>>>>>>>>
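
The two naming schemes listed above can be summarised in a short Python helper. This is only a sketch derived from the example names in this mail; the function and its parameters are hypothetical and not part of bischeck.

```python
def aggregate_names(host, service, serviceitem, method, useweekend=False):
    """Build the aggregated service definition names for H, D, W and M."""
    # Per the examples above, "/weekend" is appended when useweekend is true.
    suffix = "/weekend" if useweekend else ""
    return [
        f"{host}-{service}/{period}/{method}{suffix}-{serviceitem}"
        for period in ("H", "D", "W", "M")
    ]


for name in aggregate_names("host1", "service1", "serviceitem1", "avg", useweekend=True):
    print(name)
```

Note that with <useweekend>true</useweekend>, as in the configuration quoted further up in this thread, the names would carry the /weekend part according to this scheme, so a threshold expression would have to reference that form.
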
>>>>>>>>> You can also have multiple entries like:
>>>>>>>>> <cache>
>>>>>>>>> <aggregate>
>>>>>>>>> <method>avg</method>
>>>>>>>>> <useweekend>true</useweekend>
>>>>>>>>> </aggregate>
>>>>>>>>> <aggregate>
>>>>>>>>> <method>max</method>
>>>>>>>>> </aggregate>
>>>>>>>>> ….
>>>>>>>>> </cache>
>>>>>>>>>
>>>>>>>>> So how long will the aggregated values be kept in the cache? By
>>>>>>>>> default we save
>>>>>>>>> Hourly aggregations for 25 hours
>>>>>>>>> Daily aggregations for 7 days
>>>>>>>>> Weekly aggregations for 5 weeks
>>>>>>>>> Monthly aggregations for 1 month
>>>>>>>>>
>>>>>>>>> These values can be overridden, but they cannot be lower than the
>>>>>>>>> default. Below you have an example where we save the aggregations for
>>>>>>>>> 168 hours, 60 days and 53 weeks.
>>>>>>>>> <cache>
>>>>>>>>> <aggregate>
>>>>>>>>> <method>avg</method>
>>>>>>>>> <useweekend>true</useweekend>
>>>>>>>>> <retention>
>>>>>>>>> <period>H</period>
>>>>>>>>> <offset>168</offset>
>>>>>>>>> </retention>
>>>>>>>>> <retention>
>>>>>>>>> <period>D</period>
>>>>>>>>> <offset>60</offset>
>>>>>>>>> </retention>
>>>>>>>>> <retention>
>>>>>>>>> <period>W</period>
>>>>>>>>> <offset>53</offset>
>>>>>>>>> </retention>
>>>>>>>>> </aggregate>
>>>>>>>>> ….
>>>>>>>>> </cache>
>>>>>>>>>
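
One plausible reading of "cannot be lower than the default" is sketched below in Python. How bischeck actually reacts to a value below the default (clamping, rejecting or ignoring it) is not stated in this thread; clamping is an assumption made here, and the names are invented for illustration.

```python
# Default retention per aggregation period, as listed above.
DEFAULT_RETENTION = {"H": 25, "D": 7, "W": 5, "M": 1}


def effective_retention(configured):
    """Merge configured offsets with the defaults, never going below a default."""
    return {
        period: max(configured.get(period, default), default)
        for period, default in DEFAULT_RETENTION.items()
    }


# The example configuration above: 168 hours, 60 days and 53 weeks.
print(effective_retention({"H": 168, "D": 60, "W": 53}))
```

Periods that are not configured, such as M in the example, fall back to their default in this sketch.
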
>>>>>>>>> I hope this makes it a bit less confusing. What is clear to me is
>>>>>>>>> that
>>>>>>>>> we need to improve the documentation in this area.
>>>>>>>>>
>>>>>>>>> Looking forward to your feedback.
>>>>>>>>> Anders
>>>>>>>>>
>>>>>>>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> I am trying to set up the bischeck plugin for our organization. I have
>>>>>>>>>> configured most of it except for the cache retention period. Here
>>>>>>>>>> is what I want - I want to store every value which has been generated
>>>>>>>>>> during the past 1 month. The reason being my threshold is currently
>>>>>>>>>> calculated as the average of the metric value during the past 4 weeks at
>>>>>>>>>> the same time of the day.
>>>>>>>>>>
>>>>>>>>>> So, how do I define the cache template for this? If I don't define any
>>>>>>>>>> cache template, for how many days is the data kept?
>>>>>>>>>> Also, how does the aggregate function work and what does the purge
>>>>>>>>>> Maxitems signify?
>>>>>>>>>>
>>>>>>>>>> I've gone through the documentation but it wasn't clear. Looking
>>>>>>>>>> forward
>>>>>>>>>> to a response.
>>>>>>>>>>
>>>>>>>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Rahul.
--