Specifying the retention period
Rahul Amaram
rahul.amaram at vizury.com
Mon Sep 8 14:09:09 CEST 2014
I would be more than happy to give you guys a testimonial. However, we
have just taken this live and would like to see its performance before I
give a testimonial.
Also, if time permits, I'll try to package this for Debian (I'm a Debian
maintainer). I can't commit to a timeline right away though :).
Also, just to make things explicitly clear: I understand that the
serviceitem TTL described below has nothing to do with the Redis TTL. But
if I stop my bischeck server for a day or two, would any of my metrics
get lost? Or would I have to increase the Redis TTL to avoid that?
Regards,
Rahul.
On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
> Glad that it clarified how to configure the cache section. I will write
> a blog post on this in the meantime, until we have updated
> documentation. I agree with you that the structure of the
> configuration is a bit "heavy", so ideas and input are appreciated.
>
> Regarding Redis TTL, this is a Redis feature we do not use. The ttl
> mentioned in my mail is managed by bischeck. A Redis TTL applies to a
> whole key, so it cannot expire individual nodes in a Redis linked list.
>
> Currently the bischeck installer should work for ubuntu, redhat/centos
> and debian. There are currently no plans to make distribution packages
> like rpm or deb. I know that op5 (www.op5.com), which bundles Bischeck,
> makes a bischeck rpm. It would be super if anyone would like to do
> this for the project.
> When it comes to packaging we have done a bit of work to create docker
> containers, but it's still experimental.
>
> I also encourage you, if you think bischeck supports your monitoring
> effort, to write a short testimonial that we can put on the site.
> Regards
> Anders
>
> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>> Thanks Anders. This explains precisely why my data was getting purged
>> after 16 hours (30 values per hour * 16 hours = 480). It would be great
>> if you could update the documentation with this info. The entire setup
>> and configuration takes time to get a handle on, and detailed
>> documentation would be very helpful.
>>
>> Also, another quick question. Right now, I believe the Redis TTL is set
>> to 2000 seconds. Does this mean that if I don't receive data for a
>> particular serviceitem (or service or host) for 2000 seconds, the data
>> related to it is lost?
>>
>> Also, any plans for bundling this with distributions such as Debian?
>>
>> Regards,
>> Rahul.
>>
>>
>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>> Hi Rahul,
>>> Thanks for the question and feedback on the documentation. Great to
>>> hear that you think Bischeck is awesome. If you do not understand how
>>> it works by reading the documentation you are probably not alone, and
>>> we should consider it a documentation bug.
>>>
>>> In 1.0.0 we introduced the concepts that you are asking about, and they
>>> are really two different, independent features.
>>>
>>> Let's start with cache purging.
>>> Collected monitoring data, metrics, are kept in the cache (redis from
>>> 1.0.0) as linked lists. There is one linked list per service
>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all the
>>> linked lists had the same size, which was defined by the property
>>> lastStatusCacheSize. In 1.0.0 we made that configurable so it
>>> can be defined per service definition.
>>> To enable individual cache configurations we added a section called
>>> <cache> in the serviceitem section of the bischeck.xml. Like many
>>> other configuration options in 1.0.0, the cache section can either
>>> hold specific values or point to a template that can be shared.
>>> To manage the size of the cache, or to be more specific the linked
>>> list size, we defined the <purge> section. The purge section can have
>>> two different configurations. The first defines the max size of
>>> the cache linked list.
>>> <cache>
>>>   <purge>
>>>     <maxcount>1000</maxcount>
>>>   </purge>
>>> </cache>
>>>
>>> The second option is to define the “time to live” for the metrics in
>>> the cache.
>>> <cache>
>>>   <purge>
>>>     <offset>10</offset>
>>>     <period>D</period>
>>>   </purge>
>>> </cache>
>>> In the above example we set the time to live to 10 days, so any
>>> metrics older than this period will be removed. The period can have
>>> the following values:
>>> H - hours
>>> D - days
>>> W - weeks
>>> Y - years
>>>
>>> The two options are mutually exclusive. You have to choose one for each
>>> serviceitem or cache template.
>>>
>>> If no cache directive is defined for a serviceitem, the property
>>> lastStatusCacheSize will be used. Its default value is 500.
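>>>
>>> As a sketch for the use case in the original question (thresholds
>>> based on the same time of day over the past 4 weeks), a time-based
>>> purge of 5 weeks should keep enough history with some margin. The
>>> 5-week figure is an assumption for illustration, not a recommended
>>> value:
>>> <cache>
>>>   <purge>
>>>     <offset>5</offset>
>>>     <period>W</period>
>>>   </purge>
>>> </cache>
>>> A count-based alternative would need maxcount sized from the
>>> collection rate. Assuming, for example, a schedule that collects 30
>>> values per hour, one month needs about 30 * 24 * 31 = 22320 entries,
>>> while the default of 500 covers under 17 hours (500 / 30).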
>>>
>>> Hopefully this explains the cache purging.
>>>
>>> The next question was related to aggregations, which have nothing to do
>>> with purging but are configured in the same <cache> section. The
>>> idea with aggregations was to create an automatic way to aggregate
>>> metrics on the level of an hour, day, week and month. The aggregation
>>> functions currently supported are average, max and min.
>>> Let's say you have a service definition of the format
>>> host1-service1-serviceitem1. When you enable an average (avg)
>>> aggregation you will automatically get the following new service
>>> definitions
>>> host1-service1/H/avg-serviceitem1
>>> host1-service1/D/avg-serviceitem1
>>> host1-service1/W/avg-serviceitem1
>>> host1-service1/M/avg-serviceitem1
>>>
>>> The configuration you need to achieve the above average aggregations is:
>>> <cache>
>>>   <aggregate>
>>>     <method>avg</method>
>>>   </aggregate>
>>> </cache>
>>>
>>> If you would like to combine it with the purging described above, your
>>> configuration would look like:
>>> <cache>
>>>   <aggregate>
>>>     <method>avg</method>
>>>   </aggregate>
>>>
>>>   <purge>
>>>     <offset>10</offset>
>>>     <period>D</period>
>>>   </purge>
>>> </cache>
>>>
>>> The new aggregated service definitions,
>>> host1-service1/H/avg-serviceitem1, etc, will have their own cache
>>> entries and can be used in threshold configurations and virtual
>>> services like any other service definitions. For example in a
>>> threshold hours section we could define
>>>
>>> <hours hoursID="2">
>>>
>>>   <hourinterval>
>>>     <from>09:00</from>
>>>     <to>12:00</to>
>>>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>   </hourinterval>
>>>   ...
>>>
>>> This would mean that we use the average value for
>>> host1-service1-serviceitem1 for the period of the last hour.
>>> Aggregations are calculated hourly, daily, weekly and monthly.
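>>>
>>> The same pattern should apply to the other aggregation levels. As a
>>> sketch, assuming index [0] also addresses the most recent entry of
>>> the daily list, an interval based on the latest daily average could
>>> be written as follows (the from/to times are only illustrative):
>>> <hourinterval>
>>>   <from>12:00</from>
>>>   <to>18:00</to>
>>>   <threshold>host1-service1/D/avg-serviceitem1[0]*0.8</threshold>
>>> </hourinterval>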
>>>
>>> By default weekend metrics are not included in the aggregation
>>> calculation. They can be included by setting
>>> <useweekend>true</useweekend>:
>>>
>>> <cache>
>>>   <aggregate>
>>>     <method>avg</method>
>>>     <useweekend>true</useweekend>
>>>   </aggregate>
>>>   ….
>>> </cache>
>>>
>>> This will create aggregated service definitions with the following
>>> naming convention:
>>> host1-service1/H/avg/weekend-serviceitem1
>>> host1-service1/D/avg/weekend-serviceitem1
>>> host1-service1/W/avg/weekend-serviceitem1
>>> host1-service1/M/avg/weekend-serviceitem1
>>>
>>> You can also have multiple entries like:
>>> <cache>
>>>   <aggregate>
>>>     <method>avg</method>
>>>     <useweekend>true</useweekend>
>>>   </aggregate>
>>>   <aggregate>
>>>     <method>max</method>
>>>   </aggregate>
>>>   ….
>>> </cache>
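>>>
>>> Following the naming patterns shown earlier, and assuming the max
>>> aggregation is named the same way as avg, this configuration would
>>> presumably produce service definitions such as:
>>> host1-service1/H/avg/weekend-serviceitem1
>>> host1-service1/D/avg/weekend-serviceitem1
>>> host1-service1/H/max-serviceitem1
>>> host1-service1/D/max-serviceitem1
>>> and likewise for the W and M levels.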
>>>
>>> So how long will the aggregated values be kept in the cache? By
>>> default we save
>>> Hourly aggregations for 25 hours
>>> Daily aggregations for 7 days
>>> Weekly aggregations for 5 weeks
>>> Monthly aggregations for 1 month
>>>
>>> These values can be overridden, but they cannot be lower than the
>>> default. Below you have an example where we save the aggregations for
>>> 168 hours, 60 days and 53 weeks.
>>> <cache>
>>>   <aggregate>
>>>     <method>avg</method>
>>>     <useweekend>true</useweekend>
>>>     <retention>
>>>       <period>H</period>
>>>       <offset>168</offset>
>>>     </retention>
>>>     <retention>
>>>       <period>D</period>
>>>       <offset>60</offset>
>>>     </retention>
>>>     <retention>
>>>       <period>W</period>
>>>       <offset>53</offset>
>>>     </retention>
>>>   </aggregate>
>>>   ….
>>> </cache>
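>>>
>>> Putting the pieces together, a sketch of a complete cache section
>>> that keeps raw metrics for 5 weeks and daily averages for 60 days
>>> could look like the following. The numbers are illustrative
>>> assumptions, and it is also an assumption that a single retention
>>> entry can be given while the other levels keep their defaults:
>>> <cache>
>>>   <aggregate>
>>>     <method>avg</method>
>>>     <retention>
>>>       <period>D</period>
>>>       <offset>60</offset>
>>>     </retention>
>>>   </aggregate>
>>>   <purge>
>>>     <offset>5</offset>
>>>     <period>W</period>
>>>   </purge>
>>> </cache>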
>>>
>>> I hope this makes it a bit less confusing. What is clear to me is that
>>> we need to improve the documentation in this area.
>>>
>>> Looking forward to your feedback.
>>> Anders
>>>
>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>> Hi,
>>>> I am trying to set up the bischeck plugin for our organization. I have
>>>> configured most of it except for the cache retention period. Here
>>>> is what I want: I want to store every value which has been generated
>>>> during the past 1 month. The reason is that my threshold is currently
>>>> calculated as the average of the metric value during the past 4
>>>> weeks at the same time of the day.
>>>>
>>>> So, how do I define the cache template for this? If I don't define any
>>>> cache template, for how many days is the data kept?
>>>> Also, how does the aggregate function work and what does the
>>>> purge Maxitems signify?
>>>>
>>>> I've gone through the documentation but it wasn't clear. Looking
>>>> forward to a response.
>>>>
>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>
>>>> Regards,
>>>> Rahul.