High CPU consumption by java and redis-server
Rahul Amaram
rahul.amaram at vizury.com
Sat Sep 20 07:46:59 CEST 2014
Hi Anders,
I have raised a ticket -
http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=260
.
Really appreciate the prompt responses and the quick action to fix the
issue :).
Regards,
Rahul.
On Friday 19 September 2014 07:38 PM, Anders Håål wrote:
> I have looked at the logic for query by time and the current
> implementation works really bad for large list and just bad for
> smaller lists. The searching is just a brute force sequential search
> to find the closest. So the longer the list is and if the query is
> time that is up in the end of the list performance will just suck. To
> my only defense this is remaining code from when we did not have the
> cache in redis, but Im not sure its an good excuse.
> Okay enough with remorse, what can be done? I have tested using a
> binary search. On a list with 15000 items we get a hit after 13-14
> iterations. So for the same "heavy" test query I got the time down
> from 6684 ms to 55 ms.
> The plan is to fix this and then release a 1.2.0 beta that hopefully
> you and others can test. Hopefully I can have a beta in the beginning
> of the week.
>
> @Rahul - Thanks for finding this issue. If you have the time please
> make a bug report on this so we have the history. I also think I
> misunderstood your question about "if data is missing in-between" and
> you are of course correct that this will not work well using a query
> by index.
>
> Anders
>
>
> On 09/18/2014 10:05 AM, anders.haal at ingby.com wrote:
>> Hi Rahul,
>> I really like numbers. Can you describe how much it changed to the
>> better, bischeck utilization and redis utilization, etc? I will as
>> soon as possible start testing to verify the problem you encounter so
>> data to compare with would be great. If possible it would be great to
>> have a look of your bischeck configuration files if you can share
>> them. Of course you can send all that stuff directly to me if you do
>> not want it to be shared on the list.
>>
>> When I take about index for time(stamp) it will use a different redis
>> storage collection in. In the current version we use a linked list to
>> store metrics data. That makes finding a metric item by its position,
>> in bischeck called "query by index", easy and fairly quick. But when
>> when we search for timestamp, which is data on the metrics stored in
>> the the list we need to sort the the list to find the item closest to
>> the time directive, like -24H- When list grow this search becomes
>> heavier and heavier. So the idea we have is to store the metrics data
>> in a sorted set with timestamp as the score, see redis.io, and inte
>> linked list only store the timestamp/score. With the finding a time
>> range will be fast since its by score, like [-12M:-30M]. Searching
>> for a index position, like [3] will require two reads, find it in the
>> linked list, get the score and read the full item from the sorted
>> set. Finding range by index, [0:9] will finding score for index 0 and
>> 9 and then get all item from the sorted set by the score.
>> When it comes to query by time [-12H] our definition is to find the
>> metric with closest to 12 hours ago. As you say this is a bit more
>> tricky. What do closes mean and how much should the span be? Like you
>> say if there is no data for a period of hours, well then the are now
>> data and maybe thats would be returned. The same goes for a time
>> range, [-12H, -24H]. If there are no data for that period the current
>> version will return null. So for the the query by time, [-12H], we
>> should maybe use some sort of tolerance so the search is always a
>> range of +- X minutes around the request value, and what is returned
>> is the median of the result set.
>> Input and feedback is welcome, but this change, I think, will come in
>> a major version of bischeck. The question is now in the short term we
>> could improve the search speed for a large list, if that is the problem.
>>
>> When it comes to using multiple instance of redis Im not sure its a
>> great idea. That would mean that bischeck must implement some sort of
>> partioning/sharding. Redis has proven it self to be an excellent
>> storage so I strongly believe we first need to implement our usage of
>> it in the right way before we start thinking about this type of
>> solutions.
>>
>>
>>
>> On 2014-09-18 05:47, Rahul Amaram wrote:
>>> Using indices has got down CPU usage of the java process
>>> considerably. But won't indices create problem if data is missing
>>> in-between? For ex. if data for 2-3 hours is missing, then won't
>>> indices be offset by that time? Also, does bischeck support using
>>> multiple redis-instances so that multi-core be exploited?
>>>
>>> Regards,
>>> Rahul.
>>>
>>> On Thursday 18 September 2014 01:58 AM, Anders Håål wrote:
>>>> Okay. 4-6 to seconds is absolutely to much and it may be related
>>>> how the query by time is implement. The current search is pretty
>>>> brute force finding the "right" time. Its not searching linearly
>>>> but there are no "index" on time. Searching by index is much
>>>> quicker and query by time will be related to the size. With your 6
>>>> values the search has to be done 6 times over a list that are 5000
>>>> items. The future idea I mentioned will be a sort of index for the
>>>> timestamp by using a sorted set.
>>>> What I would recommend you to do is to use index instead and see
>>>> how that effect the performance. Since you use a interval of 120
>>>> sec, the -24H will be the same as index 720, -96H will be the same
>>>> as index 2880, etc.
>>>> I will try to get the time to set up an equivalent test
>>>> environment. Keep me updated of your investigation
>>>> Anders
>>>>
>>>> On 09/17/2014 09:18 PM, Rahul Amaram wrote:
>>>>> /When it comes to your last finding I have no explanation. Just to
>>>>> understand you compare using -24H with -10080M (-168H). Would it
>>>>> not be better to compare -24H and -1440M. I have to get back to
>>>>> you on this but I would need to get the result when running in
>>>>> cacheCli since you get the time it takes,
>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4.//
>>>>> ///
>>>>> This was a typo. I was talking about -168H and -10080M. Also, I
>>>>> used "bischeck cli.CacheCli" to check this. And I re-ran this now,
>>>>> but not finding much difference between both of them (it takes
>>>>> about 4-6 seconds to retrieve the value).
>>>>>
>>>>> Reg. other points, I have to get back to you. On a side note, I
>>>>> have upgraded from redis-server 2.6 to 2.8, just to rule out any
>>>>> version performance issues.
>>>>>
>>>>> Thanks,
>>>>> Rahul.
>>>>>
>>>>>
>>>>> On Thursday 18 September 2014 12:19 AM, Anders Håål wrote:
>>>>>> Hi Rahul,
>>>>>> Looking at your threshold this means that you will retrieve max 6
>>>>>> values, which should not be that "hard" even if its a time based
>>>>>> query - using index is faster and is something we will look into
>>>>>> in the future.
>>>>>> Since you run the query every 120 sec it means that you currently
>>>>>> have at lest 5040 items in the cache for this each service, which
>>>>>> does not sound to bad. 10 services at least 50000 in total.
>>>>>> What I like you to check is the following:
>>>>>> - If you connect with some JMX client against bischeck you can
>>>>>> see all the different timers
>>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Chapter-5.
>>>>>> The once that are related to threshold are inserting to start
>>>>>> with but check all the different timers if some one have long
>>>>>> execution time.
>>>>>> - Since the its the redis-server that are consume a high level of
>>>>>> CPU its interesting to see the configuration for redis - like the
>>>>>> amount of memory allocated. If redis need to swap its not good.
>>>>>> - Please check the redis log files.
>>>>>> - You can also connect to redis with redis-cli and run command
>>>>>> "monitor" to get a real time listing on the commands executed
>>>>>> against redis.
>>>>>> - Also check with top the percentage of %wa, waiting for io. How
>>>>>> much memory do you have on the server? Only running bischeck and
>>>>>> redis?
>>>>>> - How much cpu is bischeck consuming? Do you see any peaks?
>>>>>> - Also check the bischeck log to see any ERROR or WARN.
>>>>>> - And finally - has this been the behavior from the beginning or
>>>>>> has it increased over time? What happen if you restart bischeck
>>>>>> (not reload)?
>>>>>>
>>>>>> Try to collect some more info so we can try to determine where
>>>>>> the issue is related.
>>>>>>
>>>>>> When it comes to your last finding I have no explanation. Just to
>>>>>> understand you compare using -24H with -10080M (-168H). Would it
>>>>>> not be better to compare -24H and -1440M. I have to get back to
>>>>>> you on this but I would need to get the result when running in
>>>>>> cacheCli since you get the time it takes,
>>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4.
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Anders
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 09/17/2014 07:13 PM, Rahul Amaram wrote:
>>>>>>> Hi,
>>>>>>> I am observing very high CPU consumption by the java process and
>>>>>>> redis-server. redis-server being single threaded it self is
>>>>>>> taking 100% CPU. I have about 10 hosts, with about 10 services
>>>>>>> each (with one service item per service). The time interval for
>>>>>>> generation of value is 120s. The threshold that I have defined is:
>>>>>>>
>>>>>>> avg($$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-96H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-168H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-336H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-504H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-672H])
>>>>>>>
>>>>>>>
>>>>>>> However, currently, not more than 3 values, are available.
>>>>>>>
>>>>>>> I am already running this on a c3.xlarge machine (4 cores) and
>>>>>>> the load average is quite often > 4 resulting in delay of
>>>>>>> generation of values. Any pointers in what could be causing the
>>>>>>> high load would be much appreciated.
>>>>>>>
>>>>>>> On a slightly different note, while using cli.CacheCli,
>>>>>>> retrieving the value of a service item one week back using hours
>>>>>>> (-24H) is considerably faster than retrieving it using minutes
>>>>>>> (-10080M). Again, why does bischeck behave this way?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ingby<http://www.ingby.com>
>>>>
>>>> IngbyForge<http://gforge.ingby.com>
>>>>
>>>> bischeck - dynamic and adaptive thresholds for Nagios<http://www.bischeck.org>
>>>>
>>>> anders.haal at ingby.com<mailto:anders.haal at ingby.com>
>>>>
>>>> Mjukvara genom ingenjörsmässig kreativitet och kompetens
>>>>
>>>> Ingenjörsbyn
>>>> Box 531
>>>> 101 30 Stockholm
>>>> Sweden
>>>> www.ingby.com <http://www.ingby.com/>
>>>> Mobil: +46 70 575 35 46
>>>> Tele: +46 75 75 75 090
>>>> Fax: +46 75 75 75 091
>>>>
>>>
>>>
>>
>> --
>>
>>
>> Ingby<http://www.ingby.com>
>>
>> IngbyForge<http://gforge.ingby.com>
>>
>> bischeck - dynamic and adaptive monitoring for Nagios<http://www.bischeck.org>
>>
>> anders.haal at ingby.com<mailto:anders.haal at ingby.com>
>>
>> Mjukvara genom ingenjörsmässig kreativitet och kompetens
>>
>> Ingenjörsbyn
>> Box 531
>> 101 30 Stockholm
>> Sweden
>> www.ingby.com <http://www.ingby.com/>
>> Mobil: +46 70 575 35 46
>> Tele: +46 75 75 75 090
>> Fax: +46 75 75 75 091
>
>
> --
>
> Ingby<http://www.ingby.com>
>
> IngbyForge<http://gforge.ingby.com>
>
> bischeck - dynamic and adaptive thresholds for Nagios<http://www.bischeck.org>
>
> anders.haal at ingby.com<mailto:anders.haal at ingby.com>
>
> Mjukvara genom ingenjörsmässig kreativitet och kompetens
>
> Ingenjörsbyn
> Box 531
> 101 30 Stockholm
> Sweden
> www.ingby.com <http://www.ingby.com/>
> Mobil: +46 70 575 35 46
> Tele: +46 75 75 75 090
> Fax: +46 75 75 75 091
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.monitoring-lists.org/archive/bischeck-users/attachments/20140920/494c0071/attachment-0001.html>
More information about the Bischeck-users
mailing list