Possibility to avoid certain values which are way too deviant while calculating threshold

Anders Håål anders.haal at ingby.com
Wed Dec 17 21:57:35 CET 2014


Sorry about the missing link, here it is:
http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations


The problem is not writing the code; the problem is finding the logic 
that determines which numbers to remove from the data set. What counts 
as a deviation from the normal spread in the set?

Googling a bit more, I found these definitions, which may be applicable 
to your use case (the first one uses stdev):

*Mean and Standard Deviation Method*

For this outlier detection method, the mean and standard deviation of 
the residuals are calculated and compared. If a value is a certain 
number of standard deviations away from the mean, that data point is 
identified as an outlier. The specified number of standard deviations is 
called the threshold. The default value is 3.

This method can fail to detect outliers because the outliers increase 
the standard deviation. The more extreme the outlier, the more the 
standard deviation is affected.
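
To make the weakness concrete, here is a minimal Python sketch of the method as described above. It is not Bischeck code; the function name `stddev_outliers` and the six sample values are made up for illustration. With only six values, no value can lie more than (n-1)/sqrt(n), about 2.04, sample standard deviations from the mean, so the default threshold of 3 can never trigger for a window that small.

```python
from statistics import mean, stdev


def stddev_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean (hypothetical helper, not a Bischeck API)."""
    if len(values) < 2:
        return []  # stdev needs at least two data points
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - m) > threshold * s]


# Six daily values (invented), one of them far off.
daily = [100, 102, 98, 101, 99, 500]

print(stddev_outliers(daily))       # [] : the outlier inflates the stdev and hides itself
print(stddev_outliers(daily, 2.0))  # [500]
```

So for a six-value window the threshold would have to be lowered to around 2 before this method flags anything.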

*Median and Median Absolute Deviation Method (MAD)*

For this outlier detection method, the median of the residuals is 
calculated. Then, the difference is calculated between each historical 
value and this median. These differences are expressed as their absolute 
values, and a new median is calculated and multiplied by an empirically 
derived constant to yield the median absolute deviation (MAD). If a 
value is a certain number of MAD away from the median of the residuals, 
that value is classified as an outlier. The default threshold is 3 MAD.

This method is generally more effective than the mean and standard 
deviation method for detecting outliers, but it can be too aggressive in 
classifying values that are not really extremely different. Also, if 
more than 50% of the data points have the same value, MAD is computed to 
be 0, so any value different from the residual median is classified as 
an outlier.
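
A minimal Python sketch of the MAD method as described above, again with made-up names and data. The scale constant 1.4826 is an assumption on my part: it is the value commonly used to make MAD comparable to a standard deviation for normally distributed data, and the definition quoted above does not say which constant it uses. The last lines show the use case of taking the mean of what remains.

```python
from statistics import mean, median


def mad_outliers(values, threshold=3.0, scale=1.4826):
    """Return the values lying more than `threshold` MADs from the median.

    `scale` is an assumed constant (1.4826, the usual choice for normal data).
    """
    if not values:
        return []
    med = median(values)
    # Median of the absolute differences to the median, times the constant.
    mad = scale * median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) > threshold * mad]


# Six daily values (invented), one of them far off.
daily = [100, 102, 98, 101, 99, 500]
outliers = mad_outliers(daily)
print(outliers)  # [500]

# More than half of the values are equal, so MAD is 0 and anything
# different from the median is flagged.
print(mad_outliers([5, 5, 5, 5, 6, 5]))  # [6]

# Mean of the remaining values after dropping the outliers.
kept = [v for v in daily if v not in outliers]
print(mean(kept))  # mean of 100, 102, 98, 101, 99
```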

*Median and Interquartile Deviation Method (IQD)*

For this outlier detection method, the median of the residuals is 
calculated, along with the 25th percentile and the 75th percentile. The 
difference between the 25th and 75th percentile is the interquartile 
deviation (IQD). Then, the difference is calculated between each 
historical value and the residual median. If the historical value is a 
certain number of IQD away from the median of the residuals, that value 
is classified as an outlier. The default threshold is 2.22, which is 
equivalent to 3 standard deviations or MADs.

This method is somewhat susceptible to influence from extreme outliers, 
but less so than the mean and standard deviation method. Box plots are 
based on this approach. The median and interquartile deviation method 
can be used for both symmetric and asymmetric data.
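
A minimal Python sketch of the IQD method as described above, with the same made-up sample data. The function name is hypothetical, and the way the percentiles are interpolated (`method="inclusive"` in Python's `statistics.quantiles`, available from Python 3.8) is my assumption; the quoted definition does not say how the 25th and 75th percentiles should be computed, and for very small data sets the choice can change the result noticeably.

```python
from statistics import median, quantiles


def iqd_outliers(values, threshold=2.22):
    """Return the values lying more than `threshold` IQDs from the median
    (hypothetical helper, not a Bischeck API)."""
    if len(values) < 2:
        return []  # quartiles are not meaningful for a single value
    med = median(values)
    # 25th and 75th percentiles; the interpolation method is an assumption.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqd = q3 - q1
    return [v for v in values if abs(v - med) > threshold * iqd]


# Six daily values (invented), one of them far off.
daily = [100, 102, 98, 101, 99, 500]
print(iqd_outliers(daily))  # [500]

# Without the extreme value nothing is flagged.
print(iqd_outliers([100, 102, 98, 101, 99, 103]))  # []
```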

If you find a method that you think could work, we could implement it 
together and you can verify it with your data. Can you say anything 
about the data collected?
Anders

On 12/17/2014 09:25 PM, Rahul Amaram wrote:
> Hi Andre,
>
> So, I would like to remove the outlier and calculate the mean for the 
> remaining elements. Any suggestion apart from writing my own custom 
> math function? Also, I don't think that you have shared the link.
>
> Thanks,
> Rahul.
>
> On Thursday 18 December 2014 12:55 AM, Anders Håål wrote:
>> Hi Rahul,
>> It's possible, but the question is what algorithm to use. The second 
>> question would be what you would do with the remaining set: 
>> calculate a mean?
>> When it comes to excluding a deviant value, it sounds close to what is 
>> called an outlier, http://en.wikipedia.org/wiki/Outlier. There are a 
>> number of mathematical solutions to this problem, but I am not sure 
>> which would be applicable or correct. Check this link for a discussion 
>> on the topic where one approach is using standard deviation, but from 
>> the discussion it does not sound like a statistically correct approach.
>>
>> If you or anyone else on this list finds a good approach, I am more 
>> than happy to try it. In Bischeck it's possible to plug in your own 
>> functions as described in 
>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-6.2 
>> so you can easily do your own testing. Using the cache browser CLI 
>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4 
>> you can easily test your function.
>>
>> Anders
>>
>>
>> On 12/17/2014 03:40 PM, Rahul Amaram wrote:
>>> Hi,
>>>
>>> I had a quick question. Let us say we calculate the threshold based 
>>> on the values of the past six days, one value per day. Now let us 
>>> say, out of 6 values, one of these values is way too deviant. Then 
>>> is it possible to exclude this deviant value from calculating the 
>>> threshold?
>>>
>>> Thanks,
>>> Rahul.
>>
>>
>


-- 

Ingby<http://www.ingby.com>

IngbyForge<http://gforge.ingby.com>

bischeck - dynamic and adaptive thresholds for Nagios <http://www.bischeck.org>

anders.haal at ingby.com<mailto:anders.haal at ingby.com>

Mjukvara genom ingenjörsmässig kreativitet och kompetens

Ingenjörsbyn
Box 531
101 30 Stockholm
Sweden
www.ingby.com <http://www.ingby.com/>
Mobil: +46 70 575 35 46
Tele: +46 75 75 75 090
Fax:  +46 75 75 75 091
