
You need to calculate the mean (average) and standard deviation for the column. Standard deviation is a bit confusing, but the important fact is that, for roughly normally distributed data, about 2/3 of the values fall within

Mean +/- StandardDeviation 

Generally anything outside Mean +/- 2 * StandardDeviation is an outlier, but you can tweak the multiplier.

<a href="http://en.wikipedia.org/wiki/Standard_deviation" rel="nofollow">http://en.wikipedia.org/wiki/Standard_deviation</a>

So to be clear, you want to convert each value to the number of standard deviations it lies from the mean.

i.e.

<pre><code>def getdeviations(x, mean, stddev):
    # Built-in abs() is used here; the math module has fabs() but no abs().
    return abs(x - mean) / stddev
</code></pre>

NumPy has functions for this, such as <code>numpy.mean</code> and <code>numpy.std</code>.
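As a rough sketch of the idea above, here is how the check might look using only Python's standard <code>statistics</code> module (the NumPy equivalents work the same way on arrays). The helper names <code>z_scores</code> and <code>flag_outliers</code> are made up for illustration, and using the population standard deviation (<code>pstdev</code>) is an assumption; the sample column is the %Availability data from the question.

```python
import statistics

def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)  # assumption: population std deviation
    if stddev == 0:
        # All values are identical, so nothing deviates from the mean.
        return [0.0 for _ in values]
    return [abs(v - mean) / stddev for v in values]

def flag_outliers(values, multiplier=2.0):
    """Indices of values more than `multiplier` std deviations from the mean."""
    return [i for i, z in enumerate(z_scores(values)) if z > multiplier]

# %Availability column from the question, in row order (ams-a ... sjc-a).
availability = [98.099, 98.099, 55.123, 99.110, 98.123, 75.005, 99.020]
print(flag_outliers(availability))
```

With only these seven sample values, just bos-b (index 2) exceeds two standard deviations. ord-b at 75% does not, because the two low values inflate the standard deviation themselves, so the multiplier may need tuning on small data sets.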


I think your best bet is to look into <a href="http://www.scipy.org/" rel="nofollow">scipy</a>'s <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.scoreatpercentile.html" rel="nofollow">scoreatpercentile</a> function. For instance, you could try excluding all the values that are above the 99th percentile.
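To illustrate the percentile idea without requiring scipy, here is a minimal sketch with a hand-written linear-interpolation percentile. The helpers <code>percentile</code> and <code>above_percentile</code> are hypothetical stand-ins, not scipy's API, and the sample column is the Errors/Sec data from the question.

```python
def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def above_percentile(values, pct=99.0):
    """Values strictly greater than the pct-th percentile cutoff."""
    cutoff = percentile(values, pct)
    return [v for v in values if v > cutoff]

# Errors/Sec column from the question, in row order (ams-a ... sjc-a).
errors_per_sec = [678, 12, 576, 10, 11, 123, 10]
print(above_percentile(errors_per_sec, 99))
```

A percentile cutoff always flags roughly the same fraction of the data, so on a handful of rows the 99th percentile will only ever single out the largest value.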

Mean and standard deviation are poor guides if your data is not at least roughly normally distributed.

Generally it's good to have a rough visual idea of what your data looks like. There is <a href="http://matplotlib.org/" rel="nofollow">matplotlib</a>; I recommend you make some plots of your data with it before deciding on a plan.


Your stated goal of "finding badness" implies that it is not the outliers that you are looking for, but observations that fall above or below some threshold, and I would presume that the threshold would remain the same over time.

As an example, if all of your servers were at 98 ± 0.1 % availability, a server at 100% availability would be an outlier, as would a server at 97.6% availability. But these may be within your desired limits.

On the other hand, there may be good reasons a priori to want to be notified of any server at less than 95% availability, whether there is one server or many below this threshold.

For this reason, a search for outliers may not provide the information that you are interested in. The thresholds could be determined statistically from historical data, e.g. by modeling error rate as a Poisson variable or percent availability as a beta variable. In an applied setting, these thresholds could probably be determined based on performance requirements.
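A fixed-threshold check of this kind might be sketched as follows. Only the 95% availability limit comes from the discussion above; the error-rate and memory limits, the column names, and the <code>find_violations</code> helper are assumptions chosen for illustration. Each rule carries its own comparison, since a higher value is bad for some columns and a lower value is bad for others.

```python
import operator

COLUMNS = ["cluster", "availability", "requests_per_sec",
           "errors_per_sec", "memory_utilization"]

# column -> (comparison that means "bad", limit)
THRESHOLDS = {
    "availability": (operator.lt, 95.0),        # below 95% is bad
    "errors_per_sec": (operator.gt, 100),       # assumed limit
    "memory_utilization": (operator.ge, 100),   # assumed limit
}

def find_violations(rows):
    """Return (cluster, column, value) for every threshold that is breached."""
    violations = []
    for row in rows:
        record = dict(zip(COLUMNS, row))
        for column, (is_bad, limit) in THRESHOLDS.items():
            if is_bad(record[column], limit):
                violations.append((record["cluster"], column, record[column]))
    return violations

# Sample rows from the question.
rows = [
    ["ams-a", 98.099, 1012, 678, 91],
    ["bos-a", 98.099, 1111, 12, 91],
    ["bos-b", 55.123, 1513, 576, 22],
    ["lax-a", 99.110, 988, 10, 89],
    ["pdx-a", 98.123, 1121, 11, 90],
    ["ord-b", 75.005, 1301, 123, 100],
    ["sjc-a", 99.020, 1000, 10, 88],
]

for cluster, column, value in find_violations(rows):
    print(cluster, column, value)
```

With these assumed limits, the clusters reported are ams-a, bos-b, and ord-b, regardless of how many other clusters are healthy or unhealthy.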


One good way of identifying outliers visually is to make a boxplot (or box-and-whiskers plot), which shows the median, the quartiles above and below it, and the points that lie "far" from this box (see the Wikipedia entry <a href="http://en.wikipedia.org/wiki/Box_plot" rel="noreferrer">http://en.wikipedia.org/wiki/Box_plot</a>). In R, there's a <code>boxplot</code> function to do just that.

One way to discard/identify outliers programmatically is to use the MAD, or <a href="http://en.wikipedia.org/wiki/Median_absolute_deviation" rel="noreferrer">Median Absolute Deviation</a>. Unlike the standard deviation, the MAD is not sensitive to outliers. As a rule of thumb, I sometimes treat all points that are more than 5*MAD away from the median as outliers.
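The 5*MAD rule of thumb could be sketched like this with Python's standard <code>statistics</code> module. The function name <code>mad_outliers</code> is made up for illustration, and the sample columns are taken from the question.

```python
import statistics

def mad_outliers(values, multiplier=5.0):
    """Indices of values more than multiplier * MAD away from the median."""
    median = statistics.median(values)
    deviations = [abs(v - median) for v in values]
    mad = statistics.median(deviations)
    # If MAD is 0 (over half the values equal the median), this comparison
    # flags every value that differs from the median at all.
    return [i for i, d in enumerate(deviations) if d > multiplier * mad]

# Columns from the question, in row order (ams-a ... sjc-a).
availability = [98.099, 98.099, 55.123, 99.110, 98.123, 75.005, 99.020]
errors_per_sec = [678, 12, 576, 10, 11, 123, 10]

print(mad_outliers(availability))
print(mad_outliers(errors_per_sec))
```

On the sample data this flags bos-b and ord-b for availability, and ams-a, bos-b, and ord-b for error rate, because the median and MAD are barely moved by the extreme rows.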

I have a python script that creates a list of lists of server uptime and performance data, where each sub-list (or 'row') contains a particular cluster's stats. For example, nicely formatted it looks something like this:

<pre><code>------- ------------- ------------ ---------- -------------------
Cluster %Availability Requests/Sec Errors/Sec %Memory_Utilization
------- ------------- ------------ ---------- -------------------
ams-a 98.099 1012 678 91
bos-a 98.099 1111 12 91
bos-b 55.123 1513 576 22
lax-a 99.110 988 10 89
pdx-a 98.123 1121 11 90
ord-b 75.005 1301 123 100
sjc-a 99.020 1000 10 88
...(so on)...
</code></pre>

So in list form, it might look like:

<pre><code>[['ams-a', 98.099, 1012, 678, 91], ['bos-a', 98.099, 1111, 12, 91], ...]
</code></pre>

My question: What's the best way to determine the outliers in each column? Or are outliers not necessarily the best way to attack the problem of finding 'badness'? In the data above, I'd definitely want to know about bos-b and ord-b, as well as ams-a since its error rate is so high, but the others can be discarded. Since higher is not necessarily worse, nor is lower, depending on the column, I'm trying to figure out the most efficient way to do this. It seems like numpy gets mentioned a lot for this sort of stuff, but I'm not sure where to even start with it (sadly, I'm more sysadmin than statistician...).

Thanks in advance!

Finding outliers in a data set
