
Label propagation with semi-supervised learning

Author: 到不了的都叫做远方
Modified: 2020-05-06 11:45:09

Label propagation is a semi-supervised technique that makes use of the labeled and unlabeled data to learn about the unlabeled data. Quite often, data that will benefit from a classification algorithm is difficult to label. For example, labeling data might be very expensive, so only a subset is cost-effective to manually label. This said, there does seem to be slow, but growing, support for companies to hire taxonomists.

Getting ready

Another problem area is censored data. You can imagine a case where the frontier of time will affect your ability to gather labeled data. Say, for instance, you took measurements of patients and gave them an experimental drug. In some cases, you are able to measure the outcome of the drug, if it happens fast enough, but you might want to predict the outcome of the drugs that have a slower reaction time. The drug might cause a fatal reaction for some patients, and life-saving measures might need to be taken.

How to do it...

In order to represent the semi-supervised or censored data, we'll need to do a little data preprocessing. First, we'll walk through a simple example, and then we'll move on to some more difficult cases:

from sklearn import datasets
d = datasets.load_iris()

Due to the fact that we'll be messing with the data, let's make copies and add an unlabeled member to the target name's copy. It'll make it easier to identify data later:

import numpy as np

X = d.data.copy()
y = d.target.copy()
names = d.target_names.copy()
names = np.append(names, ['unlabeled'])
names
array(['setosa', 'versicolor', 'virginica', 'unlabeled'], dtype='<U10')

Now, let's update y with -1 . This is the marker for the unlabeled case. This is also why we added unlabeled to the end of names:

y[np.random.choice([True, False], len(y))] = -1

Our data now has a bunch of negative ones ( -1 ) interspersed with the actual data:

y[:10]
array([-1, -1, -1, -1, 0, 0, -1, -1, 0, -1])
names[y[:10]]
array(['unlabeled', 'unlabeled', 'unlabeled', 'unlabeled', 'setosa',
       'setosa', 'unlabeled', 'unlabeled', 'setosa', 'unlabeled'],
      dtype='<U10')

We clearly have a lot of unlabeled data, and the goal now is to use LabelPropagation to predict the labels:

from sklearn import semi_supervised
lp = semi_supervised.LabelPropagation()
lp.fit(X, y)
LabelPropagation(gamma=20, kernel='rbf', max_iter=1000, n_jobs=None,
                 n_neighbors=7, tol=0.001)
preds = lp.predict(X)
(preds == d.target).mean()
0.9733333333333334

Not too bad, though we did use all the data, so it's kind of cheating. Also, the iris dataset is a fairly well-separated dataset. While we're at it, let's look at LabelSpreading, the "sister" class of LabelPropagation. We'll make the technical distinction between LabelPropagation and LabelSpreading in the How it works... section of this recipe, but they are extremely similar:
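Since scoring on the full set is kind of cheating, here's a quick sketch of a fairer check, restricted to the points whose labels were actually hidden. The seed and the masking scheme below are my own arbitrary choices, not part of the original recipe:

```python
import numpy as np
from sklearn import datasets, semi_supervised

d = datasets.load_iris()
X, y_true = d.data, d.target
rng = np.random.RandomState(0)

# Hide roughly half the labels with the -1 marker
y = y_true.copy()
hidden = rng.choice([True, False], len(y))
y[hidden] = -1

lp = semi_supervised.LabelPropagation().fit(X, y)

# Score only the points the model never saw labels for
acc_hidden = (lp.predict(X[hidden]) == y_true[hidden]).mean()
```

On iris this held-out score usually stays close to the in-sample number, precisely because the classes are well separated.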

ls = semi_supervised.LabelSpreading()

LabelSpreading is more robust to noise, as can be seen from the way it works:

ls.fit(X, y)
LabelSpreading(alpha=0.2, gamma=20, kernel='rbf', max_iter=30, n_jobs=None,
               n_neighbors=7, tol=0.001)
(ls.predict(X)== d.target).mean()
0.9666666666666667

Don't take the fact that the label-spreading algorithm missed one more example as an indication that it performs worse in general. The whole point is that we give up some ability to predict perfectly on the training set in exchange for performing well over a wider range of situations.
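To see that trade-off in action, here's a small experiment of my own (the seed, the mask, and the amount of corruption are arbitrary choices, not from the recipe): hide half the labels, flip a few of the visible ones, and score both estimators only on the hidden points. Whether spreading actually comes out ahead varies with the noise:

```python
import numpy as np
from sklearn import datasets, semi_supervised

d = datasets.load_iris()
X, y_true = d.data, d.target
rng = np.random.RandomState(0)

y = y_true.copy()
hidden = rng.choice([True, False], len(y))   # hide ~half the labels
y[hidden] = -1
visible = np.where(~hidden)[0]
flip = rng.choice(visible, 10, replace=False)
y[flip] = (y_true[flip] + 1) % 3             # corrupt 10 visible labels

lp = semi_supervised.LabelPropagation().fit(X, y)
ls = semi_supervised.LabelSpreading().fit(X, y)
acc_lp = (lp.predict(X[hidden]) == y_true[hidden]).mean()
acc_ls = (ls.predict(X[hidden]) == y_true[hidden]).mean()
```

LabelSpreading's alpha parameter (0.2 by default) lets each point partially discount its given label, which is what softens the impact of the corrupted entries.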

How it works...

Label propagation works by creating a graph of the data points, with a weight placed on each edge. For the rbf kernel used above, the weight between points x_i and x_j is:

w_ij = exp(-gamma * ||x_i - x_j||^2)

so nearby points get heavier edges.

The algorithm then works by labeled data points propagating their labels to the unlabeled data. This propagation is partly determined by the edge weights, which can be placed in a matrix of transition probabilities. We can iteratively determine a good estimate of the actual labels.
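As a sketch of that machinery (simplified, and my own construction rather than scikit-learn's internals: dense rbf weights, hard clamping, and a fixed iteration count instead of a convergence check), we can build the transition matrix by hand and propagate:

```python
import numpy as np
from sklearn import datasets

d = datasets.load_iris()
X, y_true = d.data, d.target
rng = np.random.RandomState(0)
labeled = rng.choice([True, False], len(y_true))  # which points keep labels

# One-hot label matrix; unlabeled rows start at zero
Y = np.zeros((len(y_true), 3))
Y[labeled, y_true[labeled]] = 1.0

# rbf edge weights: w_ij = exp(-gamma * ||x_i - x_j||^2)
gamma = 20.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-gamma * sq_dists)

# Row-normalize into a transition-probability matrix
T = W / W.sum(axis=1, keepdims=True)

for _ in range(100):
    Y = T @ Y                           # propagate labels along the edges
    Y[labeled] = 0.0
    Y[labeled, y_true[labeled]] = 1.0   # clamp the known labels each pass

preds = Y.argmax(axis=1)
acc = (preds[~labeled] == y_true[~labeled]).mean()
```

The hard clamp in the loop is what distinguishes LabelPropagation; LabelSpreading instead blends each point's current estimate with its original label via alpha, which is the "technical distinction" mentioned above.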

This article is a translation of foreign-language content.