
Kernel PCA for nonlinear dimensionality reduction

Most of the techniques in statistics are linear by nature, so in order to capture nonlinearity, we might need to apply some transformation. PCA is, of course, a linear transformation. In this recipe, we'll look at applying nonlinear transformations, and then apply PCA for dimensionality reduction.

Getting ready

Life would be so easy if data was always linearly separable, but unfortunately it's not. Kernel PCA can help to circumvent this issue. Data is first run through the kernel function that projects the data onto a different space; then PCA is performed.

To familiarize yourself with the kernel functions, it will be a good exercise to think of how to generate data that is separable by the kernel functions available in the kernel PCA. Here, we'll do that with the cosine kernel. This recipe will have a bit more theory than the previous recipes.

How to do it...

The cosine kernel works by comparing the angle between two samples represented in the feature space. It is useful when the magnitude of the vector perturbs the typical distance measure used to compare samples.

As a reminder, the cosine between two vectors is given by the following:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

This means that the cosine between A and B is the dot product of the two vectors normalized by the product of the individual norms. The magnitudes of vectors A and B have no influence on this calculation.
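
To make this concrete, here is a quick NumPy check; the vectors a, b, and c are made up purely for this illustration. Scaling a vector changes its magnitude but not its cosine against another vector:

>>> import numpy as np

>>> a = np.array([1., 2.])

>>> b = 10 * a  # same direction as a, 10x the magnitude

>>> c = np.array([2., 1.])

>>> def cosine(u, v):
...     return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

>>> cosine(a, c)  # ≈ 0.8

>>> cosine(b, c)  # also ≈ 0.8; the magnitude of b is irrelevant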

So, let's generate some data and see how useful it is. First, we'll imagine there are two different underlying processes; we'll call them A and B:

>>> import numpy as np

>>> # Covariance matrices must be symmetric and positive semi-definite;
>>> # the off-diagonal entries are adjusted slightly here so that they are.

>>> A1_mean = [1, 1]

>>> A1_cov = [[2, .99], [.99, 1]]

>>> A1 = np.random.multivariate_normal(A1_mean, A1_cov, 50)

>>> A2_mean = [5, 5]

>>> A2_cov = [[2, .99], [.99, 1]]

>>> A2 = np.random.multivariate_normal(A2_mean, A2_cov, 50)

>>> A = np.vstack((A1, A2))

>>> B_mean = [5, 0]

>>> B_cov = [[.5, -.45], [-.45, .5]]

>>> B = np.random.multivariate_normal(B_mean, B_cov, 100)

Once plotted, it will look like the following:
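
The original figure is not reproduced here, but a minimal matplotlib sketch, assuming the A and B arrays generated above, recreates it:

>>> import matplotlib.pyplot as plt

>>> plt.scatter(A[:, 0], A[:, 1], color='b', label='A')

>>> plt.scatter(B[:, 0], B[:, 1], color='r', label='B')

>>> plt.legend()

>>> plt.show()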

By visual inspection, it seems that the two classes are from different processes, but separating them in one slice might be difficult. So, we'll use the kernel PCA with the cosine kernel discussed earlier:

>>> from sklearn import decomposition

>>> kpca = decomposition.KernelPCA(kernel='cosine', n_components=1)

>>> AB = np.vstack((A, B))

>>> AB_transformed = kpca.fit_transform(AB)

Visualized in one dimension after the kernel PCA, the dataset looks like the following:

Contrast this with PCA without a kernel:
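
For that comparison, the plain (linear) PCA projection can be computed like this, as a sketch reusing the AB array from above:

>>> # Ordinary linear PCA on the same stacked dataset
>>> pca = decomposition.PCA(n_components=1)

>>> AB_transformed_pca = pca.fit_transform(AB)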

Clearly, the kernel PCA does a much better job.

How it works...

There are several different kernels available as well as the cosine kernel. You can even write your own kernel function (see the precomputed sketch after the following list). The available kernels are:

1. poly (polynomial)

2. rbf (radial basis function)

3. sigmoid

4. cosine

5. precomputed
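
As an illustration of the precomputed option, the kernel matrix can be built by hand and passed in directly. This is a sketch reusing the AB array from earlier; up to a sign flip of the components, it should agree with the kernel='cosine' result:

>>> from sklearn.metrics.pairwise import cosine_similarity

>>> K = cosine_similarity(AB)  # (n_samples, n_samples) kernel matrix

>>> kpca_pre = decomposition.KernelPCA(kernel='precomputed', n_components=1)

>>> AB_pre = kpca_pre.fit_transform(K)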

There are also options contingent on the kernel choice. For example, the degree argument will specify the degree for the poly kernel, while gamma will affect the rbf, poly, and sigmoid kernels.
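
A minimal sketch of passing these options follows; the parameter values here are arbitrary and chosen only for illustration:

>>> kpca_rbf = decomposition.KernelPCA(kernel='rbf', gamma=0.5, n_components=1)

>>> kpca_poly = decomposition.KernelPCA(kernel='poly', degree=3, gamma=0.1, n_components=1)

>>> rbf_transformed = kpca_rbf.fit_transform(AB)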

The recipe on SVM will cover the rbf kernel function in more detail.

A word of caution: kernel methods are great for creating separability, but they can also cause overfitting if used without care.

Original source: http://www.packtpub.com

Original author: Trent Hauck
