# Machine Learning Series (3): Feature Extraction and Processing

### Extracting Features from Categorical Variables

scikit-learn's `DictVectorizer` class can be used to one-hot encode categorical features:

```
from sklearn.feature_extraction import DictVectorizer

onehot_encoder = DictVectorizer()
instances = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
print(onehot_encoder.fit_transform(instances).toarray())
```
```
[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
```
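What `DictVectorizer` does here can be illustrated with a minimal plain-Python sketch. The `one_hot` helper below is hypothetical, written for illustration only; it is not scikit-learn's implementation, but it produces the same alphabetical column order as the output above:

```python
def one_hot(instances, key):
    """Toy one-hot encoder: maps each distinct value of `key` to a binary column."""
    categories = sorted({d[key] for d in instances})   # fixed, alphabetical column order
    index = {c: i for i, c in enumerate(categories)}   # value -> column
    vectors = []
    for d in instances:
        row = [0.0] * len(categories)
        row[index[d[key]]] = 1.0                       # exactly one active column per instance
        vectors.append(row)
    return categories, vectors

categories, vectors = one_hot(
    [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}], 'city')
print(categories)  # → ['Chapel Hill', 'New York', 'San Francisco']
print(vectors)     # → [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

Each instance activates exactly one column, which is why this encoding avoids imposing an artificial ordering on the categories.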

### Extracting Features from Text

#### The Bag-of-Words Model

```
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
```
```
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'unc': 7, 'played': 5, 'game': 2, 'in': 3, 'basketball': 0, 'the': 6, 'duke': 1, 'lost': 4}
```
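The count matrix above can be reproduced with a small plain-Python sketch. The `bag_of_words` helper is assumed here for illustration only; `CountVectorizer` additionally applies a token pattern that drops one-character tokens, which this toy version omits:

```python
def bag_of_words(corpus):
    """Toy bag-of-words: alphabetical vocabulary, raw counts per document."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({w for doc in docs for w in doc})
    return vocab, [[doc.count(w) for w in vocab] for doc in docs]

vocab, counts = bag_of_words([
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
])
print(vocab)   # → ['basketball', 'duke', 'game', 'in', 'lost', 'played', 'the', 'unc']
print(counts)  # → [[1, 1, 0, 1, 0, 1, 0, 1], [1, 1, 1, 0, 1, 0, 1, 0]]
```

Column `i` of each row is the count of vocabulary word `i`, matching the rows printed by `CountVectorizer` above.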

```
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
```
```
[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'game': 3, 'in': 4, 'ate': 0, 'basketball': 1, 'the': 8, 'sandwich': 7, 'duke': 2, 'lost': 5}
```

The `euclidean_distances` function in scikit-learn computes the distance between pairs of vectors; the two documents that are closest in meaning should also be the two whose vectors are closest together in space.

```
from sklearn.metrics.pairwise import euclidean_distances

counts = vectorizer.fit_transform(corpus).todense()
for x, y in [[0, 1], [0, 2], [1, 2]]:
    dist = euclidean_distances(counts[x], counts[y])
    print('Distance between documents {} and {}: {}'.format(x, y, dist))
```
```
Distance between documents 0 and 1: [[ 2.44948974]]
Distance between documents 0 and 2: [[ 2.64575131]]
Distance between documents 1 and 2: [[ 2.64575131]]
```
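The first of these distances can be checked by hand with nothing but the standard library. Here `a` and `b` are the first two rows of the count matrix for the three-document corpus:

```python
import math

# Rows 0 and 1 of the count matrix above (the first two documents).
a = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]
b = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]

# Euclidean distance: square root of the summed squared differences.
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(dist)  # → 2.449489742783178
```

The two documents differ in six positions, each by one count, so the distance is √6 ≈ 2.449.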

#### Stop-Word Filtering

```
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
```
```
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'game': 3, 'ate': 0, 'basketball': 1, 'sandwich': 6, 'duke': 2, 'lost': 4}
```

#### Stemming and Lemmatization

```
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]
vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
```
```
[[1 0 0 1]
 [0 1 1 0]]
{'sandwich': 2, 'sandwiches': 3, 'ate': 0, 'eaten': 1}
```

```
corpus = [
    'I am gathering ingredients for the sandwich.',
    'There were many wizards at the gathering.'
]
```

```
import nltk
nltk.download()
```
```
showing info http://www.nltk.org/nltk_data/
True
```

NLTK's `WordNetLemmatizer` uses the part of speech of `gathering` to determine its lemma:

```
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('gathering', 'v'))
print(lemmatizer.lemmatize('gathering', 'n'))
```
```
gather
gathering
```

```
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag

wordnet_tags = ['n', 'v']
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]
stemmer = PorterStemmer()
print('Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])
```
```
Stemmed: [['He', 'ate', 'the', 'sandwich'], ['Everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]
```
```
def lemmatize(token, tag):
    if tag[0].lower() in ['n', 'v']:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token

lemmatizer = WordNetLemmatizer()
tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print('Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])
```
```
Lemmatized: [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]
```

#### Extending Bag-of-Words with TF-IDF Weights

```
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
```
```
[[2 1 3 1 1]]
{'wizard': 4, 'transfigured': 3, 'sandwich': 2, 'dog': 1, 'ate': 0}
```

Here, f(t,d) is the frequency of term t in document d, and ‖x‖ is the L2 norm of the term-frequency vector. In addition, there are logarithmically scaled term frequencies, which compress the counts into a smaller range, and augmented term frequencies, which help cancel out differences in document length. The logarithmic formula is:

`tf(t,d) = log(f(t,d) + 1)`

To compute logarithmically scaled term frequencies with the `TfidfTransformer` class, set its `sublinear_tf` parameter to `True`. The augmented-frequency formula is:

`tf(t,d) = 0.5 + 0.5 * f(t,d) / max{f(w,d) : w ∈ d}`

Here max{f(w,d) : w ∈ d} is the largest term frequency in document d. scikit-learn has no ready-made option for augmented term frequencies, but it is easy to implement on top of the counts produced by `CountVectorizer`.
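A minimal plain-Python sketch of that transform, where the counts are assumed to come from `CountVectorizer` and `augmented_tf` is a hypothetical helper, not a scikit-learn API:

```python
def augmented_tf(counts):
    """tf(t, d) = 0.5 + 0.5 * f(t, d) / max f(w, d), computed row by row."""
    weighted = []
    for row in counts:
        peak = max(row)  # the document's most frequent term
        weighted.append([0.5 + 0.5 * f / peak for f in row])
    return weighted

# Counts for 'The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich'
print(augmented_tf([[2, 1, 3, 1, 1]]))
```

The most frequent term ('sandwich') gets weight 1.0, and every other term is scaled relative to it, so longer documents no longer dominate just by having larger raw counts.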

The `TfidfTransformer` class returns TF-IDF weights by default, since its `use_idf` parameter defaults to `True`. Because TF-IDF-weighted feature vectors are so commonly used to represent text, scikit-learn provides the `TfidfVectorizer` class, which wraps `CountVectorizer` and `TfidfTransformer` together.

```
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
```
```
[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]
{'wizard': 4, 'transfigured': 3, 'sandwich': 2, 'dog': 1, 'ate': 0}
```
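The entries in the first row can be verified by hand. With scikit-learn's defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is then L2-normalized. A standard-library sketch for document 0, whose counts after stop-word removal are ate=2, dog=1, sandwich=2 ('dog' appears in one of the two documents, 'sandwich' in both):

```python
import math

n = 2  # number of documents in the corpus

def idf(df):
    # Smoothed inverse document frequency, matching TfidfTransformer's defaults.
    return math.log((1 + n) / (1 + df)) + 1

# Raw tf-idf weights for document 0.
raw = [2 * idf(1),   # ate      (document frequency 1)
       1 * idf(1),   # dog      (document frequency 1)
       2 * idf(2)]   # sandwich (document frequency 2)

norm = math.sqrt(sum(w * w for w in raw))  # L2 norm of the row
print([w / norm for w in raw])             # ≈ [0.7546, 0.3773, 0.5369]
```

These match the first row above: 'sandwich' has the highest raw count but a lower weight than 'ate', because it appears in both documents and is therefore discounted by its idf.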

#### Feature Vectors with the Hashing Trick

```
from sklearn.feature_extraction.text import HashingVectorizer

corpus = ['the', 'ate', 'bacon', 'cat']
vectorizer = HashingVectorizer(n_features=6)
print(vectorizer.transform(corpus).todense())
```
```
[[-1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0.  0.]]
```
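The idea behind `HashingVectorizer` can be sketched in plain Python: each token is hashed straight to a column index, so no vocabulary ever needs to be stored. This toy version uses MD5 for a stable hash and one hash byte for the sign; scikit-learn itself uses MurmurHash3, so the actual columns it picks will differ:

```python
import hashlib

def hash_vector(tokens, n_features=6):
    """Toy signed hashing trick: token -> column via a hash, +/-1 via one hash byte."""
    vec = [0.0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).digest()
        column = digest[0] % n_features             # which column the token lands in
        sign = 1.0 if digest[1] % 2 == 0 else -1.0  # signed entries spread out collisions
        vec[column] += sign
    return vec

for doc in ['the', 'ate', 'bacon', 'cat']:
    print(hash_vector(doc.split()))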

### Extracting Features from Images

#### Extracting Features from Pixel Intensities

scikit-learn's `digits` dataset contains more than 1,700 images of the handwritten digits 0-9. Each image is 8×8 pixels, and each pixel takes a value from 0 to 16, where 0 is white and 16 is black, as shown below:

```
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt

digits = datasets.load_digits()
print('Digit:', digits.target[0])
print(digits.images[0])
plt.figure()
plt.axis('off')
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
```
```
Digit: 0
[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]
```

```
digits = datasets.load_digits()
print('Feature vector:\n', digits.images[0].reshape(-1, 64))
```
```
Feature vector:
 [[  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.
     5.   0.   0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.
     0.   8.   8.   0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.
    11.   0.   1.  12.   7.   0.   0.   2.  14.   5.  10.  12.   0.   0.
     0.   0.   6.  13.  10.   0.   0.   0.]]
```
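The reshape above is just row-major flattening: the eight rows of the image are concatenated into one 64-dimensional vector. Without NumPy, the same operation can be sketched with a hypothetical `flatten` helper:

```python
def flatten(image):
    """Concatenate the rows of a 2-D list into one feature vector (row-major order)."""
    return [pixel for row in image for pixel in row]

image = [[0, 1],
         [2, 3]]
print(flatten(image))  # → [0, 1, 2, 3]
```

An 8×8 image flattened this way yields exactly the 64 features shown above, in the same left-to-right, top-to-bottom order.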

#### Extracting Points of Interest as Features

```
%matplotlib inline
import numpy as np
from skimage.feature import corner_harris, corner_peaks
from skimage.color import rgb2gray
import matplotlib.pyplot as plt
import skimage.io as io
from skimage.exposure import equalize_hist

def show_corners(corners, image):
    fig = plt.figure()
    plt.gray()
    plt.imshow(image)
    y_corner, x_corner = zip(*corners)
    plt.plot(x_corner, y_corner, 'or')
    plt.xlim(0, image.shape[1])
    plt.ylim(image.shape[0], 0)
    fig.set_size_inches(np.array(fig.get_size_inches()) * 1.5)
    plt.show()

mandrill = io.imread('mlslpic/3.1 mandrill.png')
mandrill = equalize_hist(rgb2gray(mandrill))
corners = corner_peaks(corner_harris(mandrill), min_distance=2)
show_corners(corners, mandrill)
```

#### SIFT和SURF

```
import mahotas as mh
from mahotas.features import surf

image = mh.imread('mlslpic/3.2 xiaobao.png', as_grey=True)
descriptors = surf.surf(image)
print('The first SURF descriptor:\n{}\n'.format(descriptors[0]))
print('Extracted %s SURF descriptors' % len(descriptors))
```
```
The first SURF descriptor:
[  1.15299134e+02   2.56185453e+02   3.51230841e+00   3.32786485e+02
   1.00000000e+00   1.75644866e+00  -2.94268692e-03   3.30736379e-03
   2.94268692e-03   3.30736379e-03  -2.58778609e-02   3.25587066e-02
   2.58778609e-02   3.25587066e-02  -3.03768176e-02   4.18212640e-02
   3.03768176e-02   4.18212640e-02  -5.75169209e-03   7.66422266e-03
   5.75169209e-03   7.66422266e-03  -1.85200481e-02   3.10523761e-02
   1.85200481e-02   3.10523761e-02  -9.61023554e-02   2.59842816e-01
   1.12794174e-01   2.59842816e-01  -6.66368114e-02   2.72006376e-01
   1.40583321e-01   2.72006376e-01  -1.91014197e-02   5.28250599e-02
   2.03376276e-02   5.28250599e-02  -2.24247135e-02   3.35105185e-02
   2.24247135e-02   3.35105185e-02  -2.36547964e-01   3.18867366e-01
   2.36547964e-01   3.18867366e-01  -2.49737941e-01   3.00644512e-01
   2.50125503e-01   3.03596724e-01  -1.69936886e-02   3.82398567e-02
   2.00617910e-02   3.82398567e-02  -3.72417955e-03   5.53246035e-03
   3.72417955e-03   5.53246035e-03  -2.99748321e-02   4.76884368e-02
   2.99748321e-02   4.76884368e-02  -5.12157923e-02   7.42619311e-02
   5.12284796e-02   7.42619311e-02  -1.02035696e-02   1.19729640e-02
   1.02035696e-02   1.19729640e-02]
```

### Data Standardization

```
from sklearn import preprocessing
import numpy as np

X = np.array([
    [0., 0., 5., 13., 9., 1.],
    [0., 0., 13., 15., 10., 15.],
    [0., 3., 15., 2., 0., 11.]])
print(preprocessing.scale(X))
```
```
[[ 0.         -0.70710678 -1.38873015  0.52489066  0.59299945 -1.35873244]
 [ 0.         -0.70710678  0.46291005  0.87481777  0.81537425  1.01904933]
 [ 0.          1.41421356  0.9258201  -1.39970842 -1.4083737   0.33968311]]
```
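`preprocessing.scale` standardizes each column to zero mean and unit variance. The same computation can be sketched with the standard library; the `standardize` helper below is hypothetical, and, like scikit-learn, it uses the population standard deviation (dividing by n) and leaves constant columns at zero:

```python
import math

def standardize(X):
    """Column-wise (x - mean) / std, with std computed over all n samples."""
    cols = list(zip(*X))
    out_cols = []
    for col in cols:
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        if std == 0:
            out_cols.append([0.0] * len(col))  # constant column: nothing to scale
        else:
            out_cols.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*out_cols)]

X = [[0., 0., 5., 13., 9., 1.],
     [0., 0., 13., 15., 10., 15.],
     [0., 3., 15., 2., 0., 11.]]
print(standardize(X))
```

For example, the third column has values (5, 13, 15) with mean 11 and population standard deviation ≈ 4.32, so its first entry becomes (5 − 11) / 4.32 ≈ −1.389, matching the matrix above.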
