Many machine learning problems require learning from categorical variables, text, or images, so numeric features have to be extracted from them.
Categorical variables are usually handled with one-hot encoding, which produces binary indicator features. It expands the data, so it is less suitable when a feature takes many distinct values.
from sklearn.feature_extraction import DictVectorizer
onehot_encoder = DictVectorizer()
X = [
    {'city': 'Beijing'},
    {'city': 'Guangzhou'},
    {'city': 'Shanghai'}
]
print(onehot_encoder.fit_transform(X).toarray())
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
One-hot encoding imposes no ordering or magnitude on the categories; compared with representing the three cities above as 0, 1, 2, one-hot encoding is the better choice, as the sketch below illustrates.
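To make the contrast concrete, here is a small sketch using sklearn.preprocessing.OrdinalEncoder (not part of the original example) to show the 0/1/2 encoding that one-hot avoids:
from sklearn.preprocessing import OrdinalEncoder
cities = [['Beijing'], ['Guangzhou'], ['Shanghai']]   # `cities` is just an illustrative name
print(OrdinalEncoder().fit_transform(cities))
# [[0.]
#  [1.]
#  [2.]]  <- implies an artificial ordering Beijing < Guangzhou < Shanghai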
Note (from the scikit-learn documentation): this transformer will only do a binary one-hot encoding when feature values are of type string. If categorical features are represented as numeric values such as int, the DictVectorizer can be followed by sklearn.preprocessing.OneHotEncoder to complete binary one-hot encoding.
X = [
    {'city': 1},
    {'city': 4},
    {'city': 5}
]
onehot_encoder = DictVectorizer()
print(onehot_encoder.fit_transform(X).toarray())
[[1.]
[4.]
[5.]]
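A minimal sketch of the combination mentioned above (chaining the two transformers by hand; a Pipeline would work just as well):
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder
X = [{'city': 1}, {'city': 4}, {'city': 5}]
X_num = DictVectorizer().fit_transform(X)        # one numeric column, not yet one-hot
print(OneHotEncoder().fit_transform(X_num).toarray())
# expected: a 3x3 binary matrix, one column per distinct city code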
# string type
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
onehot_encoder = OneHotEncoder()
X = [
    {'city': 'Beijing'},
    {'city': 'Guangzhou'},
    {'city': 'Shanghai'}
]
X = pd.DataFrame(X)
print(onehot_encoder.fit_transform(X).toarray())
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
# numeric type
onehot_encoder = OneHotEncoder()
X = [
    {'city': 1},
    {'city': 4},
    {'city': 5}
]
X = pd.DataFrame(X)
print(onehot_encoder.fit_transform(X).toarray())
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
from sklearn import preprocessing
import numpy as np
X = np.array([
    [0., 0., 5., 13., 9., 1.],
    [0., 0., 13., 15., 10., 15.],
    [0., 3., 15., 2., 0., 11.]
])
s = preprocessing.StandardScaler()
print(s.fit_transform(X))
StandardScaler standardizes each feature (column) to zero mean and unit variance:
[[ 0. -0.70710678 -1.38873015 0.52489066 0.59299945 -1.35873244]
[ 0. -0.70710678 0.46291005 0.87481777 0.81537425 1.01904933]
[ 0. 1.41421356 0.9258201 -1.39970842 -1.4083737 0.33968311]]
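As a quick sanity check (a sketch added here, using NumPy directly): standardization subtracts each column's mean and divides by its population standard deviation (ddof=0, which is what StandardScaler uses); zero-variance columns are left unscaled.
import numpy as np
mean, std = X.mean(axis=0), X.std(axis=0)
std[std == 0] = 1.0                    # zero-variance columns (like the first) stay at 0
print((X - mean) / std)                # should match the StandardScaler output above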
RobustScaler is more robust to outliers and reduces their influence.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
from sklearn.preprocessing import RobustScaler
s = RobustScaler()
print(s.fit_transform(X))
[[ 0. 0. -1.6 0. 0. -1.42857143]
[ 0. 0. 0. 0.30769231 0.2 0.57142857]
[ 0. 2. 0.4 -1.69230769 -1.8 0. ]]
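Similarly, a hand-rolled sketch of what RobustScaler computes: subtract the column median and divide by the column IQR, leaving zero-IQR columns unscaled.
import numpy as np
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25
iqr[iqr == 0] = 1.0                    # columns with zero IQR are left unscaled
print((X - median) / iqr)              # should match the RobustScaler output above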
Text features: text is usually natural language; the simplest representation is the bag-of-words model built by CountVectorizer.
corpus = [
    "UNC played Duke in basketball",
    "Duke lost the basketball game"
]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
# [[1 1 0 1 0 1 0 1]
# [1 1 1 0 1 0 1 0]]
print(vectorizer.vocabulary_)
# {'unc': 7, 'played': 5, 'duke': 1, 'in': 3,
# 'basketball': 0, 'lost': 4, 'the': 6, 'game': 2}
Single-character tokens such as "I" and "a" are not vectorized (CountVectorizer's default token pattern only keeps words of two or more characters). Now append a third document and re-fit:
corpus.append("I ate a sandwich and an apple")
print(vectorizer.fit_transform(corpus).todense())
# [[0 0 0 0 1 1 0 1 0 1 0 0 1]
# [0 0 0 0 1 1 1 0 1 0 0 1 0]
# [1 1 1 1 0 0 0 0 0 0 1 0 0]]
print(vectorizer.vocabulary_)
# {'unc': 12, 'played': 9, 'duke': 5, 'in': 7,
# 'basketball': 4, 'lost': 8, 'the': 11, 'game': 6,
# 'ate': 3, 'sandwich': 10, 'and': 1, 'an': 0, 'apple': 2}
from sklearn.metrics.pairwise import euclidean_distances
X = vectorizer.fit_transform(corpus).todense()
print("distance between doc1 and doc2 ", euclidean_distances(X[0],X[1]))
print("distance between doc1 and doc3 ", euclidean_distances(X[0],X[2]))
print("distance between doc2 and doc3 ", euclidean_distances(X[1],X[2]))
# distance between doc1 and doc2 [[2.44948974]]
# distance between doc1 and doc3 [[3.16227766]]
# distance between doc2 and doc3 [[3.16227766]]
As we can see, document 1 is more similar to document 2 than to document 3. In real applications the vocabulary is very large, so dense vectors would need a lot of memory; sparse vectors are used to ease this problem, and dimensionality-reduction methods discussed later can lower the dimension of the vectors further.
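As a small illustration (a sketch, not in the original notes): without .todense(), fit_transform returns a SciPy sparse matrix that stores only the non-zero counts.
from sklearn.feature_extraction.text import CountVectorizer
sparse_X = CountVectorizer().fit_transform(corpus)   # sparse_X is just an illustrative name
print(type(sparse_X))    # a scipy.sparse compressed sparse row (CSR) matrix
print(sparse_X.nnz)      # number of non-zero entries actually stored
print(sparse_X)          # printed as (row, column)  count triples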
Dimensionality-reduction strategies:
Words such as the, a, an, do, be, will, on, around, and so on are called stop words. CountVectorizer can filter them out via the stop_words keyword argument, and it ships with a basic English stop-word list.
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
# [[0 0 1 1 0 0 1 0 1]
# [0 0 1 1 1 1 0 0 0]
# [1 1 0 0 0 0 0 1 0]]
print(vectorizer.vocabulary_)
# {'unc': 8, 'played': 6, 'duke': 3, 'basketball': 2,
# 'lost': 5, 'game': 4, 'ate': 1, 'sandwich': 7, 'apple': 0}
We can see that in, the, and, and an are gone.
The stop-word list covers only a small set of words, so the vocabulary is still large after filtering. What else can we do?
For example, jumping, jumps, and jump would each be encoded separately in an article about a long-jump competition; we can normalize them and compress them into a single feature.
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]
vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
# [[1 0 0 1]
#  [0 1 1 0]]
print(vectorizer.vocabulary_)
# {'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}
Notice that the two sentences express the same meaning, yet their feature vectors have no element in common. Lemmatization (NLTK's WordNetLemmatizer) and stemming (PorterStemmer) can normalize such inflected forms:
corpus = [
    'I am gathering ingredients for the sandwich.',
    'There were many peoples at the gathering.'
]
from nltk.stem.wordnet import WordNetLemmatizer
# help(WordNetLemmatizer)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('gathering', 'v'))  # gather (treated as a verb)
print(lemmatizer.lemmatize('gathering', 'n'))  # gathering (treated as a noun)
from nltk.stem import PorterStemmer
# help(PorterStemmer)
stemmer = PorterStemmer()
print(stemmer.stem('gathering')) # gather
A small end-to-end example:
from nltk import word_tokenize                   # tokenization
from nltk.stem import PorterStemmer              # stemming
from nltk.stem.wordnet import WordNetLemmatizer  # lemmatization
from nltk import pos_tag                         # part-of-speech tagging
wordnet_tags = ['n','v']
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]
stemmer = PorterStemmer()
print("词干:", [[stemmer.stem(word) for word in word_tokenize(doc)]
for doc in corpus])
# 词干: [['He', 'ate', 'the', 'sandwich'],
# ['everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]
def lemmatize(word, tag):
    # Lemmatize only nouns and verbs; return other words unchanged.
    if tag[0].lower() in wordnet_tags:
        return lemmatizer.lemmatize(word, tag[0].lower())
    return word
lemmatizer = WordNetLemmatizer()
tagged_corpus = [pos_tag(word_tokenize(doc)) for doc in corpus]
print(tagged_corpus)
# [[('He', 'PRP'), ('ate', 'VBD'), ('the', 'DT'), ('sandwiches', 'NNS')],
# [('Every', 'DT'), ('sandwich', 'NN'), ('was', 'VBD'),
# ('eaten', 'VBN'), ('by', 'IN'), ('him', 'PRP')]]
print('Lemmatized:', [[lemmatize(word, tag) for word, tag in doc] for doc in tagged_corpus])
# Lemmatized: [['He', 'eat', 'the', 'sandwich'],
#              ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]
Only words whose POS tags start with N or V are lemmatized; everything else is passed through unchanged.
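As a follow-up sketch (not part of the original notes), re-vectorizing the lemmatized sentences shows that the two documents now share features:
from sklearn.feature_extraction.text import CountVectorizer
lemmatized_corpus = [' '.join(lemmatize(word, tag) for word, tag in doc)
                     for doc in tagged_corpus]
vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(lemmatized_corpus).todense())
# both rows should now have a 1 in the shared 'eat' and 'sandwich' columns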
Word frequency also matters, so we can build feature vectors that encode word counts instead of binary indicators.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["The dog ate a sandwich, the people manufactured many sandwiches,\
and I ate a sandwich"]
vectorizer = CountVectorizer(stop_words='english')
freq = np.array(vectorizer.fit_transform(corpus).todense())
freq  # array([[2, 1, 1, 1, 2, 1]], dtype=int64)
vectorizer.vocabulary_
# {'dog': 1, 'ate': 0, 'sandwich': 4, 'people': 3, 'manufactured': 2, 'sandwiches': 5}
for word, idx in vectorizer.vocabulary_.items():
    print(word, "appears", freq[0][idx], "time(s)")
dog appears 1 time(s)
ate appears 2 time(s)
sandwich appears 2 time(s)
people appears 1 time(s)
manufactured appears 1 time(s)
sandwiches appears 1 time(s)
TfidfVectorizer weights each term by its term frequency-inverse document frequency (TF-IDF):
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["The dog ate a sandwich, and I ate a sandwich",
"the people manufactured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)
[[0.75458397 0.37729199 0. 0. 0.53689271]
[0. 0. 0.6316672 0.6316672 0.44943642]]
{'dog': 1, 'ate': 0, 'sandwich': 4, 'people': 3, 'manufactured': 2}
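To see where these numbers come from, here is a hedged verification sketch of scikit-learn's default TF-IDF formula (raw term counts, smooth idf = ln((1+n)/(1+df)) + 1, then l2 row normalization), computed by hand with NumPy; the counts matrix follows the vocabulary above (ate, dog, manufactured, people, sandwich):
import numpy as np
counts = np.array([[2, 1, 0, 0, 2],     # doc 1
                   [0, 0, 1, 1, 1]])    # doc 2
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                    # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1        # smooth idf
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)   # l2-normalize each row
print(tfidf)   # should reproduce the matrix printed above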
from sklearn.feature_extraction.text import HashingVectorizer
# help(HashingVectorizer)
corpus = [
    'This is the first document.',
    'This document is the second document.'
]
vectorizer = HashingVectorizer(n_features=2**4)
X = vectorizer.fit_transform(corpus).todense()
print(X)
x = vectorizer.transform(['This is the first document.']).todense()
print(x)
# HashingVectorizer is stateless: transforming the same document again
# reproduces the row that already appears in X.
x in X  # True (note: `in` on NumPy arrays is an element-wise check, not a row lookup)
[[-0.57735027 0. 0. 0. 0. 0.
0. 0. -0.57735027 0. 0. 0.
0. 0.57735027 0. 0. ]
[-0.81649658 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.40824829
0. 0.40824829 0. 0. ]]
[[-0.57735027 0. 0. 0. 0. 0.
0. 0. -0.57735027 0. 0. 0.
0. 0.57735027 0. 0. ]]
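HashingVectorizer relies on the hashing trick: tokens are mapped to a fixed number of columns by a hash function, so no vocabulary has to be stored in memory (which also addresses the memory concern mentioned earlier). A one-line sketch of that difference:
print(hasattr(vectorizer, 'vocabulary_'))   # False -- no vocabulary is kept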
Word-vector (word embedding) models generally work better than the bag-of-words model.
A word-vector model produces similar vectors for words with similar meanings (for example, small and tiny both mean small), while the vectors of antonyms are similar in only a few dimensions.
# Run the following code in Google Colab
import gensim
from google.colab import drive
drive.mount('/gdrive')
# !git clone https://github.com/mmihaltz/word2vec-GoogleNews-vectors.git
! wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
!cd /content
!gzip -d /content/GoogleNews-vectors-negative300.bin.gz
model = gensim.models.KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin', binary=True)
embedding = model.word_vec('cat')  # in gensim 4+, use model['cat'] or model.get_vector('cat')
embedding.shape # (300,)
Similarity:
print(model.similarity('cat','dog')) # 0.76094574
print(model.similarity('cat','sandwich')) # 0.17211203
The n most similar words:
print(model.most_similar(positive=['good','ok'],negative=['bad'],topn=3))
# [('okay', 0.7390689849853516),
# ('alright', 0.7239435911178589),
# ('OK', 0.5975555777549744)]
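A classic analogy query also works with these vectors (a sketch; exact scores depend on the model):
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# with the GoogleNews vectors, 'queen' is expected to rank first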
Image features: flatten the image's pixel matrix into a feature vector.
from sklearn import datasets
digits = datasets.load_digits()
print(digits.images[0].reshape(-1,64))
The image's feature vector:
[[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0. 0. 3.
15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5. 8. 0.
0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 10. 12.
0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]]
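As a quick check (a sketch, not in the original notes): scikit-learn already stores the flattened version in digits.data, so reshaping an 8x8 image gives the same 64-element vector.
import numpy as np
print(digits.images.shape, digits.data.shape)   # (1797, 8, 8) (1797, 64)
print(np.array_equal(digits.images[0].ravel(), digits.data[0]))   # True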
I don't fully understand the rest of this topic yet, so I'll skip it for now.