The TF-IDF value for "good movie" comes out as 0.707107, but I think it should be 1/1 * ln(5/2) = 0.91629.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = [
    "good movie", "not a good movie", "did not like",
    "i like it", "good one"
]
# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=tfidf.get_feature_names_out()
)
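Posted on 2019-05-22 03:16:22
This is because of the norm and smooth_idf parameters. Both are active by default (norm='l2', smooth_idf=True). With both turned off: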
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
texts = [
    "good movie", "not a good movie", "did not like",
    "i like it", "good one"
]
# using default tokenizer in TfidfVectorizer
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, norm=None, smooth_idf=False, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.toarray(),
    # get_feature_names() was removed in scikit-learn 1.2
    columns=tfidf.get_feature_names_out()
)
Output:
   good movie      like     movie       not
0    1.916291  0.000000  1.916291  0.000000
1    1.916291  0.000000  1.916291  1.916291
2    0.000000  1.916291  0.000000  1.916291
3    0.000000  1.916291  0.000000  0.000000
4    0.000000  0.000000  0.000000  0.000000
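With smooth_idf=False, sklearn computes idf as ln[ n / df(t) ] + 1, where n is the number of documents and df(t) is the number of documents containing term t. So the 0.91629 you calculated has 1 added to it, giving 1.916291.
With smooth_idf=True (the default), the formula becomes ln[ (1 + n) / (1 + df(t)) ] + 1.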
tfidf = TfidfVectorizer(min_df=2, max_df=0.5,norm=None,smooth_idf=True, ngram_range=(1, 2))
gives the output:
   good movie      like     movie       not
0    1.693147  0.000000  1.693147  0.000000
1    1.693147  0.000000  1.693147  1.693147
2    0.000000  1.693147  0.000000  1.693147
3    0.000000  1.693147  0.000000  0.000000
4    0.000000  0.000000  0.000000  0.000000
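The two idf values in the tables above can be checked by hand. The following is only a small sketch using the standard library, not sklearn's own code; the variable names are made up for illustration. It assumes n = 5 documents and df = 2 for each surviving feature, which is what the example texts give.

```python
import math

n = 5   # number of documents in texts
df = 2  # documents containing "good movie" (same count for "like", "movie", "not")

# idf when smooth_idf=False: ln(n / df) + 1
idf_plain = math.log(n / df) + 1

# idf when smooth_idf=True: ln((1 + n) / (1 + df)) + 1
idf_smooth = math.log((1 + n) / (1 + df)) + 1

print(round(idf_plain, 6))   # 1.916291
print(round(idf_smooth, 6))  # 1.693147
```

Each term occurs once per document here, so the term frequency is 1 and the unnormalized TF-IDF equals the idf.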
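So where does 0.707107 come from?
In the first row the value 1.693147 appears twice (call it a), so the L2 norm is sqrt(a^2 + a^2) = sqrt(1.693147^2 + 1.693147^2) = sqrt(5.73349), which is about 2.39447. Dividing 1.693147 by 2.39447 gives approximately 0.707107. This is just 1/sqrt(2): a row with two equal nonzero entries normalizes to that value whatever a is.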
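The normalization step can be reproduced in a few lines. This is a sketch of plain L2 row normalization written with the standard library, assuming the first row of the smooth_idf=True table as input; it is not how sklearn implements it internally.

```python
import math

# First row of the smooth_idf=True, norm=None output:
# columns are "good movie", "like", "movie", "not"
row = [1.693147, 0.0, 1.693147, 0.0]

# L2 norm of the row: sqrt(a^2 + a^2) for the two nonzero entries
l2 = math.sqrt(sum(v * v for v in row))

# Divide every entry by the norm, as norm='l2' does per row
normalized = [v / l2 for v in row]

print(round(normalized[0], 6))  # 0.707107
```

The zero entries stay zero, and the two nonzero entries both become 0.707107, matching the value in the question.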
https://stackoverflow.com/questions/56243414