This library is a Ruby vector space model (VSM) with tf*idf weights. It can calculate the similarity between texts using tf*idf.
GitHub:
https://github.com/jpmckinney/tf-idf-similarity
Usage
require 'matrix'
require 'tf-idf-similarity'
Create a set of documents:
document1 = TfIdfSimilarity::Document.new("Lorem ipsum dolor sit amet...")
document2 = TfIdfSimilarity::Document.new("Pellentesque sed ipsum dui...")
document3 = TfIdfSimilarity::Document.new("Nam scelerisque dui sed leo...")
corpus = [document1, document2, document3]
Create a document-term matrix using the term frequency-inverse document frequency (tf*idf) function:
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
model = TfIdfSimilarity::TfIdfModel.new(corpus)
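To make the weighting concrete, here is a minimal, dependency-free sketch of textbook tf*idf. It is not the gem's implementation, and the gem's exact formula may differ. The method name `tf_idf_weights` and the choices tf = count / document length and idf = log(N / df) are assumptions made for illustration.

```ruby
# Textbook tf*idf: tf = count / document length, idf = log(N / df).
# Illustrative only; the gem's own weighting may differ.
def tf_idf_weights(tokenized_docs)
  total_docs = tokenized_docs.size.to_f

  # Document frequency: how many documents contain each term.
  document_frequency = Hash.new(0)
  tokenized_docs.each do |tokens|
    tokens.uniq.each { |term| document_frequency[term] += 1 }
  end

  # One hash of term => weight per document.
  tokenized_docs.map do |tokens|
    counts = Hash.new(0)
    tokens.each { |term| counts[term] += 1 }
    counts.map do |term, count|
      tf = count.to_f / tokens.size
      idf = Math.log(total_docs / document_frequency[term])
      [term, tf * idf]
    end.to_h
  end
end

docs = ["lorem ipsum dolor", "ipsum dui sed", "dui sed leo"].map(&:split)
weights = tf_idf_weights(docs)
p weights.first.sort_by { |_, weight| -weight }
```

With this definition, a term that appears in every document gets a weight of zero, which is why rare terms dominate the similarity score.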
Or, create a document-term matrix using the Okapi BM25 ranking function:
https://en.wikipedia.org/wiki/Okapi_BM25
model = TfIdfSimilarity::BM25Model.new(corpus)
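For comparison, the following is a rough sketch of the textbook Okapi BM25 term weight in plain Ruby. The method name `bm25_weight`, the smoothed idf variant, and the parameters `k1: 1.2` and `b: 0.75` are assumptions (conventional defaults), not values confirmed from the gem.

```ruby
# Okapi BM25 weight of one term in one document (textbook form).
# k1 and b are conventional defaults, used here as an assumption.
def bm25_weight(term, tokens, tokenized_docs, k1: 1.2, b: 0.75)
  total_docs = tokenized_docs.size
  docs_with_term = tokenized_docs.count { |doc| doc.include?(term) }
  average_length = tokenized_docs.sum(&:size).to_f / total_docs

  # Smoothed idf; the "+ 1" inside the log keeps it non-negative.
  idf = Math.log((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)

  # Term frequency saturates as it grows and is normalized by document length.
  frequency = tokens.count(term)
  length_norm = 1 - b + b * tokens.size / average_length
  idf * frequency * (k1 + 1) / (frequency + k1 * length_norm)
end

docs = ["lorem ipsum dolor", "ipsum dui sed", "dui sed leo"].map(&:split)
puts bm25_weight("lorem", docs[0], docs)
```

Unlike plain tf*idf, repeating a term many times yields diminishing returns here, and long documents are penalized relative to the average length.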
Create a similarity matrix:
matrix = model.similarity_matrix
Find the similarity of two documents in the matrix:
matrix[model.document_index(document1), model.document_index(document2)]
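In a vector space model, the similarity of two documents is commonly the cosine of the angle between their term-weight vectors. The sketch below shows that idea with Ruby's standard `matrix` library. The method name `cosine_similarity_matrix` is hypothetical, and this is an illustration of the concept rather than the gem's own code.

```ruby
require 'matrix'

# Cosine similarity between every pair of rows.
# Each row is one document; each column is one term weight.
# A row of all zeros has no direction, so its similarities come out as NaN.
def cosine_similarity_matrix(rows)
  vectors = rows.map { |row| Vector[*row] }
  Matrix.build(vectors.size, vectors.size) do |i, j|
    vectors[i].inner_product(vectors[j]) / (vectors[i].norm * vectors[j].norm)
  end
end

similarities = cosine_similarity_matrix([
  [1.0, 0.0, 2.0],
  [0.0, 3.0, 0.0],
  [2.0, 0.0, 4.0]
])
puts similarities[0, 1]
puts similarities[0, 2]
```

The result is symmetric, with values near 1 for documents whose weights point in the same direction and 0 for documents that share no terms.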
Print the tf*idf values for terms in a document:
tfidf_by_term = {}
document1.terms.each do |term|
  tfidf_by_term[term] = model.tfidf(document1, term)
end
puts tfidf_by_term.sort_by{|_,tfidf| -tfidf}
Tokenize a document yourself, for example by excluding stop words:
require 'unicode_utils'
text = "Lorem ipsum dolor sit amet..."
tokens = UnicodeUtils.each_word(text).to_a - ['and', 'the', 'to']
document1 = TfIdfSimilarity::Document.new(text, :tokens => tokens)
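If you would rather not depend on `unicode_utils`, a tokenizer can be sketched with core Ruby alone. The helper name `tokenize` and the decision to lowercase before filtering are assumptions made for this example. Its word boundaries will not match `UnicodeUtils.each_word` exactly.

```ruby
# Hypothetical helper: lowercase the text, keep runs of letters,
# and drop stop words. Digits and punctuation are discarded.
def tokenize(text, stop_words = %w[and the to])
  text.downcase.scan(/\p{L}+/) - stop_words
end

tokens = tokenize("The quick fox and the lazy dog went to sleep...")
p tokens
```

The resulting array can be passed as the `:tokens` option in the same way as the tokens above.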
Provide, by yourself, the number of times each term occurs and the number of tokens in the document:
require 'unicode_utils'
text = "Lorem ipsum dolor sit amet..."
tokens = UnicodeUtils.each_word(text).to_a - ['and', 'the', 'to']
term_counts = Hash.new(0)
size = 0
tokens.each do |token|
  # Unless the token is numeric.
  unless token[/\A\d+\z/]
    # Remove all punctuation from tokens.
    term_counts[token.gsub(/\p{Punct}/, '')] += 1
    size += 1
  end
end
document1 = TfIdfSimilarity::Document.new(text, :term_counts => term_counts, :size => size)
For details, see the documentation:
https://www.rubydoc.info/gems/tf-idf-similarity