there are many data-mining problems which can be expressed as finding similar sets, such as:
pages with similar words, e.g., for classification by topic
recommendation systems, classification by people or movie
entity resolution, different informs all point to one person
essential parts
shingling
minhashing
jaccard similarity measure
matrix represent
minihash
总体步骤
例子
性质
签名的相似性
implementation
将对矩阵的转置操作用不同的hash函数代替。
LSH
LSH means locality-sensitive hashing.
概述
general idea: generate from the collection of all elements (signatures in our examples) a small list of candidates pairs : pairs of elements whose similarity must be evaluated.
for signature matrices: hush columns to many buckets, and make elements of the same bucket candidate pairs.