apple.Turicreate已经是第四篇了。本模块主要阐述该平台相似模块的一些功能。
也是目前求相似解决方案很赞的一个。
官方地址:https://apple.github.io/turicreate/docs/api/turicreate.toolkits.nearest_neighbors.html
地址二:https://apple.github.io/turicreate/docs/userguide/nearest_neighbors/nearest_neighbors.html
输入的主要方式为SFrame,主打两个模块:
import turicreate
references = turicreate.SFrame({'x1': [0.98, 0.62, 0.11],
'x2': [0.69, 0.58, 0.36]})
# 建模-最近邻
model = turicreate.nearest_neighbors.create(references)
# 建模-近邻图谱
sim_graph = model.similarity_graph(k=1)
该案例涵盖了两大主要模块,查询与近邻图谱。
.
turicreate.nearest_neighbors.create(dataset, label=None, features=None, distance=None, method='auto', verbose=True, **kwargs)
leaf_size,针对ball tree method,树叶节点数量
num_tables,针对LSH,构造哈希表的数量,默认是20,建议在10-30
num_projections_per_table,针对LSH,每个散列表的投影/散列函数的数量
.
model.query
NearestNeighborsModel.save
NearestNeighborsModel.similarity_graph
NearestNeighborsModel.summary
.
大体有以下几种:LSH、auto、brute_force、ball_tree
.
NearestNeighborsModel.query(dataset, label=None, k=5, radius=None, verbose=True)
dataset:数据集
label:
k:返回前K个数据
radius:半径,半径小于radius才能返回
verbose:
.
sf = turicreate.SFrame({'label': range(3),
'feature1': [0.98, 0.62, 0.11],
'feature2': [0.69, 0.58, 0.36]})
model = turicreate.nearest_neighbors.create(sf, 'label')
queries = turicreate.SFrame({'label': ['a','b','c'],
'feature1': [0.05, 0.61, 0.99],
'feature2': [0.06, 0.97, 0.86]})
model.query(queries, 'label', k=2)
sf是数据集,其中看到了label可以代表三个个案例的名称;
create(sf, ‘label’),以label为三个案例的名称进行初始化;
queries为查询的数据集。
.
这个模块在第二篇提到过:《极简主义︱利用apple机器学习平台Turicreate实现图像相似性检索(二)》
similarity_graph(k=5, radius=None, include_self_edges=False, output_type='SGraph', verbose=True)
k:最大返回样本数
radius:float,半径范围,如果超过radius个间隔,就不显示
include_self_edges:是否包含自己,True的话,返回的结果包含自己(自己的相似性为1)
output_type:‘SGraph’, ‘SFrame’两种
verbose:打印进度更新和模型细节。
常规例子:
sf = turicreate.SFrame({'label': range(3),
'feature1': [0.98, 0.62, 0.11],
'feature2': [0.69, 0.58, 0.36]})
model = turicreate.nearest_neighbors.create(sf, 'label')
sim_graph = model.similarity_graph(k=3)
sim_graph.edges
.
apple.Turicreate中有很多距离可以直接用:
# euclidean
tc.distances.euclidean([1, 2, 3], [4, 5, 6])
tc.distances.euclidean({'a': 2, 'c': 4}, {'b': 3, 'c': 12})
# squared_euclidean
tc.distances.squared_euclidean([1, 2, 3], [4, 5, 6])
tc.distances.squared_euclidean({'a': 2, 'c': 4},
{'b': 3, 'c': 12})
# manhattan
tc.distances.manhattan([1, 2, 3], [4, 5, 6])
tc.distances.manhattan({'a': 2, 'c': 4}, {'b': 3, 'c': 12})
# cosine
tc.distances.cosine([1, 2, 3], [4, 5, 6])
tc.distances.cosine({'a': 2, 'c': 4}, {'b': 3, 'c': 12})
# levenshtein 字符距离
tc.distances.levenshtein("fossa", "fossil")
# dot_product
tc.distances.dot_product([1, 2, 3], [4, 5, 6])
tc.distances.dot_product({'a': 2, 'c': 4}, {'b': 3, 'c': 12})
# transformed_dot_product
tc.distances.transformed_dot_product([1, 2, 3], [4, 5, 6])
tc.distances.transformed_dot_product({'a': 2, 'c': 4}, {'b': 3, 'c': 12})
# jaccard
tc.distances.jaccard({'a': 2, 'c': 4}, {'b': 3, 'c': 12})
# weighted_jaccard
tc.distances.weighted_jaccard({'a': 2, 'c': 4},
{'b': 3, 'c': 12})
# 复合距离
>>> sf = turicreate.SFrame({'X1': [0.98, 0.62, 0.11],
... 'X2': [0.69, 0.58, 0.36],
... 'species': ['cat', 'dog', 'fossa']})
...
>>> dist_spec = [[('X1', 'X2'), 'euclidean', 2],
... [('species',), 'levenshtein', 0.4]]
...
>>> d = turicreate.distances.compute_composite_distance(dist_spec, sf[0], sf[1])
>>> print d
1.95286120899
# turicreate.toolkits.distances.build_address_distance
>>> homes = turicreate.SFrame({'sqft': [1230, 875, 1745],
... 'street': ['phinney', 'fairview', 'cottage'],
... 'city': ['seattle', 'olympia', 'boston'],
... 'state': ['WA', 'WA', 'MA']})
...
>>> my_dist = turicreate.distances.build_address_distance(street='street',
... city='city',
... state='state')
>>> my_dist
[[['street'], 'jaccard', 5],
[['state'], 'jaccard', 5],
[['city'], 'levenshtein', 1]]
.
# manhattan距离
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'],
distance='manhattan')
model.summary()
此时用的create中模块的距离distance中的曼哈顿距离。
# tc自带的tc.distances.manhattan距离
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'],
distance=tc.distances.manhattan)
knn = model.query(sf[:3], k=3)
knn.print_rows()
另外一种是tc单独的距离模块:tc.distances.manhattan
tc单独的距离模块的一些案例:
sf_check = sf[['bedroom', 'bath', 'size']]
print "distance check 1:", tc.distances.manhattan(sf_check[2], sf_check[10])
print "distance check 2:", tc.distances.manhattan(sf_check[2], sf_check[14])
.
案例一:
my_dist = [[['bedroom', 'bath'], 'manhattan', 1],
[['size', 'lot'], 'euclidean', 2]]
model = tc.nearest_neighbors.create(sf, distance=my_dist)
model.summary()
# 分开求距离
案例二:
>>> my_dist = [[['X1', 'X2'], 'euclidean', 2.],
... [['str_feature'], 'levenshtein', 3.]]
...
>>> model = turicreate.nearest_neighbors.create(sf, distance=my_dist)
个性化的判定,某些指标之间选择特定的距离函数。
.
大体有以下三种:
笔者对LSH很感兴趣。
model = tc.nearest_neighbors.create(sf)
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'],method = 'lsh')
model.summary()
method 唯一改变的就是这里,当然还可以选择几个特定的参数:
num_tables/num_projections_per_table
.
数据准备并标准化
import turicreate as tc
url = 'https://static.turi.com/datasets/regression/houses.csv'
sf = tc.SFrame.read_csv(url)
sf.head(5)
# 数据标准化
for c in sf.column_names():
sf[c] = (sf[c] - sf[c].mean()) / sf[c].std()
数据准备好之后,进行建模:
#建模
model = tc.nearest_neighbors.create(sf)
model = tc.nearest_neighbors.create(sf, features=['bedroom', 'bath', 'size'])
model.summary()
模型建立好之后就可以进行两个主要模块的调用:
knn = model.query(sf[:5], k=5)
knn.head()
这边数据操作的时候有个小问题,建模的时候是用
‘bedroom’, ‘bath’, ‘size’这三个指标,
但是查询的时候,即使给了超过三个(’bedroom’, ‘bath’, ‘size’)指标,计算的时候,也只是这三个。
这个比较好,但是指标名称一定要对应起来。
sim_graph = model.similarity_graph(k=3)
sim_graph.edges