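I am trying to run k-means clustering on time series data with tslearn. I cluster 110 times, once for each of 110 different files. Sample data for one file is attached below, after applying x = np.squeeze(to_time_series_dataset(x)) to the raw data x. I also tried the data without squeezing it, but for some videos the error ValueError: x and y arrays must have at least 2 entries still pops up.
As far as I understand, I suspect this happens because in some files a series has only one value that is not nan, for example [1, nan, nan, nan]. If that is the case, I can't really replace the nans with actual values, because in my data -1 means "no", 0 means "not sure", and 1 means "yes". That is also why I did not normalize the data: it is already in the range -1 to 1.
Any suggestions? Thanks in advance.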
[[ 0. 1. -1. nan]
[-1. 1. 0. -1.]
[ 0. -1. nan nan]
[ 0. 0. -1. nan]
[ 0. 1. 0. -1.]
[ 0. -1. nan nan]
[ 0. -1. -1. nan]
[ 0. 0. -1. nan]
[ 0. -1. nan nan]
[ 0. -1. nan nan]
[ 0. 0. -1. nan]
[-1. -1. nan nan]
[ 1. 1. -1. nan]
[ 1. -1. nan nan]
[ 0. -1. nan nan]
[ 1. -1. nan nan]
[ 0. -1. -1. nan]
[ 0. -1. nan nan]
[ 1. -1. nan nan]
[ 0. 0. -1. nan]
[ 0. -1. -1. nan]
[ 0. 1. -1. nan]
[ 0. 0. -1. nan]
[ 1. -1. nan nan]]
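One way to test the suspicion above is to count the non-nan values in each series before clustering. The sketch below is a minimal, dependency-free illustration; the helper names `count_valid` and `short_series` are hypothetical, and the threshold of 2 is an assumption taken from the wording of the error message, not from tslearn's documentation.

```python
import math


def count_valid(series):
    """Return the number of non-NaN observations in one padded series."""
    return sum(1 for v in series if not math.isnan(v))


def short_series(dataset, min_len=2):
    """Return the indices of series with fewer than min_len non-NaN values."""
    return [i for i, s in enumerate(dataset) if count_valid(s) < min_len]


nan = float("nan")
# First three rows of the sample above, plus the hypothetical
# [1, nan, nan, nan] case mentioned in the question.
x_demo = [
    [0.0, 1.0, -1.0, nan],
    [-1.0, 1.0, 0.0, -1.0],
    [0.0, -1.0, nan, nan],
    [1.0, nan, nan, nan],
]

bad = short_series(x_demo)
print(bad)  # [3]
```

On the squeezed NumPy array, (~np.isnan(x)).sum(axis=1) gives the same per-series counts. If the files that raise the ValueError are exactly the ones with flagged series, those series could be dropped or handled separately before calling fit_predict; that this is the real cause remains an assumption until checked.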
Without squeezing, the data looks like this:
[[[ 0.]
[ 1.]
[-1.]
[nan]]
[[-1.]
[ 1.]
[ 0.]
[-1.]]
[[ 0.]
[-1.]
[nan]
[nan]]
[[ 0.]
[ 0.]
[-1.]
[nan]]
[[ 0.]
[ 1.]
[ 0.]
[-1.]]
[[ 0.]
[-1.]
[nan]
[nan]]
[[ 0.]
[-1.]
[-1.]
[nan]]
[[ 0.]
[ 0.]
[-1.]
[nan]]
[[ 0.]
[-1.]
[nan]
[nan]]
[[ 0.]
[-1.]
[nan]
[nan]]
[[ 0.]
[ 0.]
[-1.]
[nan]]
[[-1.]
[-1.]
[nan]
[nan]]
[[ 1.]
[ 1.]
[-1.]
[nan]]
[[ 1.]
[-1.]
[nan]
[nan]]
[[ 0.]
[-1.]
[nan]
[nan]]
[[ 1.]
[-1.]
[nan]
[nan]]
[[ 0.]
[-1.]
[-1.]
[nan]]
[[ 0.]
[-1.]
[nan]
[nan]]
[[ 1.]
[-1.]
[nan]
[nan]]
[[ 0.]
[ 0.]
[-1.]
[nan]]
[[ 0.]
[-1.]
[-1.]
[nan]]
[[ 0.]
[ 1.]
[-1.]
[nan]]
[[ 0.]
[ 0.]
[-1.]
[nan]]
[[ 1.]
[-1.]
[nan]
[nan]]]
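I call the code below to do the actual clustering. Since I am not sure how many clusters are best for each file, I try 2, 3, or 4 clusters and evaluate their silhouette scores.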
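from tslearn.clustering import TimeSeriesKMeans, silhouette_score

# Assumed to be initialised once per file, before the loop
num_of_clusters_list = []
silhouetteScore_list = []
for j in [2, 3, 4]:
    # Cluster with DTW, then score the labelling with the same metric
    km = TimeSeriesKMeans(n_clusters=j, metric="dtw")
    labels = km.fit_predict(x)
    silhouetteScore = silhouette_score(x, labels, metric="dtw")
    num_of_clusters_list.append(j)
    silhouetteScore_list.append(silhouetteScore)
    print(f"{j} clusters, score is {silhouetteScore}")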
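Once the loop has filled the two lists, picking the cluster count with the highest silhouette score can be sketched as follows. The helper `best_k` is a hypothetical name, and the scores are made-up values for illustration only, not results from the data above.

```python
def best_k(ks, scores):
    """Return the (k, score) pair with the highest silhouette score."""
    if not ks or len(ks) != len(scores):
        raise ValueError("ks and scores must be non-empty and the same length")
    # Index of the largest score; ties resolve to the smallest index
    i = max(range(len(scores)), key=scores.__getitem__)
    return ks[i], scores[i]


# Hypothetical values standing in for num_of_clusters_list and
# silhouetteScore_list after one file has been processed.
ks = [2, 3, 4]
scores = [0.41, 0.55, 0.38]

print(best_k(ks, scores))  # (3, 0.55)
```

A higher silhouette score means tighter, better-separated clusters, so taking the maximum is the usual reading; with only three candidate values of k the difference between scores may be small, so it is worth looking at the printed values as well.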
Posted on 2022-10-07 19:59:51
You can use KneeLocator (or ElbowLocator) to find the optimal K.
https://stackoverflow.com/questions/72518907