K均值如同智慧的图书管理员:
整个过程展现"动态平衡"的哲学,通过迭代逼近最优分组。
import java.util.*;
public class KMeans {
private static class Point {
double x, y;
Point(double x, double y) { this.x = x; this.y = y; }
}
public static Map<Integer, List<Point>> cluster(List<Point> points, int K, int maxIter) {
// 1. 初始化中心点(随机选择)
List<Point> centroids = new ArrayList<>();
Random rand = new Random();
for (int i=0; i<K; i++) {
centroids.add(points.get(rand.nextInt(points.size())));
}
Map<Integer, List<Point>> clusters;
for (int iter=0; iter<maxIter; iter++) {
// 2. 分配数据点
clusters = new HashMap<>();
for (Point p : points) {
int nearest = findNearestCentroid(p, centroids);
clusters.computeIfAbsent(nearest, k->new ArrayList<>()).add(p);
}
// 3. 更新中心点
List<Point> newCentroids = new ArrayList<>();
for (int i=0; i<K; i++) {
List<Point> cluster = clusters.get(i);
double sumX=0, sumY=0;
for (Point p : cluster) {
sumX += p.x;
sumY += p.y;
}
newCentroids.add(new Point(sumX/cluster.size(), sumY/cluster.size()));
}
if (converged(centroids, newCentroids)) break;
centroids = newCentroids;
}
return clusters;
}
private static int findNearestCentroid(Point p, List<Point> centroids) {
int minIndex = 0;
double minDist = Double.MAX_VALUE;
for (int i=0; i<centroids.size(); i++) {
double d = distance(p, centroids.get(i));
if (d < minDist) {
minDist = d;
minIndex = i;
}
}
return minIndex;
}
private static double distance(Point a, Point b) {
return Math.sqrt(Math.pow(a.x-b.x, 2) + Math.pow(a.y-b.y, 2));
}
public static void main(String[] args) {
List<Point> data = Arrays.asList(
new Point(1,1), new Point(1,2), new Point(2,1),
new Point(8,8), new Point(9,8), new Point(8,9)
);
Map<Integer, List<Point>> clusters = cluster(data, 2, 100);
clusters.forEach((k,v) -> System.out.println("Cluster "+k+": "+v.size()+" points"));
}
}
指标 | 数值 | 说明 |
---|---|---|
时间复杂度 | O(nKI*d) | n:数据量 K:簇数 I:迭代次数 d:维度 |
空间复杂度 | O(nd + Kd) | 存储数据点和中心点 |
算法特性:
典型案例:
新手必练:
// 三维点类扩展
class Point3D {
double x, y, z;
// 距离计算增加z维度
static double distance(Point3D a, Point3D b) {
return Math.sqrt(Math.pow(a.x-b.x,2) + Math.pow(a.y-b.y,2) + Math.pow(a.z-b.z,2));
}
}
高手进阶:
// K-means++初始化优化
List<Point> initCentroids(List<Point> points, int K) {
List<Point> centroids = new ArrayList<>();
centroids.add(points.get(rand.nextInt(points.size())));
for (int i=1; i<K; i++) {
double[] distances = new double[points.size()];
double sum = 0;
for (int j=0; j<points.size(); j++) {
double minDist = Double.MAX_VALUE;
for (Point c : centroids) {
minDist = Math.min(minDist, distance(points.get(j), c));
}
distances[j] = minDist;
sum += distances[j];
}
// 按距离概率选择下一个中心
double r = rand.nextDouble() * sum;
for (int j=0; j<distances.length; j++) {
r -= distances[j];
if (r <= 0) {
centroids.add(points.get(j));
break;
}
}
}
return centroids;
}
K均值教会我们:
当你能在TB级用户数据中发现隐藏的细分市场时,说明真正掌握了聚类的精髓——这不仅需要算法理解,更需要商业洞察力。记住:聚类不是终点,而是理解数据的第一步,真正的价值在于如何解释和应用这些发现的模式。