I am trying to understand this code for canopy clustering. The purpose of these two classes (one mapper, one reducer) is to find the canopy centers. My problem is that I don't understand the difference between the map and the reduce functions. They are nearly identical.
So is there a difference, or am I just repeating the same process over again in the reducer?
I think the answer is that the mapper and the reducer handle the data differently, performing different operations even though the code looks similar.
So could someone explain the map process and the reduce process when we are trying to find the canopy centers?
For example, I know the map output might look something like this -- (joe, 1) (dave, 1) (joe, 1) (joe, 1)
And then the reduce output would be something like this -- (joe, 3) (dave, 1)
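In plain (non-Hadoop) Java, the division of labor in that word-count example can be sketched like this; the class and method names here are made up purely for illustration:

```java
import java.util.*;

public class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every word, with no aggregation
    static List<Map.Entry<String, Integer>> map(String[] words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return pairs;
    }

    // "reduce" phase: group the emitted pairs by key and sum the values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted =
                map(new String[]{"joe", "dave", "joe", "joe"});
        System.out.println(emitted);          // [joe=1, dave=1, joe=1, joe=1]
        System.out.println(reduce(emitted));  // {dave=1, joe=3}
    }
}
```

The key point is that the map step only emits pairs, while the reduce step sees all pairs sharing a key and aggregates them; the two phases are not the same code run twice.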
Does the same sort of thing happen here?
Or am I performing the same task twice?
Thanks a lot.
The map function:
package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;

public class CanopyCentersMapper extends Mapper<LongWritable, Text, Text, Text> {
    //A list with the centers of the canopy
    private ArrayList<ArrayList<String>> canopyCenters;

    @Override
    public void setup(Context context) {
        this.canopyCenters = new ArrayList<ArrayList<String>>();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //Separate the stock name from the values to create a key of the stock and a list of values - what is the list of values?
        //What exactly are we splitting here?
        ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(",")));

        //Remove the stock key and make the first canopy center around it
        String stockKey = stockData.remove(0);

        //?
        String stockValue = StringUtils.join(",", stockData);

        //Check whether the stock is available for use as a new canopy center
        boolean isClose = false;
        for (ArrayList<String> center : canopyCenters) { //Run over the centers
            //I think... let's say at this point we have a few centers. Then we have our next point to check.
            //We have to compare that point with EVERY center already created. Only if its distance to EVERY center is larger than T1
            //does that point become a new center! But the more canopies we have, the better the chance it falls within
            //the radius of one of the canopies...

            //Measure the distance between the current point and the center being checked
            if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                //Center is too close
                isClose = true;
                break;
            }
        }

        //The point is not within the small radius of any center, so add it as a new canopy center
        if (!isClose) {
            //Center is not too close, add the current data as a new center
            canopyCenters.add(stockData);

            //Prepare Hadoop data for output
            Text outputKey = new Text();
            Text outputValue = new Text();
            outputKey.set(stockKey);
            outputValue.set(stockValue);

            //Output the stock key and values to the reducer
            context.write(outputKey, outputValue);
        }
    }
}
The reduce function:
package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class CanopyCentersReducer extends Reducer<Text, Text, Text, Text> {
    //The canopy centers list
    private ArrayList<ArrayList<String>> canopyCenters;

    @Override
    public void setup(Context context) {
        //Create a new list for the canopy centers
        this.canopyCenters = new ArrayList<ArrayList<String>>();
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            //Convert the key and value back into the expected format
            String stockValue = value.toString();
            ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(stockValue.split(",")));
            String stockKey = key.toString();

            //Check whether the stock is available for use as a new canopy center
            boolean isClose = false;
            for (ArrayList<String> center : canopyCenters) { //Run over the centers
                //Measure the distance between the current point and the center being checked
                if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                    //Center is too close
                    isClose = true;
                    break;
                }
            }

            //The point is not within the small radius of any center, so add it as a new canopy center
            if (!isClose) {
                //Center is not too close, add the current data as a new center
                canopyCenters.add(stockData);

                //Prepare Hadoop data for output
                Text outputKey = new Text();
                Text outputValue = new Text();
                outputKey.set(stockKey);
                outputValue.set(stockValue);

                //Output the stock key and values
                context.write(outputKey, outputValue);
            }
        }
    }
}
**EDIT** -- more code and explanation:
stockKey is the key value representing a stock (NASDAQ symbols and the like).
ClusterJob.measureDistance():
public static double measureDistance(ArrayList<String> origin, ArrayList<String> destination)
{
    double deltaSum = 0.0;
    //Run over all points in the origin vector and calculate the sum of the squared deltas
    for (int i = 0; i < origin.size(); i++) {
        if (destination.size() > i) //Only add to the sum if there is a destination to compare to
        {
            deltaSum = deltaSum + Math.pow(Math.abs(Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i))), 2);
        }
    }
    //Return the square root of the sum
    return Math.sqrt(deltaSum);
}
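For intuition, measureDistance is just the Euclidean (L2) distance over the shared numeric fields. A standalone sanity check, reimplemented here without the ClusterJob class so it can run on its own, might look like:

```java
import java.util.*;

public class DistanceCheck {
    // Same logic as ClusterJob.measureDistance: sum of squared deltas, then sqrt
    static double measureDistance(ArrayList<String> origin, ArrayList<String> destination) {
        double deltaSum = 0.0;
        for (int i = 0; i < origin.size(); i++) {
            if (destination.size() > i) { // only compare dimensions both vectors have
                double delta = Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i));
                deltaSum += delta * delta;
            }
        }
        return Math.sqrt(deltaSum);
    }

    public static void main(String[] args) {
        ArrayList<String> a = new ArrayList<>(Arrays.asList("0", "0"));
        ArrayList<String> b = new ArrayList<>(Arrays.asList("3", "4"));
        System.out.println(measureDistance(a, b)); // 5.0
    }
}
```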
Posted on 2014-01-05 19:02:01
OK, the straightforward explanation of the code:
- The mappers each run over some (presumably random) subset of the data and generate canopy centers that are all at least distance T1 apart from each other. Those centers are then emitted.
- The reducer then runs over all the canopy centers, from all the mappers, that belong to each particular stock key (MSFT, GOOG, etc.) and makes sure that no two canopy centers are within T1 of each other per stock key (e.g., no two centers in GOOG are within T1 of each other, although a center in MSFT and a center in GOOG might be close together).
The goal of the code is unclear, and personally I think there must be a bug. The reducer essentially solves the problem as if you were trying to generate centers for each stock key independently (i.e., computing canopy centers over all the data points for GOOG), while the mapper seems to solve the problem of generating centers for all stocks together. Put together like this, they contradict each other, so neither problem is actually being solved.
If you want centers over all stock keys: then the map output must send everything to a single reducer. Set the map output key to something trivial like a NullWritable. The reducer will then perform the correct operation without any changes.
If you want centers per stock key: then the mapper needs to be changed so that it effectively keeps a separate canopy list for each stock key. Either keep a separate ArrayList for every stock key (preferred, since it will be faster), or simply change the distance metric so that points belonging to different stock keys are infinitely far apart (so they never interact).
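The "separate canopy list per stock key" idea can be sketched without any Hadoop dependencies. In this sketch the threshold T1 and all names are hypothetical stand-ins for the ones in ClusterJob:

```java
import java.util.*;

public class PerKeyCanopies {
    static final double T1 = 1.0; // hypothetical threshold, stands in for ClusterJob.T1

    // One canopy-center list per stock key, instead of a single shared list
    private final Map<String, List<double[]>> centersByKey = new HashMap<>();

    // Returns true if the point became a new canopy center for its stock key
    boolean offer(String stockKey, double[] point) {
        List<double[]> centers = centersByKey.computeIfAbsent(stockKey, k -> new ArrayList<>());
        for (double[] center : centers) {
            if (distance(center, point) <= T1) {
                return false; // too close to an existing center for this key
            }
        }
        centers.add(point);
        return true;
    }

    // Plain Euclidean (L2) distance between two equal-length vectors
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Because each key has its own list, a GOOG center never suppresses an MSFT center, which matches the per-key behavior the reducer currently implements.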
By the way, your distance metric also has some unrelated problems. First, you parse the data with Double.parseDouble but don't catch NumberFormatException. Since you feed it stockData, which contains non-numeric strings like 'GOOG' in the first field, you'll crash as soon as you run it. Second, the distance metric ignores any fields with missing values. That is an incorrect implementation of an L2 (Pythagorean) distance metric. To see why, consider that the string "," has distance 0 from every other point, so if it is chosen as a canopy center, no other centers can be chosen. Rather than just setting the delta for a missing dimension to zero, consider setting it to something reasonable like the population mean for that attribute, or (to be safe) just dropping that row from the dataset for the purposes of clustering.
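The points above about parsing and missing values can be sketched as a safer distance function. This is only one possible convention (not the author's code): it returns Double.NaN for rows that cannot be compared, so the caller can drop them instead of crashing or silently treating missing fields as zero-delta:

```java
import java.util.*;

public class SafeDistance {
    // Returns NaN when the vectors cannot be compared (length mismatch or
    // non-numeric fields), so callers can discard the row instead of crashing.
    static double measureDistance(List<String> origin, List<String> destination) {
        if (origin.size() != destination.size()) {
            return Double.NaN; // missing dimensions: refuse to compare
        }
        double deltaSum = 0.0;
        for (int i = 0; i < origin.size(); i++) {
            try {
                double delta = Double.parseDouble(origin.get(i))
                             - Double.parseDouble(destination.get(i));
                deltaSum += delta * delta;
            } catch (NumberFormatException e) {
                return Double.NaN; // non-numeric field, e.g. a leftover "GOOG" key
            }
        }
        return Math.sqrt(deltaSum);
    }
}
```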
https://stackoverflow.com/questions/20931868