I am trying to understand this code for canopy clustering. The purpose of these two classes (one mapper, one reducer) is to find the canopy centers. My problem is that I don't understand the difference between the map and the reduce functions. They are nearly identical.
So is there a difference, or am I just repeating the same process over again in the reducer?
I think the answer is that the mapper and the reducer handle the data differently, performing different operations even though the code looks similar.
So could someone explain the map process and the reduce process when we are trying to find the canopy centers?
For example, I know the map output might look something like this -- (joe, 1) (dave, 1) (joe, 1) (joe, 1)
And then the reduce output would be something like this -- (joe, 3) (dave, 1)
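In plain (non-Hadoop) Java, the division of labor in that word-count example can be sketched like this; the class and method names here are made up purely for illustration:

```java
import java.util.*;

public class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every word, with no aggregation
    static List<Map.Entry<String, Integer>> map(String[] words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return pairs;
    }

    // "reduce" phase: group the emitted pairs by key and sum the values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted =
                map(new String[]{"joe", "dave", "joe", "joe"});
        System.out.println(emitted);          // [joe=1, dave=1, joe=1, joe=1]
        System.out.println(reduce(emitted));  // {dave=1, joe=3}
    }
}
```

The key point is that the map step only emits pairs, while the reduce step sees all pairs sharing a key and aggregates them; the two phases are not the same code run twice.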
Does the same sort of thing happen here?
Or am I performing the same task twice?
Thanks a lot.
The map function:
package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;

public class CanopyCentersMapper extends Mapper<LongWritable, Text, Text, Text> {
    //A list with the centers of the canopy
    private ArrayList<ArrayList<String>> canopyCenters;

    @Override
    public void setup(Context context) {
        this.canopyCenters = new ArrayList<ArrayList<String>>();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //Separate the stock name from the values to create a key of the stock and a list of values - what is the list of values?
        //What exactly are we splitting here?
        ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(",")));

        //Remove the stock key and make the first canopy center around it
        String stockKey = stockData.remove(0);

        //?
        String stockValue = StringUtils.join(",", stockData);

        //Check whether the stock is available for use as a new canopy center
        boolean isClose = false;
        for (ArrayList<String> center : canopyCenters) { //Run over the centers
            //I think... let's say at this point we have a few centers. Then we have our next point to check.
            //We have to compare that point with EVERY center already created. Only if its distance to EVERY center is larger than T1
            //does that point become a new center! But the more canopies we have, the better the chance it falls within
            //the radius of one of the canopies...

            //Measure the distance between the current point and the center being checked
            if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                //Center is too close
                isClose = true;
                break;
            }
        }

        //The point is not within the small radius of any center, so add it as a new canopy center
        if (!isClose) {
            //Center is not too close, add the current data as a new center
            canopyCenters.add(stockData);

            //Prepare Hadoop data for output
            Text outputKey = new Text();
            Text outputValue = new Text();
            outputKey.set(stockKey);
            outputValue.set(stockValue);

            //Output the stock key and values to the reducer
            context.write(outputKey, outputValue);
        }
    }
}
The reduce function:
package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class CanopyCentersReducer extends Reducer<Text, Text, Text, Text> {
    //The canopy centers list
    private ArrayList<ArrayList<String>> canopyCenters;

    @Override
    public void setup(Context context) {
        //Create a new list for the canopy centers
        this.canopyCenters = new ArrayList<ArrayList<String>>();
    }

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            //Convert the key and value back into the expected format
            String stockValue = value.toString();
            ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(stockValue.split(",")));
            String stockKey = key.toString();

            //Check whether the stock is available for use as a new canopy center
            boolean isClose = false;
            for (ArrayList<String> center : canopyCenters) { //Run over the centers
                //Measure the distance between the current point and the center being checked
                if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                    //Center is too close
                    isClose = true;
                    break;
                }
            }

            //The point is not within the small radius of any center, so add it as a new canopy center
            if (!isClose) {
                //Center is not too close, add the current data as a new center
                canopyCenters.add(stockData);

                //Prepare Hadoop data for output
                Text outputKey = new Text();
                Text outputValue = new Text();
                outputKey.set(stockKey);
                outputValue.set(stockValue);

                //Output the stock key and values
                context.write(outputKey, outputValue);
            }
        }
    }
}
**EDIT** -- more code and explanation:
stockKey is the key value representing a stock (NASDAQ symbols and the like).
ClusterJob.measureDistance():
public static double measureDistance(ArrayList<String> origin, ArrayList<String> destination)
{
    double deltaSum = 0.0;
    //Run over all points in the origin vector and calculate the sum of the squared deltas
    for (int i = 0; i < origin.size(); i++) {
        if (destination.size() > i) //Only add to the sum if there is a destination to compare to
        {
            deltaSum = deltaSum + Math.pow(Math.abs(Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i))), 2);
        }
    }
    //Return the square root of the sum
    return Math.sqrt(deltaSum);
}
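For intuition, measureDistance is just the Euclidean (L2) distance over the shared numeric fields. A standalone sanity check, reimplemented here without the ClusterJob class so it can run on its own, might look like:

```java
import java.util.*;

public class DistanceCheck {
    // Same logic as ClusterJob.measureDistance: sum of squared deltas, then sqrt
    static double measureDistance(ArrayList<String> origin, ArrayList<String> destination) {
        double deltaSum = 0.0;
        for (int i = 0; i < origin.size(); i++) {
            if (destination.size() > i) { // only compare dimensions both vectors have
                double delta = Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i));
                deltaSum += delta * delta;
            }
        }
        return Math.sqrt(deltaSum);
    }

    public static void main(String[] args) {
        ArrayList<String> a = new ArrayList<>(Arrays.asList("0", "0"));
        ArrayList<String> b = new ArrayList<>(Arrays.asList("3", "4"));
        System.out.println(measureDistance(a, b)); // 5.0
    }
}
```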
Posted on 2014-01-05 19:02:01
OK, the straightforward explanation of the code:
- The mappers each run over some (presumably random) subset of the data and generate canopy centers that are all at least distance T1 apart from each other. Those centers are then emitted.
- The reducer then runs over all the canopy centers, from all the mappers, that belong to each particular stock key (MSFT, GOOG, etc.) and makes sure that no two canopy centers are within T1 of each other per stock key (e.g., no two centers in GOOG are within T1 of each other, although a center in MSFT and a center in GOOG might be close together).
The goal of the code is unclear, and personally I think there must be a bug. The reducer essentially solves the problem as if you were trying to generate centers for each stock key independently (i.e., computing canopy centers over all the data points for GOOG), while the mapper seems to solve the problem of generating centers for all stocks together. Put together like this, they contradict each other, so neither problem is actually being solved.
If you want centers over all stock keys: then the map output must send everything to a single reducer. Set the map output key to something trivial like a NullWritable. The reducer will then perform the correct operation without any changes.
If you want centers per stock key: then the mapper needs to be changed so that it effectively keeps a separate canopy list for each stock key. Either keep a separate ArrayList for every stock key (preferred, since it will be faster), or simply change the distance metric so that points belonging to different stock keys are infinitely far apart (so they never interact).
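The "separate canopy list per stock key" idea can be sketched without any Hadoop dependencies. In this sketch the threshold T1 and all names are hypothetical stand-ins for the ones in ClusterJob:

```java
import java.util.*;

public class PerKeyCanopies {
    static final double T1 = 1.0; // hypothetical threshold, stands in for ClusterJob.T1

    // One canopy-center list per stock key, instead of a single shared list
    private final Map<String, List<double[]>> centersByKey = new HashMap<>();

    // Returns true if the point became a new canopy center for its stock key
    boolean offer(String stockKey, double[] point) {
        List<double[]> centers = centersByKey.computeIfAbsent(stockKey, k -> new ArrayList<>());
        for (double[] center : centers) {
            if (distance(center, point) <= T1) {
                return false; // too close to an existing center for this key
            }
        }
        centers.add(point);
        return true;
    }

    // Plain Euclidean (L2) distance between two equal-length vectors
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Because each key has its own list, a GOOG center never suppresses an MSFT center, which matches the per-key behavior the reducer currently implements.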
By the way, your distance metric also has some unrelated problems. First, you parse the data with Double.parseDouble but don't catch NumberFormatException. Since you feed it stockData, which contains non-numeric strings like 'GOOG' in the first field, you'll crash as soon as you run it. Second, the distance metric ignores any fields with missing values. That is an incorrect implementation of an L2 (Pythagorean) distance metric. To see why, consider that the string "," has distance 0 from every other point, so if it is chosen as a canopy center, no other centers can be chosen. Rather than just setting the delta for a missing dimension to zero, consider setting it to something reasonable like the population mean for that attribute, or (to be safe) just dropping that row from the dataset for the purposes of clustering.
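The points above about parsing and missing values can be sketched as a safer distance function. This is only one possible convention (not the author's code): it returns Double.NaN for rows that cannot be compared, so the caller can drop them instead of crashing or silently treating missing fields as zero-delta:

```java
import java.util.*;

public class SafeDistance {
    // Returns NaN when the vectors cannot be compared (length mismatch or
    // non-numeric fields), so callers can discard the row instead of crashing.
    static double measureDistance(List<String> origin, List<String> destination) {
        if (origin.size() != destination.size()) {
            return Double.NaN; // missing dimensions: refuse to compare
        }
        double deltaSum = 0.0;
        for (int i = 0; i < origin.size(); i++) {
            try {
                double delta = Double.parseDouble(origin.get(i))
                             - Double.parseDouble(destination.get(i));
                deltaSum += delta * delta;
            } catch (NumberFormatException e) {
                return Double.NaN; // non-numeric field, e.g. a leftover "GOOG" key
            }
        }
        return Math.sqrt(deltaSum);
    }
}
```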
https://stackoverflow.com/questions/20931868