Q: Chaining multiple MapReduce jobs in Hadoop

Stack Overflow user

Asked on 2010-03-23 19:55:15

Answers: 14, Views: 87.7K, Followers: 0, Votes: 126

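In many real-life situations where you apply MapReduce, the final algorithm ends up being several MapReduce steps,

i.e. Map1, Reduce1, Map2, Reduce2, and so on.

So the output of the last reduce is needed as the input for the next map.

The intermediate data is something you (in general) do not want to keep once the pipeline has completed successfully. Also, because this intermediate data is in general some data structure (like a "map" or a "set"), you don't want to put too much effort into writing and reading these key-value pairs.

What is the recommended way of doing that in Hadoop?

Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?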

mapreduce

hadoop

Stack Overflow user

Answered on 2015-10-27 21:12:19

You can run an MR chain in the manner shown in the code below.

Please note: only the driver code is provided.

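import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSorting {
    // Here the word keys shall be sorted; the word count logic runs first.
    // MyMapper, MyReducer and MyMapper2 are not shown: only the driver is given.

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // THE DRIVER CODE FOR MR CHAIN
        // JOB1: map + reduce, writing to the intermediate directory "FirstMapper".
        Configuration conf1 = new Configuration();
        Job j1 = Job.getInstance(conf1);
        j1.setJarByClass(WordCountSorting.class);
        j1.setMapperClass(MyMapper.class);
        j1.setReducerClass(MyReducer.class);

        j1.setMapOutputKeyClass(Text.class);
        j1.setMapOutputValueClass(IntWritable.class);
        j1.setOutputKeyClass(LongWritable.class);
        j1.setOutputValueClass(Text.class);
        Path outputPath = new Path("FirstMapper");
        FileInputFormat.addInputPath(j1, new Path(args[0]));
        FileOutputFormat.setOutputPath(j1, outputPath);
        // Remove output left over from a previous run (recursive delete).
        outputPath.getFileSystem(conf1).delete(outputPath, true);
        // Do not start JOB2 if JOB1 failed.
        if (!j1.waitForCompletion(true)) {
            System.exit(1);
        }

        // JOB2: map-only job that reads JOB1's output directory as its input.
        Configuration conf2 = new Configuration();
        Job j2 = Job.getInstance(conf2);
        j2.setJarByClass(WordCountSorting.class);
        j2.setMapperClass(MyMapper2.class);
        j2.setNumReduceTasks(0);
        j2.setOutputKeyClass(Text.class);
        j2.setOutputValueClass(IntWritable.class);
        Path outputPath1 = new Path(args[1]);
        FileInputFormat.addInputPath(j2, outputPath);
        FileOutputFormat.setOutputPath(j2, outputPath1);
        outputPath1.getFileSystem(conf2).delete(outputPath1, true);
        System.exit(j2.waitForCompletion(true) ? 0 : 1);
    }
}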

The sequence is

(JOB1) MAP -> REDUCE -> (JOB2) MAP
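To make the hand-off and the cleanup asked about in the question concrete, here is a minimal plain-Java sketch of the same two-stage shape. It is not Hadoop code: it uses only `java.nio.file`, and the names `ChainedStagesSketch`, `stage1`, `stage2`, `runChain` and the file name `part-00000` are made up for illustration. Stage 1 stands in for JOB1, stage 2 for the map-only JOB2, and the intermediate directory is removed in a `finally` block once the chain is done.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Plain-Java stand-in for a two-job chain (hypothetical names, no Hadoop API).
final class ChainedStagesSketch {

    // Stage 1 (stand-in for JOB1 map + reduce): count words and write
    // "word<TAB>count" records into the intermediate directory.
    static void stage1(List<String> inputLines, Path intermediateDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : inputLines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<String> records = new ArrayList<>();
        counts.forEach((word, count) -> records.add(word + "\t" + count));
        Files.createDirectories(intermediateDir);
        Files.write(intermediateDir.resolve("part-00000"), records);
    }

    // Stage 2 (stand-in for the map-only JOB2): read stage 1's records
    // and emit each one swapped to "count<TAB>word".
    static List<String> stage2(Path intermediateDir) throws IOException {
        List<String> out = new ArrayList<>();
        for (String record : Files.readAllLines(intermediateDir.resolve("part-00000"))) {
            String[] kv = record.split("\t");
            out.add(kv[1] + "\t" + kv[0]);
        }
        return out;
    }

    // Deletes a directory tree, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }

    // Runs both stages and always removes the intermediate data afterwards.
    static List<String> runChain(List<String> inputLines, Path intermediateDir) throws IOException {
        try {
            stage1(inputLines, intermediateDir);
            return stage2(intermediateDir);
        } finally {
            deleteRecursively(intermediateDir);
        }
    }

    public static void main(String[] args) throws IOException {
        Path intermediate = Files.createTempDirectory("chain").resolve("intermediate");
        runChain(List.of("pear apple pear"), intermediate).forEach(System.out::println);
    }
}
```

In the Hadoop driver above, the corresponding cleanup would presumably be a recursive `delete(outputPath, true)` on the "FirstMapper" directory after JOB2 finishes, the same call the driver already uses to clear old output before a run.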

This was done to sort the keys, but there are other ways to do that, for example using a TreeMap.
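As a rough illustration of the TreeMap idea, here is a small plain-Java sketch, assuming the words fit in memory. It does not use any Hadoop classes, and `TreeMapWordSort` and `countSorted` are hypothetical names. A `java.util.TreeMap` keeps its keys in natural order, so iterating over it yields the words already sorted.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration (hypothetical names, no Hadoop dependency).
final class TreeMapWordSort {

    // Counts whitespace-separated words; the returned TreeMap iterates
    // in ascending key order.
    static TreeMap<String, Integer> countSorted(String text) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : countSorted("pear apple pear fig").entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Whether this can replace the second job depends on how much data a single task has to hold, so treat it as an option for small key sets rather than a general substitute for chaining.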

However, I want to focus your attention on the way the jobs have been chained!

Thanks

Votes: 6


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/2499585
