Writing MapReduce in Python

Write your distributed programs in Python: development is fast, debugging is easy, and the result is more practical. MapReduce is a good fit for text-file processing and data mining. The example below builds the classic word count with Hadoop Streaming:

On each machine:

su - hadoop
wget http://www.python.org/ftp/python/3.0.1/Python-3.0.1.tar.bz2
tar jxvf Python-3.0.1.tar.bz2
cd Python-3.0.1
./configure --prefix=/home/hadoop/python
make
make install
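To confirm the build succeeded, a quick sanity check of the freshly installed interpreter (an extra step, not in the original; the path follows the --prefix above):

/home/hadoop/python/bin/python3.0 -V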

vi /home/hadoop/mapper.py

#!/home/hadoop/python/bin/python3.0
# mapper.py: emit a "word<TAB>1" pair for every word on stdin.

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # Hadoop Streaming treats the text before the tab as the key
        print("%s\t%s" % (word, 1))
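When words repeat heavily, an optional refinement (a sketch of mine, not part of the original article) is to pre-aggregate counts inside the mapper, which shrinks the data shuffled to the reducers at the cost of holding one dict per map task in memory:

#!/home/hadoop/python/bin/python3.0
# Hypothetical in-mapper combining variant of mapper.py.
import sys

counts = {}
for line in sys.stdin:
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1

# emit each word once per map task instead of once per occurrence
for word, count in counts.items():
    print("%s\t%s" % (word, count))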

vi /home/hadoop/reduce.py

#!/home/hadoop/python/bin/python3.0
# reduce.py: sum up the counts for every word arriving on stdin.

from operator import itemgetter
import sys

word2count = {}

for line in sys.stdin:
    line = line.strip()
    # each input line is "word<TAB>count", as produced by mapper.py
    word, count = line.split('\t', 1)
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # skip lines whose count is not a number
        pass

# sort alphabetically by word before printing the totals
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

for word, count in sorted_word2count:
    print("%s\t%s" % (word, count))
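Because Hadoop sorts mapper output by key before it reaches the reducer, the dictionary above is not strictly required. A leaner variant (again my sketch, assuming sorted input such as Hadoop or the sort in the pipeline below provides) streams one word group at a time and keeps only the current word in memory:

#!/home/hadoop/python/bin/python3.0
# Hypothetical streaming reducer: relies on input being sorted by word,
# so all pairs for a given word arrive as one contiguous run.
import sys
from itertools import groupby

def parse(stdin):
    for line in stdin:
        word, _, count = line.strip().partition('\t')
        if count.isdigit():
            yield word, int(count)

for word, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    print("%s\t%s" % (word, sum(count for _, count in group)))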

Test that it works:

echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

echo "foo foo quux labs foo bar quux" | /home/hadoop/mapper.py | sort | /home/hadoop/reduce.py
bar 1
foo 3
labs 1
quux 2
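The sort between the two scripts stands in for Hadoop's shuffle-and-sort phase, which guarantees the reducer sees all pairs for a given word together. For debugging without a cluster, here is a minimal all-in-Python sketch of the same pipeline (sample input assumed):

# Simulate map -> shuffle/sort -> reduce on one machine.
sample = "foo foo quux labs foo bar quux"
pairs = [(word, 1) for word in sample.split()]   # map step
pairs.sort()                                     # shuffle & sort step
counts = {}
for word, n in pairs:                            # reduce step
    counts[word] = counts.get(word, 0) + n
for word, n in sorted(counts.items()):
    print("%s\t%s" % (word, n))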

Make sure both files are in place on every node!
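One step the original takes for granted: for the shell (and Hadoop Streaming) to invoke the scripts directly, they must be executable on each node:

chmod +x /home/hadoop/mapper.py /home/hadoop/reduce.py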

On the master node, run:

# copy the conf directory into the HDFS file system
$ cd /home/hadoop/hadoop-0.19.1
$ bin/hadoop dfs -copyFromLocal conf 111

# check that it was copied (relative HDFS paths resolve under /user/hadoop)
$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2009-05-18 15:27 /user/hadoop/111

# run the distributed computation
$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -mapper /home/hadoop/mapper.py -reducer /home/hadoop/reduce.py -input 111/* -output 111-output
additionalConfSpec_:null
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-hadoop/hadoop-unjar29198/] [] /tmp/streamjob29199.jar tmpDir=null
[...] INFO mapred.FileInputFormat: Total input paths to process : 12
[...] INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hadoop/mapred/local]
[...] INFO streaming.StreamJob: Running job: job_200905191453_0001
[...] INFO streaming.StreamJob: To kill this job, run: ...
[...] INFO streaming.StreamJob: map 0% reduce 0%
[...] INFO streaming.StreamJob: map 43% reduce 0%
[...] INFO streaming.StreamJob: map 86% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 0%
[...] INFO streaming.StreamJob: map 100% reduce 33%
[...] INFO streaming.StreamJob: map 100% reduce 70%
[...] INFO streaming.StreamJob: map 100% reduce 77%
[...] INFO streaming.StreamJob: map 100% reduce 100%
[...] INFO streaming.StreamJob: Job complete: job_200905191453_0001
[...] INFO streaming.StreamJob: Output: 111-output

$ bin/hadoop dfs -ls 111-output
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2009-05-19 14:54 /user/hadoop/111-output/_logs
-rw-r--r--   2 hadoop supergroup      30504 2009-05-19 16:26 /user/hadoop/111-output/part-00000

$ bin/hadoop dfs -cat 111-output/part-00000
you 3
you've 1
your 1
zero 3
zero, 1

Done. Feel free to extend this example and build your own applications on top of it.
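Rather than copying the two scripts onto every node by hand, Hadoop Streaming's -file option can ship them with the job itself. A variant of the command above (I would expect this to work on a Hadoop of this era; the rest of the command is unchanged):

$ bin/hadoop jar contrib/streaming/hadoop-0.19.1-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reduce.py -reducer reduce.py -input 111/* -output 111-output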
