
Sorry, I am not sure about MongoDB.

If you just want to know how splitting happens when the data source is a table, here is how it works when MapReduce reads from HBase.

We use TableInputFormat to read an HBase table in a MapReduce job.

From the HBase book at <a href="http://hbase.apache.org/book.html#hbase.mapreduce.classpath" rel="nofollow">http://hbase.apache.org/book.html#hbase.mapreduce.classpath</a>:

7.7. Map-Task Splitting
7.7.1. The Default HBase MapReduce Splitter

When TableInputFormat is used to source an HBase table in a MapReduce job, its splitter will make a map task for each region of the table. Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless of how many column families are selected in the Scan.
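
To make the one-map-task-per-region rule concrete, here is a toy model in Python. It does not call HBase at all; the `Region` tuple, the `splits_for_table` function, and the sample keys are hypothetical names made up for illustration, so treat it as a sketch of the idea rather than the actual splitter.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical stand-in for an HBase region: a contiguous row-key range."""
    start_key: str  # inclusive; "" means "from the beginning of the table"
    end_key: str    # exclusive; "" means "to the end of the table"


def splits_for_table(regions: List[Region], column_families: List[str]) -> List[Region]:
    """Return one input split per region.

    column_families is accepted only to show that the selected families
    do not change the number of splits.
    """
    return [Region(r.start_key, r.end_key) for r in regions]


# A table with three regions (made-up boundaries).
regions = [Region("", "g"), Region("g", "p"), Region("p", "")]

one_family = splits_for_table(regions, ["cf1"])
two_families = splits_for_table(regions, ["cf1", "cf2"])
print(len(one_family), len(two_families))  # same split count either way
```

Each split would then be handed to its own map task, which scans only the rows in that key range.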

7.7.2. Custom Splitters

For those interested in implementing custom splitters, see the method getSplits in TableInputFormatBase. That is where the logic for map-task assignment resides.


You are describing <a href="http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBInputFormat.html" rel="nofollow"><code>DBInputFormat</code></a>. This is an input format that reads its splits from an external database. HDFS only gets involved in setting up the job, not in the actual input. There is also a <a href="http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapred/lib/db/DBOutputFormat.html" rel="nofollow"><code>DBOutputFormat</code></a>. With an input like <code>DBInputFormat</code> the splits are logical, e.g. key ranges.
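
As a rough illustration of what "logical splits" can mean for a database table, the Python sketch below divides a row count into contiguous ranges and builds one paging query per range. It is not the Hadoop implementation; the function names, the `messages` table, and the `id` column are assumptions chosen for the example.

```python
from typing import List, Tuple


def row_range_splits(row_count: int, num_splits: int) -> List[Tuple[int, int]]:
    """Divide rows [0, row_count) into num_splits contiguous (start, end) ranges.

    end is exclusive. The last range absorbs any remainder.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    chunk = row_count // num_splits
    splits = []
    for i in range(num_splits):
        start = i * chunk
        end = row_count if i == num_splits - 1 else start + chunk
        splits.append((start, end))
    return splits


def query_for_split(table: str, order_by: str, split: Tuple[int, int]) -> str:
    """Build the paging query that one map task would run for its split."""
    start, end = split
    return (
        f"SELECT * FROM {table} ORDER BY {order_by} "
        f"LIMIT {end - start} OFFSET {start}"
    )


splits = row_range_splits(10, 3)
for s in splits:
    print(query_for_split("messages", "id", s))
```

No file is ever cut into blocks here: each map task simply asks the database for its own slice of rows.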

Read <a href="http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/" rel="nofollow">Database Access with Apache Hadoop</a> for a detailed explanation.


This is a good question, not a stupid one.

1. 

"mongodb://localhost:27017/mongo_hadoop.messages and running my mappers and reducers and storing the data back to mongodb, how will HDFS come into picture. "

In this situation, you don't need to consider HDFS or do anything HDFS-related. It is like writing a multi-threaded application in which each thread writes its data to MongoDB.

In fact, HDFS and MapReduce are independent of each other, so you can use them separately or together, as you wish.
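
The multi-threaded analogy can be sketched in a few lines of Python. Nothing here talks to MongoDB or Hadoop: the `source` list and `sink` dict are in-memory stand-ins for the input and output collections, and `process_split` and the split boundaries are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Dict

source = [f"message {i}" for i in range(10)]  # stand-in for the input collection
sink: Dict[int, str] = {}                      # stand-in for the output collection
sink_lock = Lock()


def process_split(start: int, end: int) -> int:
    """Read records [start, end) from the source, transform them, write to the sink."""
    for i in range(start, end):
        result = source[i].upper()  # the "map" work for one record
        with sink_lock:
            sink[i] = result
    return end - start


# Logical splits over the source; no distributed file system is involved.
splits = [(0, 4), (4, 8), (8, 10)]
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    counts = list(pool.map(lambda s: process_split(*s), splits))

print(sum(counts), len(sink))
```

Each worker plays the role of a map task: it reads its own slice directly from the source and writes results directly to the sink.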

2.
If you want to use a database as the input or output of a MapReduce job, you should consider DBInputFormat, but that's another question.

Currently, Hadoop's DBInputFormat only supports JDBC. I'm not sure whether a MongoDB version of DBInputFormat exists. Maybe you can search for one or implement it yourself.

This might sound like a stupid question.
I can write MR code that takes HDFS locations as input and output, and then I really don't need to worry about the parallel computing power of Hadoop/MR. (Please correct me if I am wrong here.)

However, if my input is not an HDFS location, say I am taking MongoDB data as input (mongodb://localhost:27017/mongo_hadoop.messages), running my mappers and reducers, and storing the data back to MongoDB, how does HDFS come into the picture? I mean, how can I be sure that a 1 GB file, or a big file of any size, is first distributed on HDFS and then processed in parallel?
Is it that this direct URI will not distribute the data, so I need to take the BSON file instead, load it onto HDFS, and then give the HDFS path as input to MR? Or is the framework smart enough to do this by itself?

I am sorry if the above question is too stupid or does not make any sense at all. I am really new to big data but very excited to dive into this domain.

Thanks.

How to make MapReduce work with HDFS
