If you have many files and each one is small (you mention 300MB above, which I would count as small for Spark), you could try <code>SparkContext.wholeTextFiles</code>, which creates an RDD where each record is an entire file.
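
As a rough sketch of that idea (not taken from the answer itself): because each record holds a whole file, the group-by can run as ordinary Scala code inside a single task, with no shuffle between files. The helper name `aggregateFile`, the two-column `key,value` CSV layout and the sum aggregation are all assumptions made for illustration.

```scala
// Hypothetical per-file aggregation: assumes a header line followed by
// "key,value" rows, and sums the values for each key.
def aggregateFile(content: String): Map[String, Double] =
  content
    .split("\n")
    .map(_.trim)
    .filter(_.nonEmpty)
    .drop(1) // skip the header row
    .map(_.split(","))
    .collect { case Array(k, v) => (k.trim, v.trim.toDouble) }
    .groupBy(_._1) // local group-by, no Spark shuffle involved
    .map { case (k, rows) => (k, rows.map(_._2).sum) }
```

With a `SparkContext` named `sc` and a hypothetical input directory `inputDir`, this would be applied as `sc.wholeTextFiles(inputDir).mapValues(aggregateFile)`, giving one `(path, result)` pair per file. Each file is held in memory as a single string, so this probably only suits files that fit comfortably in executor memory.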

This is more an idea than a full solution, and I haven't tested it yet.

You can start by extracting your data processing pipeline into a function:

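<pre><code>// Load one CSV file and aggregate it on its own.
// groupBy(...) and agg(...) are placeholders for your own columns.
def pipeline(f: String, n: Int) = {
  sqlContext
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .load(f)
    .repartition(n) // n partitions for this single file
    .groupBy(...)
    .agg(...)
    .cache() // Cache so we can force computation later
}
</code></pre>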

If your files are small, you can adjust the <code>n</code> parameter to use as few partitions as possible, just enough to fit the data from a single file, and so avoid shuffling. This limits concurrency, but we'll get back to that issue later.

<pre><code>val n: Int = ??? 
</code></pre>
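
One possible way to pick `n`, sketched under assumptions: derive it from the file size and a target partition size. The helper name `partitionsFor` and the 128 MB default are illustrative choices, not values from the answer or a Spark rule.

```scala
// Hypothetical helper: number of partitions needed for one file, given a
// target partition size (the 128 MB default is an assumption).
def partitionsFor(fileSizeBytes: Long,
                  targetPartitionBytes: Long = 128L * 1024 * 1024): Int = {
  require(targetPartitionBytes > 0, "target partition size must be positive")
  // Ceiling division, with a minimum of one partition
  math.max(1L, (fileSizeBytes + targetPartitionBytes - 1) / targetPartitionBytes).toInt
}
```

For a 300 MB file and the assumed 128 MB target this yields 3 partitions; a smaller target gives more parallelism inside each file at the cost of more tasks.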

Next, you have to obtain a list of input files. This step depends on the data source, but most of the time it is more or less straightforward:

<pre><code>val files: Array[String] = ???
</code></pre>
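
For instance, if the files happen to live on a local filesystem (an assumption; HDFS or object storage would need that store's own listing API), a minimal sketch of the listing step could look like this. `listCsvFiles` is a hypothetical helper name.

```scala
import java.io.File

// Recursively collect the paths of all .csv files under a local directory.
def listCsvFiles(dir: File): Array[String] = {
  // listFiles() returns null when `dir` is not a readable directory
  val entries = Option(dir.listFiles()).getOrElse(Array.empty[File])
  val (dirs, files) = entries.partition(_.isDirectory)
  val here = files.filter(_.getName.endsWith(".csv")).map(_.getPath)
  here ++ dirs.flatMap(d => listCsvFiles(d))
}
```

The result can then be assigned directly, for example `val files: Array[String] = listCsvFiles(new File("/data/input"))`, where the path is a placeholder.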

Next, you can map the above list using the <code>pipeline</code> function:

<pre><code>val rdds = files.map(f =&gt; pipeline(f, n))
</code></pre>

Since we limit concurrency at the level of a single file, we want to compensate by submitting multiple jobs. Let's add a simple helper that forces evaluation and wraps it in a <code>Future</code>:

<pre><code>import scala.concurrent._
import ExecutionContext.Implicits.global

// Run the job for one DataFrame on a separate thread
def pipelineToFuture(df: org.apache.spark.sql.DataFrame) = Future {
  df.rdd.foreach(_ =&gt; ()) // Force computation
  df
}
</code></pre>

Finally, we can use the above helper on the <code>rdds</code>:

<pre><code>val result = Future.sequence(
  rdds.map(rdd =&gt; pipelineToFuture(rdd)).toList
)
</code></pre>

Depending on your requirements you can add <code>onComplete</code> callbacks or use reactive streams to collect the results.
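
As an illustration of the callback route, here is a sketch that uses plain Scala futures with a dummy job standing in for the Spark pipeline. `processFile` and `collectAll` are hypothetical names. A plain `Future.sequence` fails as a whole if any job fails, so the sketch wraps each result in a `Try` to keep the outcomes of the other files.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}

// Stand-in for one per-file job; in the real pipeline this would be the
// Future that forces the DataFrame computation.
def processFile(path: String): Future[Int] = Future {
  if (path.isEmpty) throw new IllegalArgumentException("empty path")
  path.length // dummy result
}

// Pair each path with a Try so one failure does not discard the rest.
def collectAll(paths: Seq[String]): Future[Seq[(String, Try[Int])]] =
  Future.sequence(paths.map { p =>
    val attempt: Future[Try[Int]] =
      processFile(p).map(r => (Success(r): Try[Int])).recover { case e => Failure(e) }
    attempt.map(t => p -> t)
  })

// Report every outcome once all jobs have finished.
collectAll(Seq("a.csv", "", "bb.csv")).onComplete {
  case Success(rs) => rs.foreach { case (p, r) => println(s"'$p' -> $r") }
  case Failure(e)  => println(s"unexpected failure: $e")
}
```

The callback runs asynchronously on the global execution context, so a short-lived program would still need to wait for the future (for example with `Await.result`) before exiting.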

This way we can write multiple RDDs in parallel:

<pre><code>import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// IApplicationEventListener, IprogramLogger and programLoggerFactory are project-specific types.
public class ParallelWriteSevice implements IApplicationEventListener {

    private static final IprogramLogger logger = programLoggerFactory.getLogger(ParallelWriteSevice.class);

    private static ExecutorService executorService = null;
    private static List&lt;Future&lt;Boolean&gt;&gt; futures = new ArrayList&lt;Future&lt;Boolean&gt;&gt;();

    // Queue a write task; the thread pool is created lazily on first use.
    public static void submit(Callable&lt;Boolean&gt; callable) {
        if (executorService == null) {
            executorService = Executors.newFixedThreadPool(15); // Based on target tables increase this
        }
        futures.add(executorService.submit(callable));
    }

    // Block until every submitted write has finished; true only if all succeeded.
    public static boolean isWriteSucess() {
        boolean writeFailureOccured = false;
        try {
            for (Future&lt;Boolean&gt; future : futures) {
                try {
                    Boolean writeStatus = future.get();
                    if (writeStatus == false) {
                        writeFailureOccured = true;
                    }
                } catch (Exception e) {
                    logger.error("Error - Scheduled write failed " + e.getMessage(), e);
                    writeFailureOccured = true;
                }
            }
        } finally {
            resetFutures();
            if (executorService != null) {
                executorService.shutdown();
            }
            executorService = null;
        }
        return !writeFailureOccured;
    }

    private static void resetFutures() {
        logger.error("resetFutures called");
        //futures.clear();
    }
}
</code></pre>
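
For comparison with the Scala code earlier in the thread, roughly the same pattern can be sketched with a bounded pool behind an `ExecutionContext`. This is an illustrative equivalent, not the class above: `writeAll`, its parameters and the `() => Boolean` task shape are assumptions.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Run the given write tasks on a bounded pool and report whether all succeeded.
// Each task returns true on success; an exception counts as a failure.
def writeAll(tasks: Seq[() => Boolean], poolSize: Int, timeout: Duration): Boolean = {
  val pool = Executors.newFixedThreadPool(poolSize)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val attempts = tasks.map(t => Future(t()).recover { case _: Exception => false })
    Await.result(Future.sequence(attempts), timeout).forall(identity)
  } finally {
    pool.shutdown() // always release the threads, even after a timeout
  }
}
```

The pool size plays the same role as the `15` in the Java class: it caps how many writes run at once.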

I have a scenario where a certain number of operations, including a group by, have to be applied to a number of small (~300MB each) files. The operation looks like this:

<code>df.groupBy(....).agg(....)</code>

To process multiple files, I can use a wildcard such as "/**/*.csv"; however, that creates a single RDD and partitions it for the operations. Since the operation is a group by, it involves a lot of shuffling, which is unnecessary if the files are mutually exclusive.

What I am looking for is a way to create independent RDDs on the files and operate on them independently.

Processing multiple files as independent RDDs in parallel
