You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. For example:

<pre><code>sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
</code></pre>

As Nick Chammas points out, this is an exposure of Hadoop's <a href="https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/FileInputFormat.html" rel="noreferrer"><code>FileInputFormat</code></a>, so it also works with Hadoop (and Scalding).
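When the paths are collected in a list rather than typed by hand, the comma-separated string can be built programmatically. The following is a minimal PySpark sketch, not part of the original answer: <code>join_paths</code> and <code>read_many</code> are hypothetical helper names, and <code>sc</code> is assumed to be an existing <code>SparkContext</code>.

```python
def join_paths(paths):
    """Join files, directories, and glob patterns into one comma-separated string."""
    paths = list(paths)
    if not paths:
        raise ValueError("at least one path is required")
    return ",".join(paths)


def read_many(sc, paths):
    """Read every path into a single RDD of lines with one textFile call.

    `sc` is assumed to be a pyspark.SparkContext created elsewhere.
    """
    return sc.textFile(join_paths(paths))
```

Calling <code>read_many(sc, ["/my/dir1", "/my/paths/part-00[0-5]*"])</code> passes the same kind of string as the example above to <code>textFile</code>.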

Use <code>union</code> as follows:

<pre><code>val sc = new SparkContext(...)
val r1 = sc.textFile("xxx1")
val r2 = sc.textFile("xxx2")
...
val rdds = Seq(r1, r2, ...)
val bigRdd = sc.union(rdds)
</code></pre>

<code>bigRdd</code> is then a single RDD containing the contents of all the files.
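If the file list is only known at run time, the same pattern can be written without naming each RDD. This is a minimal PySpark sketch under the assumption that <code>sc</code> is an existing <code>SparkContext</code>; <code>union_text_files</code> is a hypothetical helper name.

```python
def union_text_files(sc, paths):
    """Build one RDD per path, then merge them with a single SparkContext.union call."""
    rdds = [sc.textFile(path) for path in paths]
    if not rdds:
        raise ValueError("at least one path is required")
    return sc.union(rdds)
```

Passing the whole list to <code>sc.union</code> once, rather than chaining pairwise unions in a loop, keeps the resulting lineage flat.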

You can use a single <code>textFile</code> call to read multiple files. Python:

<pre><code>sc.textFile(','.join(files)) 
</code></pre>

You can do it as follows.

First, get a Buffer/List of the S3 paths:

<pre><code>import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  val files = new ArrayList[String]

  // S3 client and list-objects request
  val s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  val listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)

  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Prepend s3:// to each key and collect the paths, one page of results at a time
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary &lt;- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Remove the first entry (the base directory name)
  files.remove(0)

  // Convert to a Scala collection
  files.asScala
}
</code></pre>

Now pass this list to the following piece of code. Note: <code>sc</code> is a <code>SQLContext</code> here.

<pre><code>import org.apache.spark.sql.DataFrame

// read.text returns a DataFrame with one string column per line of the file
var df: DataFrame = null
for (file &lt;- files) {
  val fileDf = sc.read.text(file)
  if (df != null) {
    df = df.unionAll(fileDf)
  } else {
    df = fileDf
  }
}
</code></pre>

You now have a single unified DataFrame, <code>df</code>.

Optionally, you can also repartition the data into a single partition:

<pre><code>val files = sc.textFile(filename, 1).repartition(1)
</code></pre>

Repartitioning always works :D

You can use <code>wholeTextFiles</code>:

<pre><code>JavaPairRDD&lt;String, String&gt; records = sc.wholeTextFiles("path of your directory");
</code></pre>

Each record is a pair of a file's path and that file's content, so you can operate on a whole file at a time, which saves overhead.
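To illustrate working on one whole file per record, here is a minimal PySpark sketch that counts the lines in each file. The names <code>count_lines</code> and <code>lines_per_file</code> are hypothetical, and <code>sc</code> is assumed to be an existing <code>SparkContext</code>. Because each file's content arrives as a single string, this approach suits many small files better than a few very large ones.

```python
def count_lines(content):
    """Count the lines in the full text of one file."""
    return len(content.splitlines())


def lines_per_file(sc, directory):
    """Return an RDD of (file path, line count) pairs.

    wholeTextFiles yields one (path, content) record per file in `directory`.
    """
    return sc.wholeTextFiles(directory).mapValues(count_lines)
```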

<pre><code>rdd = sc.textFile('/data/{1.txt,2.txt}')
</code></pre>

I want to read a bunch of text files from an HDFS location and perform a mapping on them in an iteration using Spark.

<code>JavaRDD&lt;String&gt; records = ctx.textFile(args[1], 1);</code> is capable of reading only one file at a time.

I want to read more than one file and process them as a single RDD. How?

How to read multiple text files into a single RDD?

Hadoop 
