Q: Reading a file from S3 with HadoopInputFile throws FileNotFoundException

Stack Overflow user

Asked 2021-10-13 17:02:58

Answers: 1 · Views: 279 · Followers: 0 · Votes: 0

I am trying to read Parquet files from a directory in S3.

val bucketKey = "s3a://foo/direcoty_to_retrieve/"
val conf: Configuration = new Configuration()
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)
val inputFile = HadoopInputFile.fromPath(new Path(bucketKey), conf)
val reader: ParquetReader[GenericRecord] =  AvroParquetReader.builder[GenericRecord](inputFile).withConf(conf).build()

Whatever I try, I get:

Exception in thread "main" java.io.FileNotFoundException: No such file or directory: s3a://foo/direcoty_to_retrieve
    at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3356)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
    at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)

Edit: when I use AvroParquetReader.builder with a file path instead, for example:

val reader: ParquetReader[GenericRecord] =  AvroParquetReader.builder[GenericRecord](new Path(bucketKey)).withConf(conf).build()

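it works, but this overload is deprecated and I would rather not use it. With a local directory it works. The AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY environment variables are set correctly. What could be the problem?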

Tags: scala, hadoop, amazon-s3, parquet

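1 Answer

Stack Overflow user

Answered 2022-09-28 11:19:41

I ran into the same problem when reading Parquet files from an Amazon S3 bucket through the Alpakka Avro Parquet library. After some debugging, I traced it to the ParquetReader.build method.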

public ParquetReader<T> build() throws IOException {
  ParquetReadOptions options = optionsBuilder.build();

  if (path != null) {
    FileSystem fs = path.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(path);

    if (stat.isFile()) {
      return new ParquetReader<>(
          Collections.singletonList((InputFile) HadoopInputFile.fromStatus(stat, conf)),
          options,
          getReadSupport());

    } else {
      List<InputFile> files = new ArrayList<>();
      for (FileStatus fileStatus : fs.listStatus(path, HiddenFileFilter.INSTANCE)) {
        files.add(HadoopInputFile.fromStatus(fileStatus, conf));
      }
      return new ParquetReader<T>(files, options, getReadSupport());
    }

  } else {
    return new ParquetReader<>(Collections.singletonList(file), options, getReadSupport());
  }
}

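When HadoopInputFile is used as the input, the builder's path property is null, so the reader is created in the else block, which wraps the single InputFile and never lists a directory. Because the Parquet data here is represented as a directory in the file system, this results in java.io.FileNotFoundException.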

For now, the workaround is to use the deprecated method:

AvroParquetReader.builder[GenericRecord](new Path(bucketKey)).withConf(conf).build()
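
If the deprecated overload has to be avoided, one possible alternative is to reproduce the directory branch of the build method shown above by hand: list the directory, then open one reader per file through the InputFile overload. The following is only a minimal sketch under assumptions, not the library's own approach. The object and helper names (ParquetDirectoryRead, listParquetInputs, readAll) are hypothetical, the extra isFile filter is an addition that build does not apply, and nested directories are not traversed. It also assumes the directory can be listed with the given Configuration.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.util.{HadoopInputFile, HiddenFileFilter}
import org.apache.parquet.io.InputFile

// Hypothetical helper object, not part of parquet-hadoop or parquet-avro.
object ParquetDirectoryRead {

  // List the visible entries directly under `dir` and wrap each file as an InputFile.
  // HiddenFileFilter skips names starting with "_" or ".", as build() does for a Path.
  // Assumption: the `isFile` filter is added here; build() itself does not apply it.
  def listParquetInputs(dir: Path, conf: Configuration): Seq[InputFile] = {
    val fs = dir.getFileSystem(conf)
    fs.listStatus(dir, HiddenFileFilter.INSTANCE)
      .toSeq
      .filter(_.isFile)
      .map(status => HadoopInputFile.fromStatus(status, conf))
  }

  // Read every record of a single file with the non-deprecated InputFile builder.
  // ParquetReader.read() returns null once the file is exhausted.
  def readAll(file: InputFile, conf: Configuration): Vector[GenericRecord] = {
    val reader = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
    try Iterator.continually(reader.read()).takeWhile(_ != null).toVector
    finally reader.close()
  }
}
```

With the bucketKey and conf values from the question, the records of all files could then be collected with ParquetDirectoryRead.listParquetInputs(new Path(bucketKey), conf).flatMap(ParquetDirectoryRead.readAll(_, conf)). Note that this loads every record into memory, which is only reasonable for small data sets.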

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/69559525
