I am trying to write a DataFrame to a BigQuery table. I have set up the SparkSession with the required parameters. However, when writing, I get this error:
Py4JJavaError: An error occurred while calling o114.save.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
The code is as follows:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark2 = SparkSession.builder \
    .config("spark.jars", "/Users/xyz/Downloads/gcs-connector-hadoop2-latest.jar") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0") \
    .config("google.cloud.auth.service.account.json.keyfile", "/Users/xyz/Downloads/MyProject-cd7627f8ef9b.json") \
    .getOrCreate()
spark2.conf.set("parentProject", "xyz")

data = spark2.createDataFrame(
    [
        ("AAA", 51),
        ("BBB", 23),
    ],
    ['codiPuntSuministre', 'valor']
)

spark2.conf.set("temporaryGcsBucket", "bqconsumptions")
data.write.format('bigquery') \
    .option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json") \
    .option('table', 'consumptions.c1') \
    .mode('append') \
    .save()

df = spark2.read.format("bigquery") \
    .option("credentialsFile", "/Users/xyz/Downloads/MyProject-xyz.json") \
    .load("consumptions.c1")
If I remove the write from the code, I get no errors, so the error occurs when writing and is probably related to the auxiliary bucket used for the BigQuery operations.
Posted on 2020-11-18 19:12:41
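The error indicates that Spark does not recognize the gs file system. When you write to BigQuery, the files are first staged temporarily in a Google Cloud Storage bucket and then loaded into the BigQuery table, so you need to add support for the gs file system with the setting below.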
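# fs.gs.impl expects the FileSystem implementation; GoogleHadoopFS is the AbstractFileSystem one
spark._jsc.hadoopConfiguration().set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")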
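The same registration can probably be done when the session is built, since Spark copies any option prefixed with `spark.hadoop.` into the Hadoop configuration. The following is only a sketch under assumptions: the helper names `gcs_hadoop_conf` and `build_session` are hypothetical, the paths are placeholders, and the authentication property names are the ones used by the older (hadoop2-era) GCS connector, so they may differ in newer connector releases.

```python
def gcs_hadoop_conf(keyfile_path):
    """Return Spark options that register the gs:// scheme (hypothetical helper).

    Assumption: the GCS connector jar is already on the classpath, and the
    auth property names match the older connector versions.
    """
    return {
        # FileSystem implementation used for gs:// paths
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        # AbstractFileSystem implementation for the same scheme
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        # Service-account authentication (assumed property names)
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
            keyfile_path,
    }


def build_session(jar_path, keyfile_path):
    """Build a SparkSession with the GCS connector configured (hypothetical helper)."""
    # Imported lazily so the config helper can be used without pyspark installed
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.config("spark.jars", jar_path)
    for key, value in gcs_hadoop_conf(keyfile_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    # Print the options that would be applied; the path is a placeholder
    for key, value in gcs_hadoop_conf("/path/to/keyfile.json").items():
        print(key, "=", value)
```

Setting these options before `getOrCreate()` avoids reaching into the private `_jsc` attribute, though both approaches should end up in the same Hadoop configuration.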
https://stackoverflow.com/questions/64824940