
How can I connect to Google BigQuery from Spark in Java, running locally?

Stack Overflow user
Asked on 2019-12-05 20:56:32
2 answers · Viewed 3.4K times · 0 following · Score 2

I am trying to connect to Google BigQuery from Spark in Java, but I cannot find accurate documentation for exactly this.

I have tried: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

https://github.com/GoogleCloudPlatform/spark-bigquery-connector#compiling-against-the-connector

My code:

sparkSession.conf().set("credentialsFile", "/path/OfMyProjectJson.json");
Dataset<Row> dataset = sparkSession.read().format("bigquery").option("table","myProject.myBigQueryDb.myBigQuweryTable")
          .load();
dataset.printSchema();

But it throws this exception:

Exception in thread "main" java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
    at java.util.ServiceLoader.fail(ServiceLoader.java:232)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
    at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:614)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
    at com.mySparkConnector.getDataset(BigQueryFetchClass.java:12)


Caused by: java.lang.IllegalArgumentException: A project ID is required for this service but could not be determined from the builder or the environment.  Please set a project ID using the builder.
    at com.google.cloud.spark.bigquery.repackaged.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:285)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:91)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.<init>(BigQueryOptions.java:30)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions$Builder.build(BigQueryOptions.java:86)
    at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryOptions.getDefaultInstance(BigQueryOptions.java:159)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider$.$lessinit$greater$default$2(BigQueryRelationProvider.scala:29)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.<init>(BigQueryRelationProvider.scala:40)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at java.lang.Class.newInstance(Class.java:442)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
    ... 15 more

My JSON file does contain a project_id. I tried to search for possible solutions but could not find any, so please help me resolve this exception, or point me to any documentation on connecting Spark to BigQuery.
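Since the exception says the project ID "could not be determined from the builder or the environment", it may be worth confirming that the key file really exposes a project_id before suspecting the connector. A minimal JDK-only sketch (the class name and the regex-based parsing are illustrative, not part of any Google API):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Quick stdlib check that a service-account key file actually carries the
// project_id field the connector fails to resolve in the stack trace above.
public class CheckKeyFile {
    static String projectId(String json) {
        Matcher m = Pattern.compile("\"project_id\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            String json = new String(Files.readAllBytes(Paths.get(args[0])));
            System.out.println("project_id = " + projectId(json));
        }
    }
}
```

If this prints `project_id = null` for your key file, the file is not a complete service-account key and no connector setting will fix it.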


2 Answers

Stack Overflow user

Answered on 2021-04-29 04:46:07

In Airflow, I got exactly the same error with the DataProcPySparkOperator operator. The fix was to provide

dataproc_pyspark_jars='gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar'

instead of

dataproc_pyspark_jars='gs://spark-lib/bigquery/spark-bigquery-latest.jar'

I guess that in your case it should be passed as a command-line argument, like this:

--jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
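For a local command-line launch, that argument would go to `spark-submit`. A sketch, assuming the class name from the stack trace and a hypothetical `my-job.jar`:

```shell
# Attach the connector build that matches the cluster's Scala version
# (2.12 here); a mismatched Scala build is a common cause of
# ServiceConfigurationError when the provider class is instantiated.
spark-submit \
  --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
  --class com.mySparkConnector.BigQueryFetchClass \
  my-job.jar
```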
Score 1

Stack Overflow user

Answered on 2019-12-06 00:39:04

The spark-bigquery-connector recently merged a PR handling this issue, and a new version of the connector will be released soon.

For now, a simple solution is to add the environment variable GOOGLE_APPLICATION_CREDENTIALS=/path/OfMyProjectJson.json to the Spark runtime.
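When launching locally, that workaround could look like this (a sketch; the key path is the placeholder from the question and `my-job.jar` is hypothetical):

```shell
# Export the key file location so the connector's default credential chain
# (which also resolves the project ID) can find it, then start the job.
export GOOGLE_APPLICATION_CREDENTIALS=/path/OfMyProjectJson.json
spark-submit my-job.jar
```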

Score 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/59195716
