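I am trying to run an application on EMR and am having trouble using the Spark Cassandra Connector. I have no problem pulling it in locally, but all my attempts to use the library on EMR have failed.
When I include the library with --jars s3://XXX/XXXX/spark-cassandra-connector-driver_2.12-3.2.0.jar, I get an error on the following lines: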
d = spark \
.read \
.format("org.apache.spark.sql.cassandra") \
.options(table="YYYY", keyspace="YYY") \
.load()
The error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o121.load.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.cassandra. Please find packages at
http://spark.apache.org/third-party-projects.html
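When I try to add the package with --packages com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 instead, the application fails during dependency resolution with the following output: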
com.datastax.spark#spark-cassandra-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5ee06249-c545-4b92-804f-ecedd322158a;1.0
confs: [default]
:: resolution report :: resolve 524554ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 1 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0
==== local-m2-cache: tried
file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
file:/home/hadoop/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
==== local-ivy-cache: tried
/home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/ivys/ivy.xml
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
/home/hadoop/.ivy2/local/com.datastax.spark/spark-cassandra-connector_2.12/3.2.0/jars/spark-cassandra-connector_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
==== spark-packages: tried
https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom
-- artifact com.datastax.spark#spark-cassandra-connector_2.12;3.2.0!spark-cassandra-connector_2.12.jar:
https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: com.datastax.spark#spark-cassandra-connector_2.12;3.2.0: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.pom (java.net.ConnectException: Connection timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/com/datastax/spark/spark-cassandra-connector_2.12/3.2.0/spark-cassandra-connector_2.12-3.2.0.jar (java.net.ConnectException: Connection timed out (Connection timed out))
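I'm betting the --packages problem comes from a firewall configuration issue, but I can't see any way to open up access. As for the --jars problem, I don't know why the .jar is not enough for Spark to recognize the org.apache.spark.sql.cassandra format.
Any help on either issue would be greatly appreciated, thanks!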
Answered on 2022-08-09 14:26:52
Yes, the problem with --packages is most likely that your egress settings block access to Maven Central.
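One way to confirm that outbound access is the problem is to try opening a TCP connection from the cluster to the two repository hosts that appear in the resolution log. The following is only a minimal sketch using the Python standard library; the helper name can_reach and the 5-second timeout are illustrative choices, not part of Spark or EMR.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests TCP reachability,
    not whether the repository actually serves the artifact.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection timeouts, refused connections, and DNS failures.
        return False
```

Running can_reach("repo1.maven.org") and can_reach("repos.spark-packages.org") on the master node should return False if egress to those hosts is blocked, which would match the "Connection timed out" errors in the log.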
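To use --jars, you need to specify all the necessary jars: the connector, the connector driver, the Java driver, and so on. The easiest way to avoid this is to use the so-called assembly build, which is also available on Maven Central under the coordinates com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.2.0. Just download that jar file and reference it.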
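As a rough sketch of what that looks like in practice, the snippet below derives the Maven Central download URL for the assembly jar from its coordinates (using the same repository layout that appears in the resolution log) and builds a spark-submit argument list that passes the jar via --jars. The helper names, the script name my_app.py, and the idea of re-uploading the jar to the same s3://XXX/XXXX/ placeholder location used in the question are all assumptions for illustration.

```python
from typing import List

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_central_jar_url(group: str, artifact: str, version: str) -> str:
    """Build the jar URL using the standard Maven repository layout."""
    group_path = group.replace(".", "/")
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"


def spark_submit_args(jar_location: str, app: str) -> List[str]:
    """Assemble a spark-submit command line that ships one extra jar."""
    return ["spark-submit", "--jars", jar_location, app]


# Download this URL from a machine that has internet access,
# then upload the jar to a location the cluster can read.
url = maven_central_jar_url(
    "com.datastax.spark", "spark-cassandra-connector-assembly_2.12", "3.2.0"
)

# Placeholder S3 path and hypothetical script name.
args = spark_submit_args(
    "s3://XXX/XXXX/spark-cassandra-connector-assembly_2.12-3.2.0.jar", "my_app.py"
)
print(url)
print(" ".join(args))
```

Because the assembly jar bundles the connector together with its dependencies, a single --jars entry should be enough, and no access to Maven Central is needed from the cluster at submit time.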
https://stackoverflow.com/questions/73293038