When I run the PySpark example below on my local Spark installation, it works perfectly well. However, when I run it against a remote Spark cluster, I get the following error on the worker nodes:
Caused by: org.apache.spark.SparkException:
Error from python worker:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 163, in _run_module_as_main
mod_name, _Error)
File "/usr/lib/python2.7/runpy.py", line 102, in _get_module_details
loader = get_loader(mod_name)
File "/usr/lib/python2.7/pkgutil.py", line 462, in get_loader
return find_loader(fullname)
File "/usr/lib/python2.7/pkgutil.py", line 472, in find_loader
for importer in iter_importers(fullname):
File "/usr/lib/python2.7/pkgutil.py", line 428, in iter_importers
__import__(pkg)
File "/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 53, in <module>
File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 34, in <module>
File "/spark/python/lib/pyspark.zip/pyspark/java_gateway.py", line 31, in <module>
File "/spark/python/lib/pyspark.zip/pyspark/find_spark_home.py", line 68
print("Could not find valid SPARK_HOME while searching {0}".format(paths), file=sys.stderr)
^
SyntaxError: invalid syntax
PYTHONPATH was:
/spark/python/lib/pyspark.zip:/spark/python/lib/py4j-0.10.9-src.zip:/spark/jars/spark-core_2.12-3.1.1.jar:/spark/python/lib/py4j-0.10.9-src.zip:/spark/python:
org.apache.spark.SparkException: EOFException occurred while reading the port number from pyspark.daemon's stdout
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:217)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:132)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Here is the example Python script I am trying to run:
import pyspark
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName('SparkTest')
conf.setMaster('spark://xxxxx:7077')
conf.setSparkHome('/spark/')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
words = sc.parallelize(["abc","bcd","abc","ddd","eee","hjfghjgf","eee","sbc"])
counts = words.count()
print("Number of elements in RDD ".counts)
Additional information:
The Python script I am trying to run is submitted from a Windows machine, where PySpark runs on Python 3.7 against the spark-3.1.1-bin-hadoop3.2 client.
Any help would be greatly appreciated. Thanks.
Posted on 2022-08-02 06:14:23
I ran into the same problem. The cause is exactly what the traceback says: the worker machines were using /usr/lib/python2.7, which should not happen, since I intended to use Python 3 from an Anaconda env.
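(As an aside, once the Python daemon can start at all, a quick sketch like the one below, reusing the sc from the script above, shows which interpreter the executors actually run; sys.version is evaluated in the worker processes, not on the driver.)
import sys
# Each partition reports the Python version of the executor process it runs in.
versions = (sc.parallelize(range(4), 4)
              .map(lambda _: sys.version)
              .distinct()
              .collect())
print(versions)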
I fixed it by simply linking the system python (/usr/bin/python) to the Python I actually wanted to use:
-rwxr-xr-x. 1 root root 11392 Apr 11 2018 pwscore
-rwxr-xr-x 1 root root 78 Nov 17 2020 pydoc
lrwxrwxrwx 1 root root 8 Feb 10 17:57 pydoc3 -> pydoc3.6
-rwxr-xr-x 1 root root 78 Nov 17 2020 pydoc3.6
lrwxrwxrwx 1 root root 50 Jun 23 19:37 python -> /home/hadoop/app/anaconda3/envs/bigdata/bin/python
lrwxrwxrwx 1 root root 9 Feb 10 18:04 python2 -> python2.7
-rwxr-xr-x 1 root root 7144 Nov 17 2020 python2.7
lrwxrwxrwx 1 root root 9 Feb 10 17:57 python3 -> python3.6
-rwxr-xr-x 2 root root 11328 Nov 17 2020 python3.6
-rwxr-xr-x 2 root root 11328 Nov 17 2020 python3.6m
After this change, Spark works.
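An alternative to re-linking the system python is to point Spark at the interpreter explicitly. Here is a minimal sketch, assuming the same Anaconda env path as in the listing above (purely as an example; the path must exist on every worker node, so adjust it to your own layout):
import os
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Interpreter the executors should use; path taken from the listing above.
py3 = "/home/hadoop/app/anaconda3/envs/bigdata/bin/python"
os.environ["PYSPARK_PYTHON"] = py3

conf = (SparkConf()
        .setAppName("SparkTest")
        .setMaster("spark://xxxxx:7077")
        # Equivalent Spark config key instead of the environment variable:
        .set("spark.pyspark.python", py3))

spark = SparkSession.builder.config(conf=conf).getOrCreate()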
https://stackoverflow.com/questions/70462636