
How to install a package (e.g. mmlspark) on a CDH cluster without network access?

Stack Overflow user
Asked 2020-07-29 13:36:53
Answers 2 · Views 711 · Followers 0 · Votes 4

Because maven.org is hard to reach from China, I cannot install mmlspark with:

pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1 --repositories=https://mmlspark.azureedge.net/maven

which fails with:

:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/com/microsoft/ml/lightgbm/lightgbmlib/2.3.100/lightgbmlib-2.3.100.pom (java.net.ConnectException: Connection timed out (Connection timed out))

        Server access error at url https://repo1.maven.org/maven2/com/microsoft/ml/lightgbm/lightgbmlib/2.3.100/lightgbmlib-2.3.100.jar (java.net.ConnectException: Connection timed out (Connection timed out))


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: com.microsoft.ml.lightgbm#lightgbmlib;2.3.100: not found, download failed: com.microsoft.ml.spark#mmlspark_2.11;1.0.0-rc1!mmlspark_2.11.jar, download failed: org.scalatest#scalatest_2.11;3.0.5!scalatest_2.11.jar(bundle), download failed: com.microsoft.cntk#cntk;2.4!cntk.jar, download failed: org.openpnp#opencv;3.2.0-1!opencv.jar(bundle)]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1308)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:926)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:935)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[TerminalIPythonApp] WARNING | Unknown error in handling PYTHONSTARTUP file /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/python/pyspark/shell.py:

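Manual install attempt

I have an Amazon EC2 instance that can reach maven.org, so I downloaded all the packages there and copied them to the local CDH cluster under /opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/

Then I set the config.

First, spark-defaults.conf:
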
spark.driver.extraClassPath=/opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/*
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/*

Second, spark-env.sh:

export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/spark/jars/mmlspark_jars/*:$SPARK_CLASSPATH

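I can see that the jars are loaded,

but import mmlspark still fails with ModuleNotFoundError: No module named 'mmlspark'.

After some effort

I found that if I unzip mmlspark.jar, zip the mmlspark folder inside it, put the archive on HDFS ( hdfs://test/mmlspark.zip ), and pass it via py-files ( --py-files hdfs://test/mmlspark.zip ), then import mmlspark succeeds.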
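
The unzip-and-rezip step described above can be sketched in Python with the standard zipfile module. This is only an illustration, not the exact commands used in the question: the function name extract_python_package and the file paths in the usage comment are hypothetical, and it assumes the jar stores the Python package under a top-level mmlspark/ folder, as the question reports.

```python
import zipfile


def extract_python_package(jar_path, zip_path, package="mmlspark"):
    """Copy every file under `package/` from the jar into a new zip archive.

    Returns the number of files copied. A jar is an ordinary zip archive,
    so zipfile can read it directly.
    """
    prefix = package.rstrip("/") + "/"
    copied = 0
    with zipfile.ZipFile(jar_path) as jar, \
            zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as out:
        for info in jar.infolist():
            # Keep only the Python package; skip .class files and other entries.
            if info.filename.startswith(prefix) and not info.is_dir():
                out.writestr(info.filename, jar.read(info.filename))
                copied += 1
    return copied


# Hypothetical usage (paths are placeholders, not taken from the question):
# extract_python_package("/tmp/mmlspark.jar", "/tmp/mmlspark.zip")
```

The resulting archive would then be uploaded to HDFS and passed with --py-files as described above.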

I started a pyspark shell with the dependency jars and mmlspark.zip:

pyspark --jars "/user/spark/mmlspark_jars/com.github.vowpalwabbit_vw-jni-8.7.0.3.jar,/user/spark/mmlspark_jars/com.jcraft_jsch-0.1.54.jar,/user/spark/mmlspark_jars/com.microsoft.cntk_cntk-2.4.jar,/user/spark/mmlspark_jars/com.microsoft.ml.lightgbm_lightgbmlib-2.3.100.jar,/user/spark/mmlspark_jars/com.microsoft.ml.spark_mmlspark_2.11-1.0.0-rc1.jar,/user/spark/mmlspark_jars/commons-codec_commons-codec-1.10.jar,/user/spark/mmlspark_jars/commons-logging_commons-logging-1.2.jar,/user/spark/mmlspark_jars/io.spray_spray-json_2.11-1.3.2.jar,/user/spark/mmlspark_jars/org.apache.httpcomponents_httpclient-4.5.6.jar,/user/spark/mmlspark_jars/org.apache.httpcomponents_httpcore-4.4.10.jar,/user/spark/mmlspark_jars/org.openpnp_opencv-3.2.0-1.jar,/user/spark/mmlspark_jars/org.scala-lang.modules_scala-xml_2.11-1.0.6.jar,/user/spark/mmlspark_jars/org.scala-lang_scala-reflect-2.11.12.jar,/user/spark/mmlspark_jars/org.scalactic_scalactic_2.11-3.0.5.jar,/user/spark/mmlspark_jars/org.scalatest_scalatest_2.11-3.0.5.jar" --py-files hdfs://test/mmlspark.zip

Test code

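import numpy as np
import pandas as pd
from sklearn import datasets

# Load iris and build a pandas DataFrame with four features and a label
iris = datasets.load_iris()
X = iris.data
Y = iris.target
df = np.column_stack([X, Y])
df = pd.DataFrame(df)
df.columns = ['f1', 'f2', 'f3', 'f4', 'label']
feature_cols = ['f1', 'f2', 'f3', 'f4']
# `spark` is the SparkSession provided by the pyspark shell
df = spark.createDataFrame(df)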

from pyspark.ml.feature import VectorAssembler
vec_assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
df1 = vec_assembler.transform(df)

from mmlspark.lightgbm import LightGBMRegressor
model = LightGBMRegressor(objective='quantile',
                          alpha=0.2,
                          learningRate=0.3,
                          numLeaves=31,
                          featuresCol='features',
                          labelCol='label').fit(df1)

Error

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-55-fe341b86ea18> in <module>
     18                           numLeaves=31,
     19                          featuresCol='features',
---> 20                          labelCol='label').fit(df1)

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
    130                 return self.copy(params)._fit(dataset)
    131             else:
--> 132                 return self._fit(dataset)
    133         else:
    134             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit(self, dataset)
    293 
    294     def _fit(self, dataset):
--> 295         java_model = self._fit_java(dataset)
    296         model = self._create_model(java_model)
    297         return self._copyValues(model)

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    289         :return: fitted Java model
    290         """
--> 291         self._transfer_params_to_java()
    292         return self._java_obj.fit(dataset._jdf)
    293 

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _transfer_params_to_java(self)
    125                 self._java_obj.set(pair)
    126             if self.hasDefault(param):
--> 127                 pair = self._make_java_param_pair(param, self._defaultParamMap[param])
    128                 pair_defaults.append(pair)
    129         if len(pair_defaults) > 0:

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/ml/wrapper.py in _make_java_param_pair(self, param, value)
    111         sc = SparkContext._active_spark_context
    112         param = self._resolveParam(param)
--> 113         java_param = self._java_obj.getParam(param.name)
    114         java_value = _py2java(sc, value)
    115         return java_param.w(java_value)

/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/cloudera/parcels/CDH/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o1298.getParam.
: java.util.NoSuchElementException: Param metric does not exist.
    at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
    at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:729)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.ml.param.Params$class.getParam(params.scala:728)
    at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42)
    at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

I think this error occurs because the Python side of mmlspark cannot load the jar, which leads to the Py4JJavaError. But I am not sure; I have already tried everything I know.

2 Answers

Stack Overflow user

Accepted answer

Posted 2020-08-03 17:34:38

Finally, I got it working. The key is to pass the .jar files to pyFiles. I was very surprised that Python can read a .jar.
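
One plausible reason this works: a .jar is a zip archive, and Python can import packages straight from a zip archive that is on sys.path (the standard zipimport mechanism), whatever the file extension. The question already found an mmlspark folder inside the jar, so the jar can serve as both the JVM dependency and the Python package. The sketch below shows the mechanism without Spark; demo.jar, demo_pkg and the placeholder class entry are made-up names for illustration only.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in "jar" that holds one Python package plus a dummy
# class file. All names here are hypothetical.
workdir = tempfile.mkdtemp()
jar_path = os.path.join(workdir, "demo.jar")
with zipfile.ZipFile(jar_path, "w") as jar:
    jar.writestr("demo_pkg/__init__.py", "GREETING = 'loaded from a jar'\n")
    jar.writestr("com/example/Placeholder.class", b"\xca\xfe\xba\xbe")

# Putting the archive itself on sys.path is enough: zipimport does the rest.
sys.path.insert(0, jar_path)
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

print(demo_pkg.GREETING)   # prints: loaded from a jar
print(demo_pkg.__file__)   # a path inside demo.jar
```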

bash:

pyspark \
--master yarn \
--conf spark.submit.pyFiles=hdfs://pupuxdc/user/spark/mmlspark_jars/…… .jar \
--conf spark.yarn.dist.jars=hdfs://pupuxdc/user/spark/mmlspark_jars/…… .jar

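pyspark code:

from pyspark.sql import SparkSession

# app_name is assumed to be defined earlier
spark_builder = (
    SparkSession
    .builder
    .config("spark.port.maxRetries", 100)
    .appName(app_name))

spark = spark_builder.getOrCreate()
# jar_files: the list of jar paths (elided in the original)
jar_files = [...]
# Register each jar as a Python dependency so its bundled Python package can be imported
for jar in jar_files:
    spark.sparkContext.addPyFile(jar)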

Note that .config('spark.submit.pyFiles=hdfs://pupuxdc/user/spark/mmlspark_jars/…… .jar') does not take effect.

Votes 0

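Stack Overflow user

Posted 2020-11-17 01:24:11

I tried to comment but don't have enough reputation, so I am posting here: for those using HDP, the same approach as in Mithril's answer works.

Also, you do not need to upload the jar files to HDFS. Reading the jar files from a local directory works just as well.

bash:
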
pyspark \
--master yarn \
--py-files /<path>/.jar,/<path>/.jar,/<path>/.jar... \
--jars /<path>/.jar,/<path>/.jar,/<path>/.jar...

It also works in Jupyter Notebook. Just add the following line before creating the SparkSession.

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--py-files /<path>/.jar,/<path>/.jar,/<path>/.jar... --jars /<path>/.jar,/<path>/.jar,/<path>/.jar...'
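
Typing out a long comma-separated jar list by hand is error-prone, so the argument string could be built from a directory listing instead. The following is a minimal sketch under stated assumptions: build_submit_args is a hypothetical helper, not part of PySpark, and the trailing pyspark-shell token is an assumption that is not in the answer above (many PySpark setups expect it at the end of PYSPARK_SUBMIT_ARGS; pass suffix="" to omit it).

```python
import glob
import os


def build_submit_args(jar_dir, suffix="pyspark-shell"):
    """Return a PYSPARK_SUBMIT_ARGS string that lists every .jar in jar_dir.

    The same jars are passed to both --py-files and --jars, as in the
    answer above. `suffix` is an assumption, see the note before this block.
    """
    jars = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))
    if not jars:
        raise FileNotFoundError("no .jar files found in " + jar_dir)
    joined = ",".join(jars)
    args = "--py-files {0} --jars {0}".format(joined)
    return (args + " " + suffix).strip()


# Hypothetical usage, before the SparkSession is created:
# os.environ['PYSPARK_SUBMIT_ARGS'] = build_submit_args('/<path>')
```

Paths that contain commas or spaces are not handled by this sketch.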

Votes 0

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/63146931
