Python Spark Installation and Configuration Steps


I. Installing Scala

Scala download location

https://www.scala-lang.org/files/archive/

1. Download the package

muyi@master:~$ wget http://www.scala-lang.org/files/archive/scala-2.12.7.tgz

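2. Extract the archive into the current directory (adjust the path to wherever the archive was downloaded)

tar xvf '/home/muyi/Desktop/scala-2.12.7.tgz'

3. Move the extracted directory to its install location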

sudo mv scala-2.12.7 /usr/local/scala

4. Edit the profile and add the two export lines below

gedit /home/muyi/.bash_profile

export SCALA_HOME=/usr/local/scala

export PATH=$PATH:$SCALA_HOME/bin

5. Apply the configuration

source /home/muyi/.bash_profile

6. Start Scala

muyi@master:~$ scala

****

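On CentOS and other Linux systems, most package installation errors are caused by missing dependencies. The error zipimport.ZipImportError: can't decompress data means that the zlib packages are missing, so the fix is simply to install them.

1. Open a terminal and run the following command to install the zlib packages:

yum -y install 'zlib*'

2. Go into the Python source directory and edit the Setup file under Modules:

vim Modules/Setup

Find the following line:

#zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz

Remove the leading comment marker so that it reads:

zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz

If the error appeared while installing Python itself, then once the packages above are installed, return to the Python source directory in a terminal and run:

make && make install

This rebuilds and reinstalls Python.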
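After the rebuild, one quick way to confirm that zlib support is really present is to import the module and round-trip some data. The following is a minimal sketch; the function name zlib_roundtrip_ok is purely illustrative and not part of any library.

```python
import zlib  # raises ImportError if Python was built without zlib support


def zlib_roundtrip_ok(payload: bytes = b"spark" * 100) -> bool:
    """Return True if data survives a compress/decompress round trip."""
    compressed = zlib.compress(payload)
    return zlib.decompress(compressed) == payload


if __name__ == "__main__":
    print("zlib available:", zlib_roundtrip_ok())
```

If the import itself fails, the interpreter being run is still the old build, or the Setup line was not picked up during compilation.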

****

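II. Installing Spark

Spark download location

https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

1. Download the package (the closer.lua link above is a mirror-selection page, so fetch the archive itself from the Apache archive server)

muyi@master:~$ wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz

2. Extract the archive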

tar zxf spark-2.4.0-bin-hadoop2.7.tgz

3. Move the extracted directory

sudo mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark/

4. Configure environment variables

gedit /home/muyi/.bash_profile

export SPARK_HOME=/usr/local/spark

export PATH=$PATH:$SPARK_HOME/bin

5. Apply the configuration

source /home/muyi/.bash_profile

6. Start PySpark

pyspark

7. Reduce the amount of log output shown by pyspark

Change to /usr/local/spark/conf

cp log4j.properties.template log4j.properties

Edit log4j.properties

Set log4j.rootCategory=WARN, console
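The same edit can be scripted instead of done by hand. The sketch below rewrites only the level on the log4j.rootCategory line and leaves the appender list alone. It assumes the template line reads log4j.rootCategory=INFO, console, as in the Spark 2.4 template; the helper name set_root_log_level is hypothetical.

```python
import re


def set_root_log_level(properties_text: str, level: str = "WARN") -> str:
    """Replace the level on the log4j.rootCategory line, keeping the appenders."""
    pattern = re.compile(r"^(log4j\.rootCategory\s*=\s*)\w+", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + level, properties_text)


# Sample text standing in for the copied template (assumed content).
template = (
    "# Set everything to be logged to the console\n"
    "log4j.rootCategory=INFO, console\n"
)
updated = set_root_log_level(template)
print(updated)
```

To apply it to the real file, read /usr/local/spark/conf/log4j.properties, pass its contents through the function, and write the result back.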

8. Create the test text files

Local path: file:/home/muyi/wordcount/input/LICENSE.txt

HDFS path: hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt

Reading the HDFS file requires the Hadoop services to be running (start-all.sh).
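The two paths differ in their URI scheme, which is what tells Spark where to look. As a rough illustration using only the Python standard library (not Spark's own path handling), the pieces of each URI can be pulled apart like this; the helper name describe is made up for the example.

```python
from urllib.parse import urlparse


def describe(uri: str):
    """Split a URI into (scheme, host:port, path)."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


local = describe("file:/home/muyi/wordcount/input/LICENSE.txt")
remote = describe("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")

print(local)   # scheme "file", no host, local filesystem path
print(remote)  # scheme "hdfs", NameNode host and port, path inside HDFS
```

The host and port in the hdfs URI are those of the NameNode, which is why the Hadoop services have to be up before that path can be read.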

III. Running Python Spark

1. Running PySpark locally

pyspark --master local[4]

sc.master

textFile = sc.textFile("file:/usr/local/spark/README.md")

textFile.count()

Note: local paths must use the file: scheme prefix.

In [3]: textFile = spark.read.text("file:/home/muyi/wordcount/input/LICENSE.txt")

In [4]: textFile.count()

Out[4]: 1594

In [6]: textFile = sc.textFile("file:/home/muyi/wordcount/input/LICENSE.txt")

In [7]: textFile.count()

Out[7]: 1594

Reading a file from HDFS

textFile = sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")

textFile.count()

In [3]: lines=sc.textFile("file:/home/muyi/hadooplist.txt")

In [5]: pairs=lines.map(lambda s:(s,1))

In [6]: counts = pairs.reduceByKey(lambda a,b:a+b)

In [7]: counts

Out[7]: PythonRDD[8] at RDD at PythonRDD.scala:53

In [8]: counts.count()

Out[8]: 11
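For readers who want to see what the map and reduceByKey steps compute without starting Spark, here is a plain-Python model of the same logic. It is only a sketch of the semantics: reduce_by_key is a hypothetical helper, the sample lines are invented, and nothing here is distributed.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func, like RDD.reduceByKey."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Invented sample data standing in for the lines of hadooplist.txt.
lines = ["hadoop", "spark", "hadoop", "hive"]
pairs = [(s, 1) for s in lines]                    # lines.map(lambda s: (s, 1))
counts = reduce_by_key(pairs, lambda a, b: a + b)  # pairs.reduceByKey(...)

print(counts)
print(len(counts))  # number of distinct keys, like counts.count()
```

In the session above, counts.count() returning 11 therefore means the file contains 11 distinct lines.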

In [14]: accum =sc.accumulator(0)

In [15]: accum

Out[15]: Accumulator

In [16]: sc.parallelize([1,2,3,4]).foreach(lambda x:accum.add(x))

In [17]: accum.value

Out[17]: 10

Exit

exit()

2. Running pyspark on Hadoop YARN

yarn-site.xml must be configured on both the master and the slave

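Values set in yarn-site.xml (the XML tags and all but one of the property names are missing from this listing):

mapreduce_shuffle

master:18040

master:18030

master:18025

master:18141

yarn.resourcemanager.webapp.address: master:18088

false

false

100000

10000

3000

2000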

Start Hadoop (start-all.sh), then launch pyspark against YARN:

HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop pyspark --master yarn --deploy-mode client

3. Running pyspark on Spark Standalone

3.1 Configure spark-env.sh

export JAVA_HOME=/usr/java/jdk1.8.0_181

export SCALA_HOME=/usr/local/scala

export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7

export SPARK_MASTER_IP=192.168.222.3

export SPARK_MASTER_PORT=7077

export SPARK_HOME=/usr/local/spark

export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop

export SPARK_DIST_CLASSPATH=$(/usr/hadoop/hadoop-2.7.7/bin/hadoop classpath)

export SPARK_WORKER_CORES=1

export SPARK_WORKER_MEMORY=1g

export SPARK_WORKER_INSTANCES=1

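3.2 Copy the Spark installation to the slave

3.3 Edit the slaves file on the master VM

Edit /usr/local/spark/conf/slaves on the master VM

Enter: slave

3.4 Start the services

Start Spark: /usr/local/spark/sbin/start-all.sh (Hadoop ships a script with the same name, so use the full path)

Start Hadoop: start-all.sh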

[muyi@master spark]$ pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m

Python 3.7.0 (default, Jun 28 2018, 13:15:42)

Type 'copyright', 'credits' or 'license' for more information

IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

18/12/07 04:38:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.7.0 (default, Jun 28 2018 13:15:42)

SparkSession available as 'spark'.

In [1]: ts=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")

In [2]: ts.count()

Out[2]: 1594

IV. Running Python Spark programs in IPython Notebook

Installing Anaconda3 on the Linux VM

1. Download the installer from the Tsinghua mirror

https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/

Command line:

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.3.1-Linux-x86_64.sh

2. Run the Anaconda installer script

bash Anaconda3-5.3.1-Linux-x86_64.sh -b

3. Edit ~/.bash_profile and add Anaconda's bin directory to PATH

export PATH=/home/muyi/anaconda3/bin:$PATH

export ANACONDA_PATH=/home/muyi/anaconda3

source ~/.bash_profile

4. Restart the VM

Commands for running IPython Notebook in each mode

4.1 Local

[muyi@master Desktop]$ cd '/home/muyi/pythonwork/ipynotebook'

[muyi@master ipynotebook]$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

4.2 Hadoop YARN client

In this mode only files on the cluster, i.e. files under HDFS paths, can be read.

Start Hadoop first: start-all.sh

[muyi@master ~]$ cd '/home/muyi/pythonwork/ipynotebook'

[muyi@master ipynotebook]$ PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop MASTER=yarn-client pyspark

Or

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop MASTER=yarn-client pyspark

4.3 Spark Standalone

First, start Hadoop: start-all.sh

Then start Spark: /usr/local/spark/sbin/start-all.sh

PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=spark://master:7077 pyspark --num-executors 1 --total-executor-cores 3 --executor-memory 512m

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=spark://master:7077 pyspark --num-executors 1 --total-executor-cores 2 --executor-memory 512m
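Because these launch lines pack several environment variables in front of the command, it can be less error-prone to assemble them in a small script. The sketch below only builds the environment and argument list; notebook_env and pyspark_command are hypothetical helper names, and the commented launch line assumes pyspark is on PATH.

```python
import os


def notebook_env(master, driver="jupyter", base_env=None):
    """Return a copy of the environment with the notebook-related variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = driver
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    env["MASTER"] = master
    return env


def pyspark_command(total_cores=2, executor_memory="512m"):
    """Build the pyspark argument list used for the standalone cluster."""
    return [
        "pyspark",
        "--total-executor-cores", str(total_cores),
        "--executor-memory", executor_memory,
    ]


env = notebook_env("spark://master:7077", base_env={})
cmd = pyspark_command()
print(env)
print(cmd)
# To launch for real (inheriting the current environment):
# import subprocess
# subprocess.run(pyspark_command(), env=notebook_env("spark://master:7077"))
```

Switching modes then only means changing the master string, for example to yarn-client for the YARN case.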

V. Using spark-submit

1. [muyi@master spark]$ ./bin/spark-submit examples/src/main/python/pi.py

2. [muyi@master spark]$ ./bin/spark-submit examples/src/main/python/wordcount.py 'file:/home/muyi/Desktop/test.txt'

3. Test whether Spark is installed correctly

Check the job in Hadoop YARN

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 examples/jars/spark-examples*.jar 10

4. Test a Python file with spark-submit

[muyi@master spark]$ ./bin/spark-submit examples/src/main/python/pi.py
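The bundled pi.py example estimates pi by random sampling: it throws points at a square and counts how many land inside the inscribed circle. The single-process sketch below shows the same idea in plain Python so the expected result of the job is easier to judge; estimate_pi is an illustrative name, and this is not the code of the Spark example itself.

```python
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of pi from points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # point falls inside the unit circle
            inside += 1
    return 4.0 * inside / samples


if __name__ == "__main__":
    print("Pi is roughly", estimate_pi(100_000))
```

A spark-submit run that prints a value close to 3.14 is a reasonable sign that the driver and executors are working together.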

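Notes

To run pyspark on Spark Standalone, Anaconda must be installed on the slave.

To run on YARN, both Anaconda and Spark must be installed on the master and on the slave.

Virtual memory must be set to at least 2 GB.

Final ~/.bash_profile configuration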

PATH=$PATH:$HOME/bin

export PATH

export JAVA_HOME=/usr/java/jdk1.8.0_181

export PATH=$JAVA_HOME/bin:$PATH

export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7

export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar

export SCALA_HOME=/usr/local/scala

export PATH=$PATH:$SCALA_HOME/bin

export SPARK_HOME=/usr/local/spark

export PATH=$PATH:$SPARK_HOME/bin

export ANACONDA_PATH=/home/muyi/anaconda3

export PATH=$PATH:$ANACONDA_PATH/bin

export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython

export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python

export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop

export HDFS_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop

export YARN_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop
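After sourcing the profile, it is worth checking that every variable actually made it into the environment before debugging Spark itself. The following is a small sketch of such a check; the list of names is taken from the profile above, and missing_variables is a hypothetical helper.

```python
import os

# Names taken from the profile above.
REQUIRED = (
    "JAVA_HOME",
    "HADOOP_HOME",
    "SCALA_HOME",
    "SPARK_HOME",
    "ANACONDA_PATH",
    "PYSPARK_DRIVER_PYTHON",
    "PYSPARK_PYTHON",
    "HADOOP_CONF_DIR",
)


def missing_variables(env, required=REQUIRED):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


# Demonstration with a hand-made environment rather than the real one.
example_env = {"JAVA_HOME": "/usr/java/jdk1.8.0_181", "SPARK_HOME": "/usr/local/spark"}
missing = missing_variables(example_env)
print("missing in example:", missing)
print("missing in this shell:", missing_variables(os.environ))
```

An empty list for the real environment means the profile was loaded in the current shell.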

VI. Python Spark RDD

Transformations

intRdd = sc.parallelize([3,1,2,5,5])

intRdd.collect()

stringRdd = sc.parallelize(['apple','orange','pear','apple'])

stringRdd.collect()

def addOne(x):
    return (x+1)

intRdd.map(addOne).collect()

intRdd.map(lambda x : x + 2).collect()

stringRdd.map(lambda x:'fruit:'+x).collect()

intRdd.filter(lambda x:x

intRdd.filter(lambda x:x==3).collect()

intRdd.filter(lambda x:x>1 and x

stringRdd.filter(lambda x:'a' in x).collect()

intRdd.distinct().collect()

srdd=intRdd.randomSplit([0.4,0.6])

srdd[0].collect()

srdd[1].collect()

grdd= intRdd.groupBy(lambda x: "even" if(x%2==0) else "odd").collect()

print(grdd[0][0],sorted(grdd[0][1]))

print(grdd[1][0],sorted(grdd[1][1]))
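The groupBy call returns (key, iterable) pairs, which is why the session indexes into grdd before printing. A plain-Python model of the grouping, using the same data and key function, may make the shape of the result clearer; group_by is a hypothetical helper and the ordering of groups in Spark is not guaranteed.

```python
from collections import defaultdict


def group_by(items, key_func):
    """Group items under the key computed by key_func, like RDD.groupBy."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)


groups = group_by([3, 1, 2, 5, 5], lambda x: "even" if x % 2 == 0 else "odd")
for key, values in groups.items():
    print(key, sorted(values))
```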

intRdd1=sc.parallelize([3,1,2,5,5])

intRdd2=sc.parallelize([5,6])

intRdd3=sc.parallelize([2,7])

intRdd1.union(intRdd2).union(intRdd3).collect()

intRdd1.intersection(intRdd2).collect()

intRdd1.subtract(intRdd2).collect()

intRdd1.cartesian(intRdd2).collect()
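The four multi-RDD operations treat duplicates differently, which is easy to miss in the output. The sketch below models them with ordinary Python lists on the same inputs: union keeps duplicates, intersection removes them, subtract drops every element that also occurs in the other collection, and cartesian pairs every element with every other. It is a model of the semantics only; Spark does not guarantee the element order shown here.

```python
from itertools import product

a = [3, 1, 2, 5, 5]
b = [5, 6]
c = [2, 7]

union = a + b + c                               # duplicates are kept
intersection = sorted(set(a) & set(b))          # duplicates are removed
subtract = [x for x in a if x not in set(b)]    # elements of a not found in b
cartesian = list(product(a, b))                 # every (x, y) combination

print(union)
print(intersection)
print(subtract)
print(len(cartesian), cartesian[:3])
```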

Actions

intRdd.first()

intRdd.take(2)

intRdd.takeOrdered(3,key=lambda x: -x)

intRdd.takeOrdered(3,key=lambda x: x)

intRdd.stats()

intRdd.min()

intRdd.max()

intRdd.stdev()

intRdd.count()

intRdd.sum()

intRdd.mean()
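To sanity-check the numbers these actions return, the same statistics can be computed locally with the standard library. Note that stdev() on an RDD is the population standard deviation, so the sketch uses statistics.pstdev rather than statistics.stdev. The variable names are illustrative.

```python
import heapq
import statistics

data = [3, 1, 2, 5, 5]

summary = {
    "count": len(data),
    "sum": sum(data),
    "mean": statistics.mean(data),
    "min": min(data),
    "max": max(data),
    "stdev": statistics.pstdev(data),  # population standard deviation
}
largest_three = heapq.nsmallest(3, data, key=lambda x: -x)  # takeOrdered(3, key=lambda x: -x)
smallest_three = heapq.nsmallest(3, data)                   # takeOrdered(3, key=lambda x: x)

print(summary)
print(largest_three, smallest_three)
```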

RDD key-value transformations

Word count example

[muyi@master Desktop]$ cd '/home/muyi/pythonwork/ipynotebook'

[muyi@master ipynotebook]$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark

textFile=sc.textFile("file:/home/muyi/pythonwork/ipynotebook/data/test.txt")

stringRDD=textFile.flatMap(lambda line : line.split(" "))

countsRDD=stringRDD.map(lambda word:(word,1)).reduceByKey(lambda x,y : x+y)

countsRDD.saveAsTextFile("file:/home/muyi/pythonwork/ipynotebook/data/output")

%ll data

%ll data/output

%cat data/output/part-00000

If a second run fails with an error saying that the output directory already exists, delete the output directory:

%rm -R data/output
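When the notebook is rerun often, the cleanup can be done from Python just before saving. The sketch below removes the directory only if it exists and demonstrates this on a temporary directory; clear_output_dir is a hypothetical helper, and it applies to local file: paths only, not to HDFS.

```python
import shutil
import tempfile
from pathlib import Path


def clear_output_dir(path) -> bool:
    """Delete path if it is an existing directory; return True if it was removed."""
    target = Path(path)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demonstration on a throwaway directory standing in for data/output.
demo_root = Path(tempfile.mkdtemp())
output_dir = demo_root / "output"
output_dir.mkdir()
(output_dir / "part-00000").write_text("('spark', 1)\n")

removed_first = clear_output_dir(output_dir)   # directory existed
removed_second = clear_output_dir(output_dir)  # nothing left to remove
shutil.rmtree(demo_root)
print(removed_first, removed_second)
```

In the notebook, calling the helper on /home/muyi/pythonwork/ipynotebook/data/output before saveAsTextFile avoids the error on reruns.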

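Integrated development environment

Upgrading GTK

Eclipse asks for GTK+ to be upgraded when it is started

yum install gtk2 gtk2-devel gtk2-devel-docs

Check whether GTK is installed

pkg-config --list-all | grep gtk

Check the version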

pkg-config --modversion gtk+-2.0

  • Original link: https://kuaibao.qq.com/s/20181220G0Z9YZ00?refer=cp_1026