
Hadoop 2.7 + Spark 2.4.0 + Scala 2.12.12 + PySpark pseudo-distributed environment setup


The first step toward doing machine learning with a big-data framework.

Environment: an Ubuntu virtual machine running in VMware.

This tutorial assumes you are comfortable with basic Linux operations, so it is kept fairly brief; leave a comment if you run into problems.

1. vim

sudo apt-get install vim

2. Java JDK installation

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261

export PATH=$JAVA_HOME/bin:$PATH
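
Where these two lines go is not stated above; a minimal sketch, assuming the JDK tarball has been extracted to /usr/lib/jvm/jdk1.8.0_261, is to append them to /etc/profile (or ~/.bashrc), reload, and verify:

sudo vim /etc/profile          # append the two export lines above
source /etc/profile
java -version                  # should report "1.8.0_261"
javac -version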

3. Scala installation

Download from: https://www.scala-lang.org/download/all.html

mv <extracted scala directory> /usr/local/scala

sudo vim /etc/profile

export SCALA_HOME=/usr/local/scala/scala-2.12.12

export PATH="$PATH: /usr/local/scala/scala-2.12.12/bin"

Download site for the Apache big-data components: http://archive.apache.org/dist/
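
For reference, a sketch of downloading the two archives used below; the exact paths under archive.apache.org are assumptions based on its usual layout, so verify them in a browser first:

wget http://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
wget http://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz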

4. Hadoop 2.7 installation

After downloading, extract it to a directory of your choice; mine is /usr/local/hadoop/hadoop-2.7.0.

sudo vim /etc/profile

sudo vim ~/.bashrc

Add:

export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.0

export PATH="

source /etc/profile

source ~/.bashrc

1. Open the hadoop-2.7.0/etc/hadoop/hadoop-env.sh file:

vim hadoop-2.7.0/etc/hadoop/hadoop-env.sh

# The java implementation to use. (modify the export below this comment)

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261

(paste with Ctrl+Shift+V)

2. Open the hadoop-2.7.0/etc/hadoop/core-site.xml file and edit it as follows:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

3. Open the hadoop-2.7.0/etc/hadoop/mapred-site.xml file (if it does not exist yet, create it by copying mapred-site.xml.template) and edit it as follows:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

4. Open the hadoop-2.7.0/etc/hadoop/hdfs-site.xml file and edit it as follows:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop/hadoop-2.7.0/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop/hadoop-2.7.0/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
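
The start-up step is not shown in the original post. A minimal sketch of formatting and starting a pseudo-distributed HDFS once the three files above are in place:

$HADOOP_HOME/bin/hdfs namenode -format     # format the NameNode once, before the first start
$HADOOP_HOME/sbin/start-dfs.sh             # start NameNode, DataNode and SecondaryNameNode
jps                                        # verify that the three daemons are running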

5. Spark installation

After downloading, extract it to a directory of your choice; mine is /usr/local/spark/spark.

sudo vim /etc/profile

sudo vim ~/.bashrc

Add:

export SPARK_HOME=/usr/local/spark/spark

export PATH=${SPARK_HOME}/bin:$PATH

source /etc/profile

source ~/.bashrc

Edit the spark-env.sh file (in the conf directory under $SPARK_HOME).

Before editing, copy and rename the template: cp spark-env.sh.template spark-env.sh

Then open spark-env.sh and append:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_261

export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.0

export SCALA_HOME=/usr/local/scala/scala-2.12.12

export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.0/etc/hadoop

export SPARK_MASTER_IP=ubuntu

export SPARK_WORKER_MEMORY=512M
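
Not covered in the original post: with spark-env.sh in place you can bring up the standalone master and one worker and run a quick smoke test. A sketch, assuming the hostname is ubuntu as set in SPARK_MASTER_IP above (the master web UI is then at http://ubuntu:8080):

$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh spark://ubuntu:7077
spark-shell --master spark://ubuntu:7077   # or just spark-shell for local mode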

6. SSH configuration

1. Install the SSH service

sudo apt-get install openssh-client

sudo apt-get install openssh-server

ssh-keygen -t rsa

cat ~/.ssh/id_rsa.pub

Add the SSH key to GitHub (under Settings, add a new SSH key); this is needed later to clone the spark-iforest repository over SSH.

2. Passwordless login

cd ~/.ssh

cat id_rsa.pub >> authorized_keys
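
A quick check (not in the original post) that passwordless login works; the chmod is only needed if sshd rejects the key because the file permissions are too open:

chmod 600 ~/.ssh/authorized_keys
ssh localhost                              # should log in without asking for a password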

7. Maven installation

Download page: http://maven.apache.org/download.cgi

Download and extract:

# tar -xvf apache-maven-3.6.3-bin.tar.gz

# sudo mkdir -p /usr/local/mvn
# sudo mv -f apache-maven-3.6.3 /usr/local/mvn/

Edit the /etc/profile file (sudo vim /etc/profile) and add the following lines at the end:

export M2_HOME=/usr/local/mvn/apache-maven-3.6.3

export PATH=${M2_HOME}/bin:$PATH

Save the file, then run the following commands to apply the environment variables and verify the install:

# source /etc/profile

# mvn -v

8. PySpark installation

sudo apt-get install python

sudo apt-get install python-pip

sudo pip install pyspark==2.4.0 -i https://pypi.doubanio.com/simple
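
A quick sanity check (not in the original post) that the pip-installed PySpark matches the Spark build set up above:

python -c "import pyspark; print(pyspark.__version__)"   # expect 2.4.0
pyspark                                                   # interactive PySpark shell; exit() to quit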

9. spark-iforest outlier detection

https://github.com/titicaca/spark-iforest

git clone git@github.com:titicaca/spark-iforest.git

Step 1:

cd spark-iforest/

mvn clean package -DskipTests

cp target/spark-iforest-<version>.jar $SPARK_HOME/jars/

Step 2:

cd spark-iforest/python

python setup.py sdist

pip install dist/pyspark-iforest-<version>.tar.gz

Test example:

from pyspark.ml.linalg import Vectors
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder.master("local[*]") \
    .appName("IForestExample") \
    .getOrCreate()

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([7.0, 9.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]

# NOTE: features need to be dense vectors for the model input
df = spark.createDataFrame(data, ["features"])

from pyspark_iforest.ml.iforest import *

# Init an IForest object
iforest = IForest(contamination=0.3, maxDepth=2)

# Fit on a given data frame
model = iforest.fit(df)

# Check whether the model has a summary; a newly trained model has the summary info
model.hasSummary

# Show model summary
summary = model.summary

# Show the number of anomalies
summary.numAnomalies

# Predict for a new data frame based on the fitted model
transformed = model.transform(df)

# Collect the Spark data frame into a local list of rows
rows = transformed.collect()

temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"

# Save the iforest estimator into the path
iforest.save(iforest_path)

# Load an iforest estimator from a path
loaded_iforest = IForest.load(iforest_path)

model_path = temp_path + "/iforest_model"

# Save the fitted model into the model path
model.save(model_path)

# Load a fitted model from a model path
loaded_model = IForestModel.load(model_path)

# The loaded model has no summary info
loaded_model.hasSummary

# Use the loaded model to predict a new data frame
loaded_model.transform(df).show()
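
To run the example end to end (not part of the original write-up), save the code above as, for example, test_iforest.py (the file name is only an illustration) and submit it with the standalone Spark installed earlier, whose jars directory already contains the spark-iforest jar from Step 1:

$SPARK_HOME/bin/spark-submit test_iforest.py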

The final output is shown in the figure below.
