Java Development on a Mac (Part 4): Setting Up a Hadoop Distributed Environment

In the big data era, distributed technology matters, so this post walks through setting up a Hadoop distributed environment to serve as a personal sandbox for learning big data tools.

First, an introduction to a free cloud-server provider that is friendly to students and early-stage startups (the free term does have to be renewed over and over): 三丰云 (SanFengYun). Official site:

https://www.sanfengyun.com/freeServer/

The author (jesse) requested two free cloud servers running CentOS and connects to them locally over SSH.

With that simple "hardware" ready, on to configuring the Hadoop distributed environment. Everything below is done in the terminal.

Step 1: SSH into the cloud servers:

// Terminal session:
(base) Jesse-Mac:~ jesse$ ssh root@111.67.204.---
root@111.67.204.---'s password: 
Last login: Tue Jul 30 16:20:40 2019
[root@localhost ~]# ls
anaconda-ks.cfg  install.sh  test

Step 2: Install Java:

yum install -y java-1.8.0-openjdk.x86_64

yum install -y java-1.8.0-openjdk-devel

java -version

// Go into the install directory
cd /usr/lib/jvm
ls -lh

// Set JAVA_HOME to /usr/lib/jvm/jre
echo 'export JAVA_HOME=/usr/lib/jvm/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/tools.jar' >> /etc/profile

source /etc/profile
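As a quick sanity check (not part of the original steps, just a sketch), verify that JAVA_HOME points at a directory that actually contains an executable bin/java. The demo below runs against a throwaway directory tree instead of the real /usr/lib/jvm/jre:

```shell
# Build a throwaway tree mimicking a JRE layout (demo only, not the real path)
demo=$(mktemp -d)
mkdir -p "$demo/jre/bin"
touch "$demo/jre/bin/java"
chmod +x "$demo/jre/bin/java"

# The actual check: is $JAVA_HOME/bin/java present and executable?
JAVA_HOME="$demo/jre"
if [ -x "$JAVA_HOME/bin/java" ]; then
  echo "JAVA_HOME looks good"
else
  echo "JAVA_HOME is wrong: $JAVA_HOME" >&2
fi
```

On the real server, the same `[ -x "$JAVA_HOME/bin/java" ]` test against /usr/lib/jvm/jre tells you whether the yum install and the profile edit line up.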

// Turn off the firewall
systemctl stop firewalld.service   # stop firewalld
systemctl disable firewalld.service   # keep firewalld from starting at boot

// Disable SELinux (the change takes effect after the next reboot)
yum install perl

perl -p -i.bak -e 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux

Step 3: Set the hostnames:

hostnamectl set-hostname hadoop001

// Use the IP addresses of the two rented cloud servers here.
// Append with >> rather than overwrite, so the default localhost entries survive.
echo '
111.67.194.--- hadoop001
111.67.204.--- hadoop002
' >> /etc/hosts
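Each line written to /etc/hosts must be an IP address followed by a hostname. A small awk check of the intended entries before touching the real file (the 192.0.2.x addresses below are documentation placeholders standing in for the masked IPs above):

```shell
# Validate hosts entries: field 1 a dotted-quad IP, field 2 a hostname.
# 192.0.2.x are placeholder addresses, not the real servers.
hosts_snippet='192.0.2.10 hadoop001
192.0.2.11 hadoop002'

echo "$hosts_snippet" | awk '
    NF >= 2 && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { ok++ }
    END { print ok " valid entries"; exit (ok == NR ? 0 : 1) }
'
```

A nonzero exit status means at least one line would not resolve as an IP/hostname pair.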

Step 4: Create a hadoop user, switch to it, and set up passwordless SSH:

groupadd bigdata

useradd -g bigdata -s /bin/bash hadoop

# Give hadoop sudo rights (visudo is the safer way to edit /etc/sudoers, but this works)
echo -e 'hadoop\tALL=(ALL)\tNOPASSWD:ALL' >> /etc/sudoers

// Switch to the hadoop user:
su hadoop


// SSH setup:
cd /home/hadoop
ssh-keygen


cat /home/hadoop/.ssh/id_rsa.pub

// Record the key that cat prints on each node; append every node's key to /home/hadoop/.ssh/authorized_keys on every node, and set the file permissions:
echo 'ssh-rsa AAAJTY9KBUyIP hadoop@hadoop001
ssh-rsa AAAkuBXlqN8T hadoop@hadoop002' >> ~/.ssh/authorized_keys
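One pitfall the steps above gloss over: sshd ignores authorized_keys when ~/.ssh or the file itself is group- or world-accessible. A sketch of collecting the public keys and locking down permissions, run against a throwaway directory with fake key material:

```shell
# Throwaway "home" directory; on a real node this would be /home/hadoop.
home=$(mktemp -d)
mkdir -p "$home/keys"
echo 'ssh-rsa FAKEKEY1 hadoop@hadoop001' > "$home/keys/hadoop001.pub"
echo 'ssh-rsa FAKEKEY2 hadoop@hadoop002' > "$home/keys/hadoop002.pub"

# Collect every node's key into authorized_keys and set the permissions sshd expects.
mkdir -p "$home/.ssh"
cat "$home/keys/"*.pub >> "$home/.ssh/authorized_keys"
chmod 700 "$home/.ssh"
chmod 600 "$home/.ssh/authorized_keys"

wc -l < "$home/.ssh/authorized_keys"   # one line per node's key
```

If passwordless login still prompts for a password, wrong permissions on .ssh or authorized_keys are the first thing to check.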

// Test that passwordless SSH works:
ssh hadoop001

ssh hadoop002

// Reboot:
exit
reboot

Step 5: Configure Hadoop:

Hadoop download page:

https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz

su hadoop

// Download hadoop
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz

// Extract and move the files
tar -zxvf hadoop-3.1.2.tar.gz

sudo mkdir -p /opt/hadoop

sudo chown hadoop:bigdata /opt/hadoop

mv hadoop-3.1.2 /opt/hadoop/hadoop-3.1.2

// Switch back to root to add HADOOP_HOME as a global variable
exit 

echo '
export HADOOP_HOME="/opt/hadoop/hadoop-3.1.2"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
' >> /etc/profile

source /etc/profile


// Back to the hadoop user
su hadoop

// Add JAVA_HOME to the Hadoop env scripts by inserting one line at the top: export JAVA_HOME=/usr/lib/jvm/jre
sed -i '1i\export JAVA_HOME=/usr/lib/jvm/jre' ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
sed -i '1i\export JAVA_HOME=/usr/lib/jvm/jre' ${HADOOP_HOME}/etc/hadoop/mapred-env.sh
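The `1i` sed command prepends a line above line 1 of the target file. A scratch-file demo of the edit (using the `export` form, which also makes the variable visible to subshells):

```shell
# Same '1i' insertion, applied to a scratch file instead of hadoop-env.sh
tmpenv=$(mktemp)
printf '# existing hadoop-env.sh content\n' > "$tmpenv"

sed -i '1i\export JAVA_HOME=/usr/lib/jvm/jre' "$tmpenv"

head -1 "$tmpenv"   # prints: export JAVA_HOME=/usr/lib/jvm/jre
```

GNU sed (the CentOS default) accepts the inline `1i\text` form; BSD sed on the Mac itself would need a different invocation, but these commands run on the servers.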

Step 6: Test Hadoop:

// First create a data.txt file (put some text in it, or wordcount will have nothing to count); terminal session:
[root@localhost test]# mkdir input
[root@localhost test]# cd input/
[root@localhost input]# touch data.txt
[root@localhost input]# ls
data.txt

// Hadoop ships with a word-count MapReduce example program
[root@localhost mapreduce]# cd /opt/hadoop/hadoop-3.1.2/share/hadoop/mapreduce
[root@localhost mapreduce]# ls
hadoop-mapreduce-client-app-3.1.2.jar
hadoop-mapreduce-client-common-3.1.2.jar
hadoop-mapreduce-client-core-3.1.2.jar
hadoop-mapreduce-client-hs-3.1.2.jar
hadoop-mapreduce-client-hs-plugins-3.1.2.jar
hadoop-mapreduce-client-jobclient-3.1.2.jar
hadoop-mapreduce-client-jobclient-3.1.2-tests.jar
hadoop-mapreduce-client-nativetask-3.1.2.jar
hadoop-mapreduce-client-shuffle-3.1.2.jar
hadoop-mapreduce-client-uploader-3.1.2.jar
hadoop-mapreduce-examples-3.1.2.jar
jdiff
lib
lib-examples
sources
[root@localhost mapreduce]# hadoop jar hadoop-mapreduce-examples-3.1.2.jar wordcount /root/test/input/data.txt /root/test/output/

// When the job finishes, go into /root/test/output/
[root@localhost output]#  cd /root/test/output/
[root@localhost output]# ls
part-r-00000  _SUCCESS  // the actual results are in part-r-00000; _SUCCESS is just a status marker.
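For intuition about what part-r-00000 contains: the example job emits each word with its count, tab-separated. A toy stand-in in plain shell (this is not how Hadoop computes it, just the same output shape):

```shell
# Toy word count over two sample lines, mirroring the wordcount output format
printf 'hello hadoop\nhello world\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}'
```

This prints `hadoop 1`, `hello 2`, `world 1` (tab-separated), which is the shape of the lines you will find in part-r-00000.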

Step 7: Edit core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and the workers file (named slaves before Hadoop 3):

# core-site.xml

echo '<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9000</value>
        <description>NameNode hostname and port</description>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
        <description>I/O buffer size</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/hadoop-3.1.2/tmp</value>
        <description>Directory for temporary files</description>
    </property>
    <property>
        <name>fs.checkpoint.period</name>
        <value>3600</value>
        <description>Maximum time between checkpoints of the edit log</description>
    </property>
    <property>
        <name>hadoop.security.authorization</name>
        <value>false</value>
    </property>
</configuration>' > ${HADOOP_HOME}/etc/hadoop/core-site.xml

# hdfs-site.xml

echo '<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description>Replication factor</description>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file://${hadoop.tmp.dir}/name</value>
        <description>Where the NameNode persists the namespace and transaction logs on the local filesystem</description>
    </property>
    <property>
        <name>dfs.namenode.hosts</name>
        <value>hadoop001,hadoop002</value>
        <description>The two DataNodes</description>
    </property>
    <property>
        <name>dfs.blocksize</name>
        <value>11534336</value>
        <description>11 MB HDFS block size; over an ordinary network link a large block size buys you nothing</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file://${hadoop.tmp.dir}/data</value>
        <description>Where DataNodes store blocks on the local filesystem</description>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>hadoop002:50090</value>
        <description>Run the secondary NameNode on the second worker</description>
    </property>
</configuration>
' > ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml
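One detail worth double-checking in hdfs-site.xml: dfs.blocksize is given in bytes, and the value above is exactly 11 MiB:

```shell
# 11 MiB expressed in bytes matches the dfs.blocksize value
echo $((11 * 1024 * 1024))   # 11534336
```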

# yarn-site.xml

echo '<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
        <description>
            Common choices: CapacityScheduler, FairScheduler, or FifoScheduler. Fair scheduling is used here;
            the capacity alternative is org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
        </description>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop001</value>
        <description>Run the ResourceManager on hadoop001</description>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
        <description>Enable log aggregation</description>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>106800</value>
        <description>How long aggregated logs are kept on HDFS</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>3096</value>
        <description>Total physical memory available to containers</description>
    </property>
    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>file://${hadoop.tmp.dir}/nodemanager</value>
        <description>Comma-separated list of directories</description>
    </property>
    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>file://${hadoop.tmp.dir}/nodemanager/logs</value>
        <description>Comma-separated list of directories</description>
    </property>
    <property>
        <name>yarn.nodemanager.log.retain-seconds</name>
        <value>10800</value>
        <description>In seconds</description>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>' > ${HADOOP_HOME}/etc/hadoop/yarn-site.xml

# mapred-site.xml

echo '<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>Use Hadoop YARN as the execution framework</description>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>1024</value>
        <description>Memory limit for maps</description>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx512M</value>
        <description>Heap size of the map JVM children</description>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1024</value>
        <description>Memory limit for reduces</description>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx512M</value>
        <description>Heap size of the reduce JVM children</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop001:10200</value>
        <description>Run the MapReduce history server on hadoop001</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop001:19888</value>
        <description>History server web UI address and port</description>
    </property>
</configuration>
' > ${HADOOP_HOME}/etc/hadoop/mapred-site.xml

# workers (this file was named slaves before Hadoop 3; 3.1.2 reads etc/hadoop/workers)

echo 'hadoop001
hadoop002' > ${HADOOP_HOME}/etc/hadoop/workers

Step 8: Format the NameNode:

# run on both hadoop001 and hadoop002:
mkdir -p ${HADOOP_HOME}/tmp

# run on hadoop001 only
hdfs namenode -format

Step 9: Start everything:

start-dfs.sh

start-yarn.sh

# mr-jobhistory-daemon.sh still works in 3.1.2 but is deprecated;
# the Hadoop 3 equivalent is: mapred --daemon start historyserver
mr-jobhistory-daemon.sh start historyserver

Running start-dfs.sh failed with a permissions error. Go into the directory that contains logs and run ll:

[hadoop@hadoop002 test]$ cd /opt/hadoop/hadoop-3.1.2
[hadoop@hadoop002 hadoop-3.1.2]$ ll
total 184
drwxrwxrwx 2 hadoop bigdata   4096 Jan 29 11:35 bin
drwxrwxrwx 3 hadoop bigdata     19 Jan 29  2019 etc
drwxrwxrwx 2 hadoop bigdata    101 Jan 29 11:35 include
drwxrwxrwx 3 hadoop bigdata     19 Jan 29 11:35 lib
drwxrwxrwx 4 hadoop bigdata   4096 Jan 29 11:36 libexec
-rwxrwxrwx 1 hadoop bigdata 147145 Jan 23  2019 LICENSE.txt
drwxr-xr-x 2 root   root        36 Jul 30 20:41 logs
-rwxrwxrwx 1 hadoop bigdata  21867 Jan 23  2019 NOTICE.txt
-rwxrwxrwx 1 hadoop bigdata   1366 Jan 23  2019 README.txt
drwxrwxrwx 3 hadoop bigdata   4096 Jul 31 00:47 sbin
drwxrwxrwx 4 hadoop bigdata     29 Jan 29 12:05 share
drwxr-xr-x 3 root   root        17 Jul 30 19:51 tmp

The logs and tmp directories are owned by root rather than hadoop. Fix it by changing their owner and group:

chown -R hadoop:bigdata logs
chown -R hadoop:bigdata tmp

After that, it is usually smooth sailing.

Finally, as a check, confirm that the cluster's HDFS root directory is visible:

[hadoop@hadoop001 test]$ hdfs dfs -ls /
Found 1 items
drwxrwx---   - hadoop supergroup          0 2019-07-31 01:28 /tmp
[hadoop@hadoop001 test]$ 

Done.


Originally published on the WeChat official account MiningAlgorithms (gh_d0cc50d1ed34), 2019-08-01.
