Mac: Hadoop + Hive with S3 in a Pseudo-Distributed Environment

Original
Author: 框框不是欢欢
Modified: 2022-04-26 16:59:01
Published in the column: 大数据探索 (Big Data Exploration)

Hadoop environment

Environment details

Setup type: pseudo-distributed

JDK: Java 1.8, installed at /Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home

Hadoop version: hadoop-3.2.3

Configure passwordless SSH

1. Grant remote login permission (on macOS, enable Remote Login under System Preferences > Sharing).

2. Create an SSH key pair:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

3. Append the public key to the authorized keys file:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

4. Verify with ssh localhost; if it logs you in without prompting for a password, you're set.
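The key-generation and authorization steps above can be rehearsed safely against a throwaway directory first (the temp-dir paths below are illustrative; the real setup uses ~/.ssh):

```shell
# Sketch: repeat steps 2-3 against a temp dir instead of ~/.ssh,
# to see exactly which files passwordless login relies on.
tmpdir=$(mktemp -d)
ssh-keygen -t rsa -P '' -f "$tmpdir/id_rsa" -q       # writes key pair
cat "$tmpdir/id_rsa.pub" >> "$tmpdir/authorized_keys" # authorize the key
chmod 600 "$tmpdir/authorized_keys"                   # sshd rejects lax permissions
ls "$tmpdir"                                          # id_rsa, id_rsa.pub, authorized_keys
```

The chmod matters: sshd silently ignores an authorized_keys file that is group- or world-writable.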

Download Hadoop

1. Download: https://dlcdn.apache.org/hadoop/common/hadoop-3.2.3/hadoop-3.2.3.tar.gz

2. Extract hadoop-3.2.3.tar.gz; in this guide it lives at ~/Documents/java/hadoop-3.2.3.

Pseudo-distributed setup

This article uses S3 as the file-system backend; the plain HDFS-backed setup is not covered here.

1. In hadoop-env.sh, add the JAVA_HOME setting:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home

2. In core-site.xml, add the following:
<configuration>
   <property>
    <name>fs.defaultFS</name>
    <value>s3a://mybucket</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>*******</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>*******</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
   <property>
    <name>fs.s3a.endpoint</name>
    <value>http://s3.ap-northeast-1.amazonaws.com</value>
  </property> 
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
      <name>hadoop.tmp.dir</name>
      <value>/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
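One hedged note on the hadoop.tmp.dir value above: /hadoop/tmp is not created automatically, and on recent macOS versions the root volume is read-only, so the directory may be impossible to create there. A small sketch that falls back to a user-writable path (the fallback location is illustrative; whichever path you end up with must match core-site.xml):

```shell
# Try the configured temp base; if it cannot be created (no root access,
# read-only volume), fall back to a directory under $HOME.
base=/hadoop/tmp
mkdir -p "$base" 2>/dev/null || { base="$HOME/hadoop-tmp"; mkdir -p "$base"; }
echo "hadoop.tmp.dir should point at: $base"
```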

3. In hdfs-site.xml, add the following:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.namenode.servicerpc-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.namenode.http-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.namenode.https-bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>dfs.client.use.datanode.hostname</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.datanode.use.datanode.hostname</name>
        <value>false</value>
    </property>
</configuration>

4. In mapred-site.xml, add the following:
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

5. In yarn-site.xml, add the following:
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
    </property>
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>mapred.map.output.compress.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
    <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.resourcemanager.bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>yarn.nodemanager.bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>yarn.timeline-service.bind-host</name>
        <value>0.0.0.0</value>
    </property>
    <property>
        <name>yarn.application.classpath</name>
        <value>
            /Users/sheen/Documents/java/hadoop-3.2.3/etc/hadoop,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/common/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/common/lib/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/hdfs/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/hdfs/lib/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/mapreduce/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/mapreduce/lib/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/yarn/*,
            /Users/sheen/Documents/java/hadoop-3.2.3/share/hadoop/yarn/lib/*
        </value>
    </property>
</configuration>
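The long yarn.application.classpath list above follows a regular pattern, so it can be generated rather than typed by hand. A sketch (HADOOP_HOME defaults below to the path used in this article; on a working install, `hadoop classpath` prints the authoritative value):

```shell
# Emit the classpath entries used above from HADOOP_HOME; the comma-joined
# result is what yarn.application.classpath expects.
HADOOP_HOME="${HADOOP_HOME:-$HOME/Documents/java/hadoop-3.2.3}"
entries="$HADOOP_HOME/etc/hadoop"
for m in common hdfs mapreduce yarn; do
  entries="$entries,$HADOOP_HOME/share/hadoop/$m/*,$HADOOP_HOME/share/hadoop/$m/lib/*"
done
echo "$entries"
```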

Troubleshooting

1. With YARN using S3 as the file system, submitting a Hive job fails with the following exception:
java.io.IOException: Resource s3a://yarn/user/root/DistributedShell/application_1641533299713_0002/ExecScript.sh changed on src filesystem (expected 1641534006000, was 1641534011000
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:273)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Cause: the exception comes from the org.apache.hadoop.yarn.util.FSDownload class in the hadoop-yarn-common module. Copying a resource to S3 changes the object's modification timestamp (HDFS preserves it), so the timestamp check in verifyAndCopy fails:
private void verifyAndCopy(Path destination)
    throws IOException, YarnException {
  final Path sCopy;
  try {
    sCopy = resource.getResource().toPath();
  } catch (URISyntaxException e) {
    throw new IOException("Invalid resource", e);
  }
  FileSystem sourceFs = sCopy.getFileSystem(conf);
  FileStatus sStat = sourceFs.getFileStatus(sCopy);
  if (sStat.getModificationTime() != resource.getTimestamp()) {
    throw new IOException("Resource " + sCopy + " changed on src filesystem" +
        " - expected: " +
        "\"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
        ", was: " +
        "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
        ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"");
  }
  if (resource.getVisibility() == LocalResourceVisibility.PUBLIC) {
    if (!isPublic(sourceFs, sCopy, sStat, statCache)) {
      throw new IOException("Resource " + sCopy +
          " is not publicly accessible and as such cannot be part of the" +
          " public cache.");
    }
  }
 
  downloadAndUnpack(sCopy, destination);
}
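The failure mode can be imitated locally without S3: a plain copy gets a fresh modification time, so any consumer pinned to the original timestamp (as FSDownload is) sees a mismatch.

```shell
# Local stand-in for the S3 behavior: the copy's mtime is newer than the
# source's, which is exactly the comparison verifyAndCopy trips over.
src=$(mktemp)
sleep 1                        # ensure a visibly newer timestamp
cp "$src" "$src.copy"          # cp assigns a new mtime by default
[ "$src.copy" -nt "$src" ] && echo "copy is newer than source"
```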

Solution:

1. Clone the Hadoop source from GitHub: https://github.com/apache/hadoop

2. Check out the branch-3.2.3 branch and edit the org.apache.hadoop.yarn.util.FSDownload class in the hadoop-yarn-common module, downgrading the timestamp mismatch from an exception to a warning:
private void verifyAndCopy(Path destination)
    throws IOException, YarnException {
  final Path sCopy;
  try {
    sCopy = resource.getResource().toPath();
  } catch (URISyntaxException e) {
    throw new IOException("Invalid resource", e);
  }
  FileSystem sourceFs = sCopy.getFileSystem(conf);
  FileStatus sStat = sourceFs.getFileStatus(sCopy);
  if (sStat.getModificationTime() != resource.getTimestamp()) {
    /*
    throw new IOException("Resource " + sCopy + " changed on src filesystem" +
        " - expected: " +
        "\"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
        ", was: " +
        "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
        ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"");
    */
    LOG.warn("Resource " + sCopy + " changed on src filesystem" +
            " - expected: " +
            "\"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
            ", was: " +
            "\"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
            ", current time: " + "\"" + Times.formatISO8601(Time.now()) + "\"" +
            ". Stop showing exception here, use a warning instead.");
  }
  if (resource.getVisibility() == LocalResourceVisibility.PUBLIC) {
    if (!isPublic(sourceFs, sCopy, sStat, statCache)) {
      throw new IOException("Resource " + sCopy +
          " is not publicly accessible and as such cannot be part of the" +
          " public cache.");
    }
  }
 
  downloadAndUnpack(sCopy, destination);
}

3. Rebuild and package hadoop-yarn-common.

4. Copy the resulting hadoop-yarn-common-3.2.3.jar into hadoop-3.2.3/share/hadoop/yarn, replacing the original jar.
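Steps 3-4 in command form. This is a sketch under assumptions: the source checkout lives at ~/Documents/java/hadoop, Maven is installed, and the module path is hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common as in the upstream tree; adjust paths to your layout.

```shell
# Rebuild only hadoop-yarn-common (plus its upstream modules) and swap the
# jar into the running installation. All paths here are illustrative.
cd ~/Documents/java/hadoop
git checkout branch-3.2.3
mvn -pl hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common -am package -DskipTests
cp hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/target/hadoop-yarn-common-3.2.3.jar \
   ~/Documents/java/hadoop-3.2.3/share/hadoop/yarn/
```

Restart YARN after swapping the jar so NodeManagers pick up the patched class.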

Hive environment

Download Hive

1. Download: https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

2. Extract apache-hive-3.1.2-bin.tar.gz; local directory: ~/Documents/java/apache-hive-3.1.2-bin

Hive configuration

1. Download the MySQL JDBC driver into Hive's lib directory (note: the driver jar itself, not the -sources jar):

cd ~/Documents/java/apache-hive-3.1.2-bin/lib
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.16/mysql-connector-java-8.0.16.jar

2. Link in the S3 support jars from Hadoop (the glob picks up hadoop-aws and the bundled AWS SDK jar); symlinks are used here:

mkdir ~/Documents/java/apache-hive-3.1.2-bin/auxlib
ln -s ~/Documents/java/hadoop-3.2.3/share/hadoop/tools/lib/*aws* ~/Documents/java/apache-hive-3.1.2-bin/auxlib/

3. In hive-env.sh, add the following:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_291.jdk/Contents/Home
export HADOOP_HOME=/Users/sheen/Documents/java/hadoop-3.2.3
export HIVE_HOME=/Users/sheen/Documents/java/apache-hive-3.1.2-bin
export HIVE_AUX_JARS_PATH=$HIVE_HOME/auxlib

4. Create a hive-site.xml file with the following content:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.querylog.location</name>
    <value>/hive/tmp</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/hive/tmp</value>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/hive/tmp</value>
  </property>
</configuration>

5. Initialize the Hive metastore schema (the credentials must match the hive-site.xml above):

./bin/schematool -initSchema -dbType mysql -userName root -passWord 123456

6. Create a core-site.xml under hive/conf with the following content:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://mybucket</value>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <description>The credential provider type.</description>
    <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
  </property>
  <property>
    <name>fs.s3a.bucket.mybucket.access.key</name>
    <value>*****</value>
  </property>
  <property>
    <name>fs.s3a.bucket.mybucket.secret.key</name>
    <value>******</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
   <property>
    <name>fs.s3a.endpoint</name>
    <value>http://s3.ap-northeast-1.amazonaws.com</value>
  </property>  
  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
</configuration>

One detail worth noting:

It is best to configure the same fs.defaultFS for Hadoop and Hive. Also, when fs.defaultFS carries a bucket name, e.g. s3a://mybucket, the credential properties must use the per-bucket form: fs.s3a.secret.key becomes fs.s3a.bucket.mybucket.secret.key, and likewise for fs.s3a.access.key; otherwise the secret key and access key will not be found.

Starting Hadoop + Hive

1. Start Hadoop. Any errors in the output come from HDFS and can be ignored in this S3-backed setup:

~/Documents/java/hadoop-3.2.3/sbin/start-all.sh

Then open localhost:8088 (the YARN ResourceManager web UI).

2. Start Hive:

~/Documents/java/apache-hive-3.1.2-bin/bin/hive
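A quick smoke test once both are up (hedged: these commands assume the install paths used throughout this article and a reachable bucket named mybucket):

```shell
# Expect NameNode, DataNode, ResourceManager and NodeManager (plus Jps) here.
jps
# List the bucket root through S3A to confirm the credentials and endpoint work.
~/Documents/java/hadoop-3.2.3/bin/hadoop fs -ls s3a://mybucket/
```

Inside the Hive CLI, a simple `show databases;` confirms the metastore connection.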

Original-work notice: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, contact cloudcommunity@tencent.com for removal.
