Elastic MapReduce: Hive Connection Methods

This document describes four ways to connect to Hive in Elastic MapReduce (EMR): the Hive client, Beeline, Java, and Python.
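Preparations
Make sure that you have activated Tencent Cloud and created an EMR cluster. For details, see Creating a Cluster.
When creating the EMR cluster, select the Hive component on the software configuration page.
Note:
To log in to an EMR node, see Logging In to a Linux Instance. On the cluster details page, choose Cluster Resources > Resource Management, click the resource ID of the target node to open the CVM instance list, and then click Log In on the right to log in to the instance through WebShell.
The default username for logging in to a Linux instance is root, and the password is the one you set when creating the EMR cluster. After the password is accepted, the command line interface opens.
All operations in this document are performed as the hadoop user. After logging in to the command line interface, run su hadoop to switch users.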
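Connecting to Hive
The Hive service is deployed on the Master node by default. You can also deploy HiveServer2 on a Router node as described in Cluster Scale-Out. This document uses the Master node as the example.
Method 1: Using the Hive client
Log in to the Master node of the EMR cluster, switch to the hadoop user, and run the following command to enter the Hive command line: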
hive
You can also use the -H (or --help) option to print basic usage information for the hive command.
Method 2: Connecting to HiveServer2 through Beeline
Log in to the Master node of the EMR cluster and connect to Hive with the beeline command:
beeline -u "jdbc:hive2://${hs2_ip}:${hs2_port}" -n hadoop
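Note:
1. ${hs2_ip} is the private IP address of the node where the cluster's HiveServer2 service is deployed. You can find it on the cluster details page under Cluster Services > Hive > Role Management.
2. ${hs2_port} is the port of the cluster's HiveServer2. The default is 7001. You can find it on the cluster details page under Cluster Services > Hive > Configuration Management, in the hive.server2.thrift.port item of the hive-site.xml configuration file.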
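The two placeholders can also be filled in by a small script. The following is a minimal Python sketch and not part of the EMR tooling: the helper names hs2_jdbc_url and beeline_argv and the address 10.0.0.5 are made up for illustration, while the URL shape, the -u and -n options, and the default port 7001 come from the command and notes above.

```python
import shlex

DEFAULT_HS2_PORT = 7001  # EMR default for hive.server2.thrift.port, per the note above


def hs2_jdbc_url(host, port=DEFAULT_HS2_PORT, database=""):
    """Build a HiveServer2 JDBC URL of the form jdbc:hive2://host:port[/database]."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    url = f"jdbc:hive2://{host}:{port}"
    return f"{url}/{database}" if database else url


def beeline_argv(host, port=DEFAULT_HS2_PORT, user="hadoop"):
    """Return the argument list for the beeline command shown above."""
    return ["beeline", "-u", hs2_jdbc_url(host, port), "-n", user]


if __name__ == "__main__":
    # 10.0.0.5 is a hypothetical private IP; use the address of your HiveServer2 node.
    print(shlex.join(beeline_argv("10.0.0.5")))
```

Returning an argument list rather than one string means the command can be handed to subprocess.run without shell quoting concerns.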
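Method 3: Connecting to Hive through Java
This document uses Maven to manage the example project. Maven is a project management tool that handles project dependencies for you: it fetches the required jar packages according to the pom.xml configuration, so you do not need to add them manually.
First, download and install Maven locally and configure its environment variables. If you use an IDE, configure the Maven settings in the IDE.
In a local shell, go to the directory where you want to create the project, for example /tmp/mavenWorkplace, and run the following command to create a Maven project: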
mvn archetype:generate -DgroupId=${yourgroupID} -DartifactId=${yourartifactID} -DarchetypeArtifactId=maven-archetype-quickstart
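Note:
1. ${yourgroupID} is your package name, and ${yourartifactID} is your project name.
2. maven-archetype-quickstart creates a Maven Java project. Some files are downloaded while the project is created, so keep the network connection available.
Of the generated files, the ones that matter here are pom.xml and the java folder under main. The pom.xml file holds the dependency and packaging configuration, and the java folder holds your source code.
First, add the project dependencies (hadoop-common and hive-jdbc) to pom.xml: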
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>${hive_version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop_version}</version>
    </dependency>
</dependencies>
Note:
${hive_version} is the Hive version of your cluster, and ${hadoop_version} is the Hadoop version of your cluster.
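One way to look up these two values is to run hive --version and hadoop version on a cluster node and read the version from the output. The following Python sketch shows how such output could be parsed. It assumes the output contains a line of the form "Hive x.y.z" or "Hadoop x.y.z"; the helper name parse_version and the sample text and version numbers are illustrative and were not taken from a real cluster.

```python
import re


def parse_version(output, product):
    """Extract the version from a line such as 'Hive 3.1.3' or 'Hadoop 3.2.2'."""
    pattern = rf"^{re.escape(product)}\s+(\d+(?:\.\d+)+)"
    match = re.search(pattern, output, re.MULTILINE)
    if match is None:
        raise ValueError(f"no {product} version found")
    return match.group(1)


if __name__ == "__main__":
    # Sample text standing in for real command output (assumed format).
    sample_hive = "Hive 3.1.3\nGit git://example -r 0000000\n"
    sample_hadoop = "Hadoop 3.2.2\nSource code repository ...\n"
    print("hive_version   =", parse_version(sample_hive, "Hive"))
    print("hadoop_version =", parse_version(sample_hadoop, "Hadoop"))
```

Whatever values you obtain go into the two version elements of the dependency block above.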
Next, add the packaging and compiler plugins to pom.xml:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.8</source>
        <target>1.8</target>
        <encoding>utf-8</encoding>
      </configuration>
    </plugin>
    <plugin>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <id>make-assembly</id>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
Right-click src > main > java to create a new Java class, and enter your class name. This example uses App.java. Add the following sample code to the class:
package org.example;

import java.sql.*;

/**
 * Created by tencent on 2023/8/11.
 */
public class App {
    private static final String DRIVER_NAME = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            // Load the hive-jdbc driver
            Class.forName(DRIVER_NAME);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        // Replace ${hs2_ip} and ${hs2_port} with the HiveServer2 IP address and port of your cluster
        String url = "jdbc:hive2://${hs2_ip}:${hs2_port}/default";
        // Get a connection from the URL, username, and password, then create a Statement.
        // conn.prepareStatement(sql) precompiles the SQL and guards against SQL injection, but it is
        // mainly meant for parameterized statements; for running a series of different statements,
        // a plain Statement as used here is recommended.
        // try-with-resources closes the statement and the connection on exit.
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement()) {
            // Create a table, insert rows, and query them
            String tableName = "hive_test";
            stmt.execute("drop table if exists " + tableName);
            stmt.execute("create table " + tableName + " (key int, value string)");
            System.out.println("Create table success!");
            // show tables
            String sql = "show tables '" + tableName + "'";
            System.out.println("Running: " + sql);
            ResultSet res = stmt.executeQuery(sql);
            if (res.next()) {
                System.out.println(res.getString(1));
            }
            // describe table
            sql = "describe " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getString(1) + "\t" + res.getString(2));
            }
            // insert two rows
            sql = "insert into " + tableName + " values (42,\"hello\"),(48,\"world\")";
            stmt.execute(sql);
            // print every row
            sql = "select * from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getInt(1) + "\t" + res.getString(2));
            }
            // print the row count
            sql = "select count(1) from " + tableName;
            System.out.println("Running: " + sql);
            res = stmt.executeQuery(sql);
            while (res.next()) {
                System.out.println(res.getString(1));
            }
        }
    }
}
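The program first connects to HiveServer2 and creates a table named hive_test in the default database. It then inserts two rows into the table and prints the table's contents and its row count.
If Maven is configured correctly and the dependencies were imported successfully, the project compiles as is. In a local shell, go to the project directory and run the following command to package the project: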
mvn clean package -DskipTests
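Some files may still need to be downloaded while the command runs. Packaging has succeeded when BUILD SUCCESS appears, and the packaged jar files are then in the target folder of the project directory.
Uploading and running the program
First, upload the packaged jar to the EMR cluster with scp or sftp. Run the following command in a local shell (enter yes when prompted, then enter the password):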
scp ${localfile} root@${master_public_ip}:/home/hadoop/
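Note:
1. ${localfile} is the path and name of your local file, and root is the CVM server username.
2. ${master_public_ip} is the public IP address of your cluster's Master node. You can find it in the node information in the EMR console or in the CVM console.
Upload the packaged jar to the /home/hadoop/ directory of the EMR cluster. After the upload, check from the EMR command line that the file is in that directory. Be sure to upload the jar that includes the dependencies (the file whose name ends in jar-with-dependencies.jar).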
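Because the target folder contains more than one jar after packaging, it can help to select the one with dependencies programmatically before uploading. The following is a minimal Python sketch under the assumption that the assembly plugin's jar-with-dependencies descriptor keeps its default file name suffix; find_fat_jar is a hypothetical helper, and the demo builds a throwaway directory instead of reading a real target folder.

```python
import tempfile
from pathlib import Path

SUFFIX = "-jar-with-dependencies.jar"  # suffix produced by the descriptor configured above


def find_fat_jar(target_dir):
    """Return the single *-jar-with-dependencies.jar inside target_dir."""
    jars = sorted(Path(target_dir).glob("*" + SUFFIX))
    if len(jars) != 1:
        raise FileNotFoundError(
            f"expected exactly one *{SUFFIX} in {target_dir}, found {len(jars)}")
    return jars[0]


if __name__ == "__main__":
    # Demo on a temporary directory that mimics Maven's target folder.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "hive-test-1.0-SNAPSHOT.jar").touch()
        Path(tmp, "hive-test-1.0-SNAPSHOT-jar-with-dependencies.jar").touch()
        print(find_fat_jar(tmp).name)
```

In a real project you would pass the path of the target folder and hand the returned path to scp or sftp.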
Log in to the EMR cluster, switch to the hadoop user, and go to the /home/hadoop/ directory. Run the program:
yarn jar ./hive-test-1.0-SNAPSHOT-jar-with-dependencies.jar org.example.App
Note:
./hive-test-1.0-SNAPSHOT-jar-with-dependencies.jar is the path and name of your jar, and org.example.App is the package name and class name of the Java class created earlier.
The output is as follows:
Create table success!
Running: show tables 'hive_test'
hive_test
Running: describe hive_test
key     int
value   string
Running: select * from hive_test
42      hello
48      world
Running: select count(1) from hive_test
2
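Method 4: Connecting to Hive through Python
This example uses the PyHive project to connect to Hive from Python 3.
First, log in to the Master node of the EMR cluster, switch to the root user, go to the /usr/local/service/hive/ directory, and run the following commands to install the required tools and dependencies: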
pip3 install sasl
pip3 install thrift
pip3 install thrift-sasl
pip3 install pyhive
After the installation, switch back to the hadoop user. Then create a Python file named hivetest.py in the /usr/local/service/hive/ directory and add the following code:
from pyhive import hive

# Replace ${hs2_host} and ${hs2_port} with your HiveServer2 host and port
conn = hive.connect(host='${hs2_host}',
                    port=int('${hs2_port}'),
                    username='hadoop',
                    password='hadoop',
                    database='default',
                    auth='CUSTOM')

tablename = 'HiveByPython'
cur = conn.cursor()

# List the tables that already exist in the default database
print("\n")
print('show the tables in default: ')
cur.execute('show tables')
for i in cur.fetchall():
    print(i)

# Recreate the test table
cur.execute('drop table if exists ' + tablename)
cur.execute('create table ' + tablename + ' (key int,value string)')

print("\n")
print('show the new table: ')
cur.execute("show tables '" + tablename + "'")
for i in cur.fetchall():
    print(i)

# Insert two rows and read them back
print("\n")
print("contents from " + tablename + ":")
cur.execute('insert into ' + tablename + ' values (42,"hello"),(48,"world")')
cur.execute('select * from ' + tablename)
for i in cur.fetchall():
    print(i)

cur.close()
conn.close()
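After connecting to HiveServer2, the program first lists the tables in the default database. It then creates a table named HiveByPython (Hive stores the name in lowercase, as hivebypython), inserts two rows into it, and prints them.
Note:
1. ${hs2_host} is the host of the cluster's HiveServer2. You can find it on the cluster details page under Cluster Services > Hive > Configuration Management, in the hive.server2.thrift.bind.host item of the hive-site.xml configuration file.
2. ${hs2_port} is the port of the cluster's HiveServer2. The default is 7001. You can find it on the cluster details page under Cluster Services > Hive > Configuration Management, in the hive.server2.thrift.port item of the hive-site.xml configuration file.
Save the file and run the program: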
python3 hivetest.py
The command line prints the following output:
show the tables in default:

show the new table:
('hivebypython',)

contents from HiveByPython:
(42, 'hello')
(48, 'world')
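The examples above build SQL by string concatenation, which is fine for fixed names but risky once table names or values come from user input. The following Python sketch shows one possible hardening. It relies on PyHive cursors accepting pyformat-style (%s) parameters in execute. The helpers checked_identifier and insert_rows and the RecordingCursor stand-in are hypothetical names introduced here so that the sketch runs without a cluster; with PyHive you would pass conn.cursor() instead.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def checked_identifier(name):
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def insert_rows(cursor, table, rows):
    """Insert (key, value) pairs in one statement, passing the values as parameters."""
    placeholders = ", ".join(["(%s, %s)"] * len(rows))
    params = tuple(value for row in rows for value in row)
    sql = f"insert into {checked_identifier(table)} values {placeholders}"
    cursor.execute(sql, params)


class RecordingCursor:
    """Stand-in for a database cursor that only prints what it would execute."""

    def execute(self, sql, params=None):
        print(sql, params)


if __name__ == "__main__":
    insert_rows(RecordingCursor(), "HiveByPython", [(42, "hello"), (48, "world")])
```

Table names cannot be passed as query parameters, which is why the sketch validates them separately and leaves only the values to the driver's escaping.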
﻿