
Integrating Spark with Hive and HBase for real-time data insertion and real-time query analysis

    Preliminaries

        Versions used: Spark 2.0.1, Hive 1.2.1, HBase 1.2.4, Hadoop 2.6.0, ZooKeeper 3.4.9.

        Installing each dependency is not covered here; see earlier posts or search online if needed. This article focuses on how to configure the integration.

HBase

        HBase needs no special configuration; just start it as usual (a start sketch follows).
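        A minimal start sketch, assuming HBase lives at /opt/hbase/hbase-1.2.4 (the same HBASE_HOME used in hive-env.sh below) and the standalone ZooKeeper ensemble from the version list is already running:

# start HBase on the master node (path assumed from the HBASE_HOME used later)
/opt/hbase/hbase-1.2.4/bin/start-hbase.sh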

Hadoop

        Hadoop needs no special configuration either; just start it as usual (see the sketch below).
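        A minimal sketch, assuming $HADOOP_HOME points at the Hadoop 2.6.0 install and HDFS has already been formatted:

# start HDFS and YARN with the standard scripts
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh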

Hive

Edit hive-env.sh and add the HBASE_HOME variable:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hive and Hadoop environment variables here. These variables can be used
# to control the execution of Hive. It should be used by admins to configure
# the Hive installation (so that users do not have to set environment variables
# or set command line parameters to get correct behavior).
#
# The hive service being invoked (CLI/HWI etc.) is available via the environment
# variable SERVICE


# Hive Client memory usage can be an issue if a large number of clients
# are running at the same time. The flags below have been useful in 
# reducing memory usage:
#
# if [ "$SERVICE" = "cli" ]; then
#   if [ -z "$DEBUG" ]; then
#     export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
#   else
#     export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
#   fi
# fi

# The heap size of the jvm stared by hive shell script can be controlled via:
#
# export HADOOP_HEAPSIZE=1024
#
# Larger heap size may be required when running queries over large number of files or partitions. 
# By default hive shell scripts use a heap size of 256 (MB).  Larger heap size would also be 
# appropriate for hive server (hwi etc).


# Set HADOOP_HOME to point to a specific hadoop install directory
export HADOOP_HOME=${HADOOP_HOME}
export HBASE_HOME=/opt/hbase/hbase-1.2.4
# export HIVE_CLASSPATH=$HIVE_CLASSPATH:/opt/hive/apache-hive-1.2.1-bin/lib/*

# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=${HIVE_HOME}/conf

# Folder containing extra ibraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=

        Edit hive-site.xml and add the HBase-related configuration:

<property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop-n,hadoop-d1,hadoop-d2</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
    <description>
        Property from ZooKeeper's config zoo.cfg.
        The port at which the clients will connect.
    </description>
</property>
<property>
    <name>hbase.master</name>
    <value>hadoop-n:60000</value>
</property>

Spark

Copy the jars listed below from the HBase installation directory into Spark's classpath. Do not take the shortcut of adding HBase's classpath in spark-env.sh; that will prevent Spark from starting. A copy-command sketch follows the list.

hbase-protocol
hbase-common
hbase-client
hbase-server
hive-hbase-handler-2.1.0
htrace-core
metrics-core
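A minimal copy sketch, assuming $HBASE_HOME, $HIVE_HOME and $SPARK_HOME point at the installs listed at the top, and using wildcards instead of pinning exact jar versions. Note that hive-hbase-handler-*.jar actually ships with Hive ($HIVE_HOME/lib) rather than HBase:

# Spark 2.x picks up classpath jars from $SPARK_HOME/jars
cd $SPARK_HOME/jars
cp $HBASE_HOME/lib/hbase-protocol-*.jar .
cp $HBASE_HOME/lib/hbase-common-*.jar .
cp $HBASE_HOME/lib/hbase-client-*.jar .
cp $HBASE_HOME/lib/hbase-server-*.jar .
cp $HBASE_HOME/lib/htrace-core-*.jar .
cp $HBASE_HOME/lib/metrics-core-*.jar .
cp $HIVE_HOME/lib/hive-hbase-handler-*.jar .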

Test

1. Create a table in HBase and insert three rows:

create 'hbase_test',{NAME=>'cf1'}
put 'hbase_test','a','cf1:v1','1'
put 'hbase_test','b','cf1:v1','2'
put 'hbase_test','c','cf1:v1','3'
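To confirm the rows landed before wiring up Hive, an optional sanity check from the HBase shell:

# should list rows a, b and c under cf1:v1
scan 'hbase_test'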

        2. Create the mapped external table in Hive:

create external table hbase_test(key string,value string) 
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:v1") 
TBLPROPERTIES("hbase.table.name" = "hbase_test");
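        Before moving to Spark, you can verify the mapping from the Hive CLI itself; this is only a sanity check, assuming Hive can already reach ZooKeeper with the settings above, and the three rows put into HBase (a/1, b/2, c/3) should come back:

cd $HIVE_HOME/bin
./hive
hive> select * from hbase_test;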

        3. Start spark-sql and query the table:

cd $SPARK_HOME/bin
./spark-sql
spark-sql> select * from hbase_test;
16/11/18 11:20:48 INFO execution.SparkSqlParser: Parsing command: select * from hbase_test
16/11/18 11:20:49 INFO parser.CatalystSqlParser: Parsing command: string
16/11/18 11:20:49 INFO parser.CatalystSqlParser: Parsing command: string
16/11/18 11:20:49 INFO parser.CatalystSqlParser: Parsing command: string
16/11/18 11:20:49 INFO parser.CatalystSqlParser: Parsing command: string
16/11/18 11:20:49 INFO memory.MemoryStore: Block broadcast_7 stored as values in memory (estimated size 222.0 KB, free 365.5 MB)
16/11/18 11:20:49 INFO memory.MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 21.4 KB, free 365.5 MB)
16/11/18 11:20:49 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.5.3.100:39358 (size: 21.4 KB, free: 366.2 MB)
16/11/18 11:20:49 INFO spark.SparkContext: Created broadcast 7 from processCmd at CliDriver.java:376
16/11/18 11:20:50 INFO hbase.HBaseStorageHandler: Configuring input job properties
16/11/18 11:20:50 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x165634aa connecting to ZooKeeper ensemble=localhost:2181
16/11/18 11:20:50 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x165634aa0x0, quorum=localhost:2181, baseZNode=/hbase
16/11/18 11:20:50 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/11/18 11:20:50 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
16/11/18 11:20:50 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x158751d4c19000d, negotiated timeout = 40000
16/11/18 11:20:50 INFO util.RegionSizeCalculator: Calculating region sizes for table "hbase_test".
16/11/18 11:20:50 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/11/18 11:20:50 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x158751d4c19000d
16/11/18 11:20:50 INFO zookeeper.ZooKeeper: Session: 0x158751d4c19000d closed
16/11/18 11:20:50 INFO zookeeper.ClientCnxn: EventThread shut down
16/11/18 11:20:50 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
16/11/18 11:20:50 INFO scheduler.DAGScheduler: Got job 3 (processCmd at CliDriver.java:376) with 1 output partitions
16/11/18 11:20:50 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (processCmd at CliDriver.java:376)
16/11/18 11:20:50 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/11/18 11:20:50 INFO scheduler.DAGScheduler: Missing parents: List()
16/11/18 11:20:50 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[23] at processCmd at CliDriver.java:376), which has no missing parents
16/11/18 11:20:50 INFO memory.MemoryStore: Block broadcast_8 stored as values in memory (estimated size 15.2 KB, free 365.5 MB)
16/11/18 11:20:50 INFO memory.MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 8.3 KB, free 365.5 MB)
16/11/18 11:20:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.5.3.100:39358 (size: 8.3 KB, free: 366.2 MB)
16/11/18 11:20:50 INFO spark.SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1012
16/11/18 11:20:50 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[23] at processCmd at CliDriver.java:376)
16/11/18 11:20:50 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/11/18 11:20:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, 10.5.3.101, partition 0, ANY, 5544 bytes)
16/11/18 11:20:50 INFO cluster.CoarseGrainedSchedulerBackend$DriverEndpoint: Launching task 4 on executor id: 1 hostname: 10.5.3.101.
16/11/18 11:20:50 INFO storage.BlockManagerInfo: Added broadcast_8_piece0 in memory on 10.5.3.101:57818 (size: 8.3 KB, free: 366.3 MB)
16/11/18 11:20:50 INFO storage.BlockManagerInfo: Added broadcast_7_piece0 in memory on 10.5.3.101:57818 (size: 21.4 KB, free: 366.3 MB)
16/11/18 11:20:51 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 4) in 509 ms on 10.5.3.101 (1/1)
16/11/18 11:20:51 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
16/11/18 11:20:51 INFO scheduler.DAGScheduler: ResultStage 4 (processCmd at CliDriver.java:376) finished in 0.511 s
16/11/18 11:20:51 INFO scheduler.DAGScheduler: Job 3 finished: processCmd at CliDriver.java:376, took 0.611485 s
a	1
b	2
c	3
Time taken: 2.33 seconds, Fetched 3 row(s)
16/11/18 11:20:51 INFO CliDriver: Time taken: 2.33 seconds, Fetched 3 row(s)
spark-sql> 
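The "real-time" part of the title comes from the fact that the Hive table is only a mapping over HBase, so new data written to HBase is visible to the very next spark-sql query without any reload. A small sketch (the row key 'd' is made up for illustration):

# in the HBase shell
put 'hbase_test','d','cf1:v1','4'

# back in the same spark-sql session -- the new row appears immediately
spark-sql> select * from hbase_test;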

Note

        All of the dependencies in this example run on three virtual machines with only 2 GB of RAM each, so this setup is suitable only for verifying the software flow, not for performance testing. None of the figures shown in this article should be used as a basis for performance conclusions.
