
A Real-Time Clickstream Case Based on Spark Streaming + Kafka + HBase

Background

Kafka ingests data in real time from collection tools such as Flume or from real-time business-system interfaces, and as a message-buffering component it provides a reliable data feed for the downstream real-time computing framework. Since version 1.3, Spark supports two mechanisms for integrating with Kafka (the Receiver-based Approach and the Direct Approach); see the official documentation links at the end of this article for details. HBase is used for storage.
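For orientation, the two integration styles differ mainly in how the input stream is created. Below is a minimal sketch of both calls; it assumes an already-constructed StreamingContext named ssc, and the topic and group names are illustrative:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Receiver-based: consumes through ZooKeeper with a consumer group;
// offsets are tracked in ZooKeeper and a long-running receiver occupies one core
val receiverStream = KafkaUtils.createStream(
  ssc, "hadoop1:2181", "pageview-group", Map("user_events" -> 1))

// Direct: queries partition offsets straight from the brokers; no receiver,
// and each Kafka partition maps 1:1 onto an RDD partition
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "hadoop1:9092,hadoop2:9092,hadoop3:9092"),
  Set("user_events"))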

Implementation Approach

  • Implement a Kafka message producer simulator
  • Use Spark Streaming with the Direct Approach to consume data from Kafka in real time
  • Have Spark Streaming perform the business computation and write the results to HBase

Local VM Cluster Environment

Because my machine has limited resources, the Hadoop/ZooKeeper/Kafka clusters are co-located on three hosts named hadoop1, hadoop2, and hadoop3; HBase runs as a single node on hadoop1.

Shortcomings

The code has a few design flaws; in particular, the logic that persists Spark Streaming results to HBase still leaves room for performance optimization (for example, reusing connections across batches rather than opening one per partition).

Code Implementation

Kafka Message Simulator

package clickstream
import java.util.{Properties, Random, UUID}
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
import org.codehaus.jettison.json.JSONObject

object KafkaMessageGenerator {
  private val random = new Random()
  private var pointer = -1
  private val os_type = Array(
    "Android", "IPhone OS",
    "None", "Windows Phone")

  // Random click count in [0, 10)
  def click(): Int = random.nextInt(10)

  // Cycle through the device types round-robin
  def getOsType(): String = {
    pointer = (pointer + 1) % os_type.length
    os_type(pointer)
  }

  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    // Kafka broker list on the local VMs
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while(true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", UUID.randomUUID())//随机生成用户id
        .put("event_time", System.currentTimeMillis.toString) //记录时间发生时间
        .put("os_type", getOsType) //设备类型
        .put("click_count", click) //点击次数

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}
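The simulator above uses the legacy Scala producer API (kafka.producer.Producer), which matches a 0.8.x broker. If you run Kafka 0.9 or later, the new Java client is the usual choice; a minimal equivalent sketch (the message body here is a hard-coded stand-in for the generated event):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object NewApiProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "hadoop1:9092,hadoop2:9092,hadoop3:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val message = """{"uid":"demo-uid","event_time":"0","os_type":"Android","click_count":3}"""
    // Fire-and-forget send to the same topic the simulator uses
    producer.send(new ProducerRecord[String, String]("user_events", message))
    producer.close()
  }
}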

Spark Streaming Main Class

package clickstream
import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}


object PageViewStream {
  def main(args: Array[String]): Unit = {
    var masterUrl = "local[2]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("PageViewStream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events") // must match the producer's topic
    // Kafka broker list on the local VMs
    val brokers = "hadoop1:9092,hadoop2:9092,hadoop3:9092"
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })
    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        // Create the HBase configuration and table handle once per partition,
        // not once per record
        val tableName = "PageViewStream"
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set("hbase.zookeeper.quorum", "hadoop1")
        hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
        hbaseConf.set("hbase.defaults.for.version.skip", "true")
        val statTable = new HTable(hbaseConf, TableName.valueOf(tableName))
        // Buffer puts client-side and flush them in one batch
        statTable.setAutoFlush(false, false)
        statTable.setWriteBufferSize(3 * 1024 * 1024)
        partitionOfRecords.foreach(pair => {
          val uid = pair._1   // user id
          val click = pair._2 // click count
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(click))
          statTable.put(put)
        })
        // Flush the buffered puts and release the connection
        statTable.flushCommits()
        statTable.close()
      })
    })
    ssc.start()
    ssc.awaitTermination()

  }

}
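The streaming job assumes the target table already exists with the Stat column family; the code above only writes to it. A minimal sketch of creating the table with the HBase 0.96 admin API (one-off setup, run once before starting the job):

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.HBaseAdmin

object CreatePageViewTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "hadoop1")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    val admin = new HBaseAdmin(conf)
    // Table and column family names must match what PageViewStream writes
    if (!admin.tableExists("PageViewStream")) {
      val desc = new HTableDescriptor(TableName.valueOf("PageViewStream"))
      desc.addFamily(new HColumnDescriptor("Stat"))
      admin.createTable(desc)
    }
    admin.close()
  }
}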

Maven POM File

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.guofei.spark</groupId>
  <artifactId>RiskControl</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>RiskControl</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <!--Spark core 及 streaming -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <!-- Spark整合Kafka-->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>

    <!-- 整合Hbase-->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>0.96.2-hadoop2</version>
      <type>pom</type>
    </dependency>
    <!--Hbase依赖 -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.3.2</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1.3</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>

    <dependency>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
      <version>2.5.0</version>
    </dependency>
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
      <version>3.6.6.Final</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-protocol</artifactId>
      <version>0.96.2-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>zookeeper</artifactId>
      <version>3.4.5</version>
    </dependency>
    <dependency>
      <groupId>org.cloudera.htrace</groupId>
      <artifactId>htrace-core</artifactId>
      <version>2.01</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-jaxrs</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-xc</artifactId>
      <version>1.9.13</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>1.6.4</version>
    </dependency>

    <!-- Hadoop依赖包-->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>commons-configuration</groupId>
      <artifactId>commons-configuration</artifactId>
      <version>1.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-auth</artifactId>
      <version>2.6.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.4</version>
    </dependency>

    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>

    <dependency>
      <groupId>org.codehaus.jettison</groupId>
      <artifactId>jettison</artifactId>
      <version>1.1</version>
    </dependency>

    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-pool2</artifactId>
      <version>2.2</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-make:transitive</arg>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

FAQ

  • Maven fails to resolve json-lib: Failure to find net.sf.json-lib:json-lib:jar:2.3 in http://repo.maven.apache.org/maven2 was cached in the local repository. Fix (see http://stackoverflow.com/questions/4173214/maven-missing-net-sf-json-lib): declare the dependency with the jdk15 classifier:

    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>
  • Running the Spark Streaming job throws org.apache.spark.SparkException: Task not serializable. Every object referenced inside the closures below is shipped to the executors, so it must be serializable; the usual fix is sketched after this item:

    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(
          // objects referenced in here must be serializable
        )
      })
    })
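The usual fix is to create non-serializable resources (configurations, connections, table handles) inside foreachPartition, on the executor, rather than capturing them from the driver. A minimal sketch of the pattern, assuming pairs of (uid, clickCount) and the same imports as the main class above:

    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partition => {
        // Built here, on the executor, so it is never serialized
        val hbaseConf = HBaseConfiguration.create()
        val table = new HTable(hbaseConf, TableName.valueOf("PageViewStream"))
        partition.foreach { case (uid, clicks) =>
          val put = new Put(Bytes.toBytes(uid))
          put.add("Stat".getBytes, "ClickStat".getBytes, Bytes.toBytes(clicks))
          table.put(put)
        }
        table.close()
      })
    })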

  • Maven packaging fails with a missing dependency: error: not found: object kafka on import kafka.javaapi.producer.Producer. Cause: on the local Windows 10 machine, the local repository path 用户/xxx/.m2/ (the Chinese-named Users folder) contained Chinese characters, which broke dependency resolution.

References

  • Spark Streaming official documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html
  • Spark Streaming + Kafka integration guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
  • Spark Streaming + Flume integration guide: http://spark.apache.org/docs/latest/streaming-flume-integration.html
  • Spark Streaming custom receivers guide: http://spark.apache.org/docs/latest/streaming-custom-receivers.html
  • Official Spark Streaming Scala examples: https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming
  • 简单之美 blog: http://shiyanjun.cn/archives/1097.html

Author: MichaelFly. Original post: https://www.jianshu.com/p/ccba410462ba


This article was shared from the WeChat public account 大数据技术与架构 (import_bigdata).


Originally published: 2020-01-08
