前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Introducing Spark-Kafka integration for realtime Kafka SQL queries

Introducing Spark-Kafka integration for realtime Kafka SQL queries

作者头像
用户2936994
发布2019-07-28 14:52:00
4420
发布2019-07-28 14:52:00
举报
文章被收录于专栏:祝威廉祝威廉

Apache Kafka has been all the rage for the key join of the data pipeline. But in most cases, we only treat Kafka as a stream source or a message queue. This means if you wanna do some AdHoc query, you need to sync the data to HDFS or other storage firstly.

People may have forgotten that Kafka is really good at high throughput since Kafka makes full use of both parallel consuming and sequential reading.

In order to satisfy the use cases around ad hoc analytics, data exploration and trend discovery based on Kafka directly, a new project called spark-adhoc-kafkais open sourced.

With this project, what you can do include:

  1. Treat Kafka topics/streams as tables;
  2. Support for SQL
  3. Support complex joins(join other Kafka topics or other tables stored in any place)
  4. Support for MLSQLand Spark

Notice that you can speed up the ad hoc query in spark-adhoc-kafkaby :

  1. specify the startingOffsets and endingOffsets to narrow the number of records Spark is going to fetch.
  2. specify the startingTime and endingTime to narrow the number of records Spark is going to fetch
  3. specify multiplyFactor or maxSizePerPartitions to control the parallelism of data fetching.

How spark-adhoc-kafka works? The default Kafka implementation of consuming model in Spark is like this:

default model

Both Kafka and Spark have the concept of partition. A partition means a task or a collection of data. To change the number of Kafka partition is a really heavy operation, and the Spark will start the same number of tasks to consume data from Kafka. Suppose Kafka have three partitions, and Spark will start three tasks and consume all data in the Kafka partitions and map them to spark partitions then do the processing. If we have 100 cores in Spark, and we only use three cores, what a waste! And this also really slow down the query performance.

new model

本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2019.07.27 ,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档