
Building Your First Flink Application: WordCount

Author: Eights
Published: 2020-07-10 11:59:26
Column: Eights做数据

This article is roughly 5,143 characters; reading time about 13 minutes.

Experience Flink's hello world.

Use Maven to scaffold a first Flink WordCount application, then package it, upload it to a Flink standalone cluster, and run it.

1. Purpose of This Document

  • Generate a Flink template application with Maven
  • Develop the WordCount application
  • Package the job and run it on a standalone Flink cluster

2. Creating the Maven Project

In the directory where the project should live, generate a Maven project from the Flink quickstart archetype:

mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.10.1

Running this command prompts for the Maven project's groupId, artifactId, and version; enter them as prompted.
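If you prefer a non-interactive run, the same archetype accepts the project coordinates as properties in batch mode. The groupId/artifactId/version values below are examples only; substitute your own:

```shell
# Batch-mode (-B) variant of the interactive command above.
# Coordinates after the archetype properties are example values.
mvn archetype:generate -B \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.10.1 \
  -DgroupId=com.eights \
  -DartifactId=flink-wordcount \
  -Dversion=1.0-SNAPSHOT \
  -Dpackage=com.eights
```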

Import the project into IDEA, add the flink-scala dependencies, and remove the provided scope from the Flink Java dependencies the template declares (so the job can run directly inside the IDE):

    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
      <version>${flink.version}</version>
    </dependency>

Add the Scala compiler plugin, plus the assembly plugin so that mvn package produces a fat jar:

      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.4.6</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.0.0</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

3. Scala

StreamingWordCount

Local debugging
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}

object StreamingWordCount {

  val HOST:String = "localhost"
  val PORT:Int = 9001

  /**
   * stream word count
   * @param args input params
   */
  def main(args: Array[String]): Unit = {

    //get streaming env
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //get socket text stream
    val wordsDstream: DataStream[String] = env.socketTextStream(HOST, PORT)

    import org.apache.flink.api.scala._

    //word count
    val wordRes: DataStream[(String, Int)] = wordsDstream.flatMap(_.split(","))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .sum(1)

    wordRes.print()
    
    env.execute("Flink Streaming WordCount!")
  }
}

Start the application, then open a socket in a terminal to feed it words:

nc -lk 9001

Type a stream of comma-separated words into the terminal, and the running counts appear in the streaming application's console.

That completes local debugging of the streaming word count.

Running on the Cluster

Start the cluster using the Flink 1.10.1 build compiled in the earlier article:

./bin/start-cluster.sh

Visit localhost:8081 to open the Flink web UI.

Under Submit New Job, upload the packaged application jar (running mvn package is enough to produce it), then click Submit to run the job.
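As an alternative to the web UI, the job can also be submitted with the Flink CLI that ships in the same distribution. The main class here is the StreamingJob from the Java section; the jar path and name are examples, so match them to your own artifactId:

```shell
# Submit the fat jar built by mvn package (jar name is an example)
./bin/flink run -c com.eights.StreamingJob \
  /path/to/flink-wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar
```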

Type comma-separated words into the terminal.

The counts show up under Task Managers → Stdout in the web UI.
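If the web UI is inconvenient, the same stdout can usually be tailed directly from the task manager's .out file in the distribution's log directory (the exact file name depends on user, host, and process number, so the glob below is an assumption about your setup):

```shell
# Task manager stdout lands in a .out file under log/
tail -f log/flink-*-taskexecutor-*.out
```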

BatchWordCount

import org.apache.flink.api.scala.ExecutionEnvironment

object BatchWordCount {

  /**
   * batch word count
   *
   * @param args input params
   */
  def main(args: Array[String]): Unit = {

    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    import org.apache.flink.api.scala._

    val words: DataSet[String] = env.fromElements("spark,flink,hbase", "impala,hbase,kudu", "flink,flink,flink")

    //word count
    val wordRes: AggregateDataSet[(String, Int)] = words.flatMap(_.split(","))
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    wordRes.print()
  }
}

Running it prints the count for each word.

4. Java

BatchWordCount

package com.eights;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import org.apache.flink.util.StringUtils;

public class BatchJob {

    public static void main(String[] args) throws Exception {
        // set up the batch execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSource<String> words = env.fromElements("spark,flink,hbase", "impala,hbase,kudu", "flink,flink,flink");

        AggregateOperator<Tuple2<String, Integer>> wordCount = words.flatMap(new WordLineSplitter())
                .groupBy(0)
                .sum(1);

        wordCount.print();

    }

    public static final class WordLineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {

        @Override
        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) {
            String[] wordsArr = s.split(",");

            for (String word : wordsArr) {
                if (!StringUtils.isNullOrWhitespaceOnly(word)) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }

        }
    }
}

Running it prints the same word counts as the Scala version.
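Before involving Flink at all, the pipeline's logic (split on commas, drop blanks, group by word, sum) can be sanity-checked with plain Java streams. The LocalWordCount class below is illustrative only, not part of the generated project:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {

    // Mirrors the Flink pipeline: split each line on commas,
    // drop blank tokens, then count occurrences per word.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(",")))   // flatMap step
                .filter(w -> !w.trim().isEmpty())                  // drop blanks
                .collect(Collectors.groupingBy(w -> w,             // groupBy(0)
                        Collectors.counting()));                   // sum(1)
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(List.of(
                "spark,flink,hbase", "impala,hbase,kudu", "flink,flink,flink"));
        System.out.println(counts); // word -> count (map order unspecified)
    }
}
```

With the sample input used throughout this article, flink should come out to 4 and hbase to 2, matching what the Flink jobs print.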

StreamingWordCount

package com.eights;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.apache.flink.util.StringUtils;

public class StreamingJob {

    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String HOST = "localhost";
        int PORT = 9001;

        DataStreamSource<String> wordsSocketStream = env.socketTextStream(HOST, PORT);

        SingleOutputStreamOperator<Tuple2<String, Integer>> wordRes = wordsSocketStream.flatMap(new WordsLineSplitter())
                .keyBy(0)
                .sum(1);

        wordRes.print();

        // execute program
        env.execute("Flink Streaming Java API Word Count");
    }

    private static class WordsLineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) {
            String[] wordsArr = s.split(",");

            for (String word : wordsArr) {
                if (!StringUtils.isNullOrWhitespaceOnly(word)) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}

The run output follows the same pattern as the batch version, with counts updating as words arrive.

P.S.

I write these docs mainly as notes to myself: recording my learning path through the big-data stack, the pitfalls, and how I worked through them. The plan is to keep up two posts a week; a year from now I'll look back on the road.

Originally published 2020-06-16 on the WeChat public account Eights做数据.
