Akka(17): Stream:数据流基础组件-Source,Flow,Sink简介

    在大数据程序流行的今天,许多程序都面临着共同的难题:程序输入数据趋于无限大,抵达时间又不确定。一般的解决方法是采用回调函数(callback-function)来实现的,但这样的解决方案很容易造成“回调地狱(callback hell)”,即所谓的“goto-hell”:程序控制跳来跳去很难跟踪,特别是一些变量如果在回调函数中更改后产生不可预料的结果。数据流(stream)是一种解决问题的有效编程方式。Stream是一个抽象概念,能把程序数据输入过程和其它细节隐蔽起来,通过申明方式把数据处理过程描述出来,使整体程序逻辑更容易理解跟踪。当然,牺牲的是对一些运算细节的控制能力。我们在前面介绍过scalaz-stream,它与akka-stream的主要区别在于:

1、scalaz-stream是pull模式的,而akka-stream是push模式的。pull模式的缺点是接收数据效率问题,因为在这种模式里程序必须不断重复检测(polling)输入端口是否有数据存在。而push模式则会把数据推到输入端口后直接进入程序,但如果数据源头动作太快程序无法及时处理所有推送的数据时就会造成所谓的数据溢出问题,遗失数据。不过akka-stream实现了reactive-stream的back-pressure规范:数据发送方和接收方之间互动提示,使过快的数据产生能按接收方要求慢下来甚至暂时停下来。

2、scalaz-sstream和akka-stream的数据流都是一种申明式的数据处理流程描述,属于一种运算方案,最终都需要某种运算器来对数据流按运算方案进行具体的运算,得出运算结果和产生副作用。scalaz-stream的运算器是自备的函数式程序,特点是能很好的控制线程使用和进行并行运算。akka-stream的运算器是materializer。materializer在actor系统上运行,具备了actor模式程序的优点包括:消息驱动、集群运算、监管策略(SupervisorStrategy)等等。

akka-stream的数据流是由三类基础组件组合而成,不同的组合方式代表不同的数据处理及表达功能。三类组件分别是:

1、Source:数据源。akka-stream属于push模式,所以Source也就是Publisher(数据发布方),Source的形状SourceShape代表只有一个输出端口的形状。Source可以从单值、集合、某种Publisher或另一个数据流产生数据流的元素(stream-element),包括:

  /**
   * Helper to create [[Source]] from `Iterable`.
   * Example usage: `Source(Seq(1,2,3))`
   *
   * Starts a new `Source` from the given `Iterable`. This is like starting from an
   * Iterator, but every Subscriber directly attached to the Publisher of this
   * stream will see an individual flow of elements (always starting from the
   * beginning) regardless of when they subscribed.
   */
  def apply[T](iterable: immutable.Iterable[T]): Source[T, NotUsed] =
    single(iterable).mapConcat(ConstantFun.scalaIdentityFunction).withAttributes(DefaultAttributes.iterableSource)

  /**
   * Create a `Source` with one element.
   * Every connected `Sink` of this stream will see an individual stream consisting of one element.
   */
  def single[T](element: T): Source[T, NotUsed] =
    fromGraph(new GraphStages.SingleSource(element))

  /**
   * Helper to create [[Source]] from `Iterator`.
   * Example usage: `Source.fromIterator(() => Iterator.from(0))`
   *
   * Start a new `Source` from the given function that produces anIterator.
   * The produced stream of elements will continue until the iterator runs empty
   * or fails during evaluation of the `next()` method.
   * Elements are pulled out of the iterator in accordance with the demand coming
   * from the downstream transformation steps.
   */
  def fromIterator[T](f: () ⇒ Iterator[T]): Source[T, NotUsed] =
    apply(new immutable.Iterable[T] {
      override def iterator: Iterator[T] = f()
      override def toString: String = "() => Iterator"
    })

  /**
   * Starts a new `Source` from the given `Future`. The stream will consist of
   * one element when the `Future` is completed with a successful value, which
   * may happen before or after materializing the `Flow`.
   * The stream terminates with a failure if the `Future` is completed with a failure.
   */
  def fromFuture[T](future: Future[T]): Source[T, NotUsed] =
    fromGraph(new FutureSource(future))

  /**
   * Helper to create [[Source]] from `Publisher`.
   *
   * Construct a transformation starting with given publisher. The transformation steps
   * are executed by a series of [[org.reactivestreams.Processor]] instances
   * that mediate the flow of elements downstream and the propagation of
   * back-pressure upstream.
   */
  def fromPublisher[T](publisher: Publisher[T]): Source[T, NotUsed] =
    fromGraph(new PublisherSource(publisher, DefaultAttributes.publisherSource, shape("PublisherSource")))

  /**
   * A graph with the shape of a source logically is a source, this method makes
   * it so also in type.
   */
  def fromGraph[T, M](g: Graph[SourceShape[T], M]): Source[T, M] = g match {
    case s: Source[T, M]         ⇒ s
    case s: javadsl.Source[T, M] ⇒ s.asScala
    case other ⇒ new Source(
      LinearTraversalBuilder.fromBuilder(other.traversalBuilder, other.shape, Keep.right),
      other.shape)
  }

下面还有几个特殊的Source:

  /**
   * A `Source` with no elements, i.e. an empty stream that is completed immediately for every connected `Sink`.
   */
  def empty[T]: Source[T, NotUsed] = _empty
  private[this] val _empty: Source[Nothing, NotUsed] =
    Source.fromGraph(EmptySource)

  /**
   * Create a `Source` that will continually emit the given element.
   */
  def repeat[T](element: T): Source[T, NotUsed] = {
    val next = Some((element, element))
    unfold(element)(_ ⇒ next).withAttributes(DefaultAttributes.repeat)
  }

  /**
   * Creates [[Source]] that will continually produce given elements in specified order.
   *
   * Starts a new 'cycled' `Source` from the given elements. The producer stream of elements
   * will continue infinitely by repeating the sequence of elements provided by function parameter.
   */
  def cycle[T](f: () ⇒ Iterator[T]): Source[T, NotUsed] = {
    val iterator = Iterator.continually { val i = f(); if (i.isEmpty) throw new IllegalArgumentException("empty iterator") else i }.flatten
    fromIterator(() ⇒ iterator).withAttributes(DefaultAttributes.cycledSource)
  }

2、Sink:数据终端。属于数据元素的使用方,主要作用是消耗数据流中的元素。SinkShape是有一个输入端的数据流形状。Sink消耗流元素的例子有:

  /**
   * A `Sink` that will consume the stream and discard the elements.
   */
  def ignore: Sink[Any, Future[Done]] = fromGraph(GraphStages.IgnoreSink)

  /**
   * A `Sink` that will invoke the given procedure for each received element. The sink is materialized
   * into a [[scala.concurrent.Future]] will be completed with `Success` when reaching the
   * normal end of the stream, or completed with `Failure` if there is a failure signaled in
   * the stream..
   */
  def foreach[T](f: T ⇒ Unit): Sink[T, Future[Done]] =
    Flow[T].map(f).toMat(Sink.ignore)(Keep.right).named("foreachSink")

注意,akka-stream实际是在actor上进行运算的。actor的内部状态最终可以形成运算结果。上面的例子可以得出Sink的运算结果是Future[??]类型的。

3、Flow:数据处理节点。对通过输入端口输入数据流的元素进行转变处理(transform)后经过输出端口输出。FlowShape有一个输入端和一个输出端。

在akka-stream里数据流组件一般被称为数据流图(graph)。我们可以用许多数据流图组成更大的stream-graph。

akka-stream最简单的完整(或者闭合)线性数据流(linear-stream)就是直接把一个Source和一个Sink相接。这种方式代表一种对数据流所有元素的直接表现,如:source.runWith(Sink.foreach(println))。我们可以用Source.via来连接Flow,用Source.to连接Sink:

  override def via[T, Mat2](flow: Graph[FlowShape[Out, T], Mat2]): Repr[T] = viaMat(flow)(Keep.left)

  override def viaMat[T, Mat2, Mat3](flow: Graph[FlowShape[Out, T], Mat2])(combine: (Mat, Mat2) ⇒ Mat3): Source[T, Mat3] = {
    val toAppend =
      if (flow.traversalBuilder eq Flow.identityTraversalBuilder)
        LinearTraversalBuilder.empty()
      else
        flow.traversalBuilder

    new Source[T, Mat3](
      traversalBuilder.append(toAppend, flow.shape, combine),
      SourceShape(flow.shape.out))
  }

  /**
   * Connect this [[akka.stream.scaladsl.Source]] to a [[akka.stream.scaladsl.Sink]],
   * concatenating the processing steps of both.
   */
  def to[Mat2](sink: Graph[SinkShape[Out], Mat2]): RunnableGraph[Mat] = toMat(sink)(Keep.left)

  /**
   * Connect this [[akka.stream.scaladsl.Source]] to a [[akka.stream.scaladsl.Sink]],
   * concatenating the processing steps of both.
   */
  def toMat[Mat2, Mat3](sink: Graph[SinkShape[Out], Mat2])(combine: (Mat, Mat2) ⇒ Mat3): RunnableGraph[Mat3] = {
    RunnableGraph(traversalBuilder.append(sink.traversalBuilder, sink.shape, combine))
  }

可以发现via和to分别是viaMat和toMat的简写,分别固定了Keep.left。意思是选择左边数据流图的运算结果。我们上面提过akka-stream是在actor系统里处理数据流元素的。在这个过程中同时可以用actor内部状态来产生运算结果。via和to连接了左右两个graph,并且选择了左边graph的运算结果。我们可以用viaMat和toMat来选择右边graph运算结果。这是通过combine: (Mat,Mat2)=>Mat3这个函数实现的。akka-stream提供了一个Keep对象来表达这种选择:

/**
 * Convenience functions for often-encountered purposes like keeping only the
 * left (first) or only the right (second) of two input values.
 */
object Keep {
  private val _left = (l: Any, r: Any) ⇒ l
  private val _right = (l: Any, r: Any) ⇒ r
  private val _both = (l: Any, r: Any) ⇒ (l, r)
  private val _none = (l: Any, r: Any) ⇒ NotUsed

  def left[L, R]: (L, R) ⇒ L = _left.asInstanceOf[(L, R) ⇒ L]
  def right[L, R]: (L, R) ⇒ R = _right.asInstanceOf[(L, R) ⇒ R]
  def both[L, R]: (L, R) ⇒ (L, R) = _both.asInstanceOf[(L, R) ⇒ (L, R)]
  def none[L, R]: (L, R) ⇒ NotUsed = _none.asInstanceOf[(L, R) ⇒ NotUsed]
}

既然提到运算结果的处理方式,我们就来看看Source,Flow,Sink的类型参数:

Source[+Out, +Mat]       //Out代表元素类型,Mat为运算结果类型
Flow[-In, +Out, +Mat]    //In,Out为数据流元素类型,Mat是运算结果类型
Sink[-In, +Mat]          //In是数据元素类型,Mat是运算结果类型

Keep对象提供的是对Mat的选择。上面源代码中to,toMat函数的返回结果都是RunnableGraph[Mat3],也就是说只有连接了Sink的数据流才能进行运算。RunnableGraph提供一个run()函数来运算数据流:

/**
 * Flow with attached input and output, can be executed.
 */
final case class RunnableGraph[+Mat](override val traversalBuilder: TraversalBuilder) extends Graph[ClosedShape, Mat] {
  override def shape = ClosedShape

  /**
   * Transform only the materialized value of this RunnableGraph, leaving all other properties as they were.
   */
  def mapMaterializedValue[Mat2](f: Mat ⇒ Mat2): RunnableGraph[Mat2] =
    copy(traversalBuilder.transformMat(f.asInstanceOf[Any ⇒ Any]))

  /**
   * Run this flow and return the materialized instance from the flow.
   */
  def run()(implicit materializer: Materializer): Mat = materializer.materialize(this)
...

上面shape = ClosedShape代表RunnableGraph的形状是闭合的(ClosedShape),意思是说:一个可运行的graph所有输人输出端口都必须是连接的。

下面我们就用一个最简单的线性数据流来做些详细解释:

import akka.actor._
import akka.stream._
import akka.stream.scaladsl._
import akka._
import scala.concurrent._

object SourceDemo extends App {
  implicit val sys=ActorSystem("demo")
  implicit val mat=ActorMaterializer()
  implicit val ec=sys.dispatcher

  val s1: Source[Int,NotUsed] = Source(1 to 10)
  val sink: Sink[Any,Future[Done]] = Sink.foreach(println)
  val rg1: RunnableGraph[NotUsed] = s1.to(sink)
  val rg2: RunnableGraph[Future[Done]]  = s1.toMat(sink)(Keep.right)
  val res1: NotUsed = rg1.run()

  Thread.sleep(1000)

  val res2: Future[Done] = rg2.run()
  res2.andThen {
    case _ => sys.terminate()
  }
  
}

我们把焦点放在特别注明的结果类型上面:Source的运算结果Mat类型是NotUsed,Sink的运算结果Mat类型是Future[Done]。从上面这段代码我们看到用toMat选择返回Sink的运算结果Future[Done]才能捕捉到运算终止节点。下面的另一个例子包括了一些组合动作:

  val seq = Seq[Int](1,2,3)
  def toIterator() = seq.iterator
  val flow1: Flow[Int,Int,NotUsed] = Flow[Int].map(_ + 2)
  val flow2: Flow[Int,Int,NotUsed] = Flow[Int].map(_ * 3)
  val s2 = Source.fromIterator(toIterator)
  val s3 = s1 ++ s2

  val s4: Source[Int,NotUsed] = s3.viaMat(flow1)(Keep.right)
  val s5: Source[Int,NotUsed] = s3.via(flow1).async.viaMat(flow2)(Keep.right)
  val s6: Source[Int,NotUsed] = s4.async.viaMat(flow2)(Keep.right)
  (s5.toMat(sink)(Keep.right).run()).andThen {case _ => sys.terminate()}

一般来讲,数据流元素的所有处理过程都合并在一个actor上进行(steps-fusing),这样可以免去actor之间的消息传递,但同时也会限制数据元素的并行处理。aync的作用是指定左边的graph在一个独立的actor上运行。注意:s6=s5。

从上面例子里的组合结果类型我们发现:把一个Flow连接到一个Source上形成了一个新的Source。

实际上我们可以用akka-stream Source提供的方法糖来直接运算数据流,如下:

  s1.runForeach(println)
  val fres = s6.runFold(0)(_ + _)
  fres.onSuccess{case a => println(a)}
  fres.andThen{case _ => sys.terminate()}

下面是Source中的一些runner:

 /**
   * Shortcut for running this `Source` with a fold function.
   * The given function is invoked for every received element, giving it its previous
   * output (or the given `zero` value) and the element as input.
   * The returned [[scala.concurrent.Future]] will be completed with value of the final
   * function evaluation when the input stream ends, or completed with `Failure`
   * if there is a failure signaled in the stream.
   */
  def runFold[U](zero: U)(f: (U, Out) ⇒ U)(implicit materializer: Materializer): Future[U] = runWith(Sink.fold(zero)(f))

  /**
   * Shortcut for running this `Source` with a foldAsync function.
   * The given function is invoked for every received element, giving it its previous
   * output (or the given `zero` value) and the element as input.
   * The returned [[scala.concurrent.Future]] will be completed with value of the final
   * function evaluation when the input stream ends, or completed with `Failure`
   * if there is a failure signaled in the stream.
   */
  def runFoldAsync[U](zero: U)(f: (U, Out) ⇒ Future[U])(implicit materializer: Materializer): Future[U] = runWith(Sink.foldAsync(zero)(f))

  /**
   * Shortcut for running this `Source` with a reduce function.
   * The given function is invoked for every received element, giving it its previous
   * output (from the second element) and the element as input.
   * The returned [[scala.concurrent.Future]] will be completed with value of the final
   * function evaluation when the input stream ends, or completed with `Failure`
   * if there is a failure signaled in the stream.
   *
   * If the stream is empty (i.e. completes before signalling any elements),
   * the reduce stage will fail its downstream with a [[NoSuchElementException]],
   * which is semantically in-line with that Scala's standard library collections
   * do in such situations.
   */
  def runReduce[U >: Out](f: (U, U) ⇒ U)(implicit materializer: Materializer): Future[U] =
    runWith(Sink.reduce(f))

  /**
   * Shortcut for running this `Source` with a foreach procedure. The given procedure is invoked
   * for each received element.
   * The returned [[scala.concurrent.Future]] will be completed with `Success` when reaching the
   * normal end of the stream, or completed with `Failure` if there is a failure signaled in
   * the stream.
   */
  // FIXME: Out => Unit should stay, right??
  def runForeach(f: Out ⇒ Unit)(implicit materializer: Materializer): Future[Done] = runWith(Sink.foreach(f))

它们的功能都是通过runWith实现的:

 /**
   * Connect this `Source` to a `Sink` and run it. The returned value is the materialized value
   * of the `Sink`, e.g. the `Publisher` of a [[akka.stream.scaladsl.Sink#publisher]].
   */
  def runWith[Mat2](sink: Graph[SinkShape[Out], Mat2])(implicit materializer: Materializer): Mat2 = toMat(sink)(Keep.right).run()

实际上是使用了Sink类里的对应方法Sink.???。

下面是本次的示范源代码:

import akka.actor._
import akka.stream._
import akka.stream.scaladsl._
import akka._
import scala.concurrent._

object SourceDemo extends App {
  implicit val sys=ActorSystem("demo")
  implicit val mat=ActorMaterializer()
  implicit val ec=sys.dispatcher

  val s1: Source[Int,NotUsed] = Source(1 to 10)
  val sink: Sink[Any,Future[Done]] = Sink.foreach(println)
  val rg1: RunnableGraph[NotUsed] = s1.to(sink)
  val rg2: RunnableGraph[Future[Done]]  = s1.toMat(sink)(Keep.right)
  val res1: NotUsed = rg1.run()

  Thread.sleep(1000)

  val res2: Future[Done] = rg2.run()
  res2.andThen {
    case _ =>   //sys.terminate()
  }

  val seq = Seq[Int](1,2,3)
  def toIterator() = seq.iterator
  val flow1: Flow[Int,Int,NotUsed] = Flow[Int].map(_ + 2)
  val flow2: Flow[Int,Int,NotUsed] = Flow[Int].map(_ * 3)
  val s2 = Source.fromIterator(toIterator)
  val s3 = s1 ++ s2

  val s4: Source[Int,NotUsed] = s3.viaMat(flow1)(Keep.right)
  val s5: Source[Int,NotUsed] = s3.via(flow1).async.viaMat(flow2)(Keep.right)
  val s6: Source[Int,NotUsed] = s4.async.viaMat(flow2)(Keep.right)
  (s5.toMat(sink)(Keep.right).run()).andThen {case _ => } //sys.terminate()}

  s1.runForeach(println)
  val fres = s6.runFold(0)(_ + _)
  fres.onSuccess{case a => println(a)}
  fres.andThen{case _ => sys.terminate()}
  
}

本文参与腾讯云自媒体分享计划,欢迎正在阅读的你也加入,一起分享。

发表于

我来说两句

0 条评论
登录 后参与评论

相关文章

来自专栏码匠的流水账

聊聊storm的JoinBolt

storm-2.0.0/storm-client/src/jvm/org/apache/storm/bolt/JoinBolt.java

1434
来自专栏函数式编程语言及工具

Akka(23): Stream:自定义流构件功能-Custom defined stream processing stages

    从总体上看:akka-stream是由数据源头Source,流通节点Flow和数据流终点Sink三个框架性的流构件(stream components)...

4188
来自专栏程序员宝库

Java 8 时间 API 快速入门

Java 8 出来很久了,各位也可能已经在用了,不过其中新的时间日期 API 可能很少人用,甚至不知道怎么上手。本文快速介绍一下其中的主要的类的概念和用法。 一...

2885
来自专栏积累沉淀

研究MapReduce源码之实现自定义LineRecordReader完成多行读取文件内容

TextInputFormat是Hadoop默认的数据输入格式,但是它只能一行一行的读记录,如果要读取多行怎么办? 很简单 自己写一个输入格式,然后写一个对...

1939
来自专栏数据结构与算法

BZOJ2434: [Noi2011]阿狸的打字机(AC自动机 树状数组)

1083
来自专栏码匠的流水账

聊聊flink的RichParallelSourceFunction

flink-streaming-java_2.11-1.6.2-sources.jar!/org/apache/flink/streaming/api/func...

1481
来自专栏码匠的流水账

聊聊jump consistent hash

jump consistent hash是一致性哈希的一种实现,论文见A Fast, Minimal Memory, Consistent Hash Algor...

1672
来自专栏java沉淀

urlconnection post请求

1464
来自专栏Netkiller

Struts Ajax Json

Netkiller Java 手札 Java, Servlet, JavaBean ... 5.4. Ajax + JSON struts.xml 中加入 ...

2593
来自专栏码匠的流水账

聊聊storm的LoggingMetricsConsumer

storm-2.0.0/storm-client/src/jvm/org/apache/storm/metric/LoggingMetricsConsumer....

1033

扫码关注云+社区