Um, About the Cartesian Product (CartesianProduct)

数据仓库践行者 (Data Warehouse Practitioner)

Published 2022-11-25 20:01:11


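This article is part of the column: 数据仓库践行者

Does a Cartesian product cause a shuffle?
Clearing up confusion about narrow dependencies
Finally, an interview question for you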

1. Does a Cartesian product cause a shuffle?

The conclusion: no.

If you search online, the top-ranked answers mostly say the opposite.

But a careful read of the Cartesian product source code shows that it actually works like this:

select tmp1.a,tmp2.b from testdata2 tmp1 join testdata2 tmp2 

== executedPlan ==
CartesianProduct
:- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#3]
:  +- Scan[obj#2]
+- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#12]
   +- Scan[obj#10]

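The Cartesian product's partitioning method (getPartitions in CartesianRDD) creates one child partition for every pair of parent partitions.

Its getDependencies method declares a NarrowDependency on each of the two parent RDDs.

The whole process therefore runs on the map side, with no shuffle.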
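
To make the index arithmetic concrete, here is a minimal plain-Scala sketch of how I understand CartesianRDD to number its partitions. It does not use Spark at all; the object and method names (CartesianIndexModel, childIndex, parents) are my own illustrations, and numPartitionsInRdd2 is assumed to be the partition count of the second parent.

```scala
// Plain-Scala model of the partition index arithmetic; no Spark dependency.
// All names here are illustrative and are not part of Spark's API.
object CartesianIndexModel {

  /** Child partition index for the pair (i, j): i indexes rdd1, j indexes rdd2. */
  def childIndex(i: Int, j: Int, numPartitionsInRdd2: Int): Int =
    i * numPartitionsInRdd2 + j

  /** Parent partitions read by one child partition: exactly one from each parent. */
  def parents(childId: Int, numPartitionsInRdd2: Int): (Int, Int) =
    (childId / numPartitionsInRdd2, childId % numPartitionsInRdd2)

  def main(args: Array[String]): Unit = {
    val (n1, n2) = (2, 3) // assumed partition counts of the two parents
    for (i <- 0 until n1; j <- 0 until n2) {
      val id = childIndex(i, j, n2)
      val (p1, p2) = parents(id, n2)
      println(s"child $id reads rdd1[$p1] and rdd2[$p2]")
    }
  }
}
```

With 2 and 3 parent partitions there are 6 child partitions, and each child reads exactly one partition from each parent. Nothing has to be repartitioned by key, which is why no shuffle stage is needed.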

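2. About narrow dependencies

If you search Baidu for this, most results define a narrow dependency as one where each partition of the parent RDD is used by at most one partition of the child RDD.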

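Understood that way, things become contradictory: in the Cartesian product's dependencies, a single partition of the parent RDD is clearly consumed by multiple partitions of the child RDD, and yet it is a narrow dependency.

Let's look at the source code for narrow dependencies: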

/**
 * :: DeveloperApi ::
 * Base class for dependencies where each partition of the child RDD depends on a small number
 * of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
 */
@DeveloperApi
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}

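Going by this comment, the definition should read: each partition of the child RDD depends on a small number of partitions of the parent RDD.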
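
The apparent contradiction goes away once the dependency is read from the child's side. Here is a small plain-Scala sketch, assuming the child-to-parent mapping id / n2 for the first parent of a Cartesian product. The names n1, n2, NarrowFanOutModel and its methods are hypothetical; this is a model of the idea, not Spark code.

```scala
// Plain-Scala model; no Spark dependency. Names are illustrative only.
object NarrowFanOutModel {

  /** The single rdd1 partition that a given child partition depends on. */
  def parentInRdd1(childId: Int, n2: Int): Int = childId / n2

  /** All child partitions that consume rdd1 partition p (the parent's point of view). */
  def childrenOfRdd1Partition(p: Int, n1: Int, n2: Int): Seq[Int] =
    (0 until n1 * n2).filter(c => parentInRdd1(c, n2) == p)

  def main(args: Array[String]): Unit = {
    val (n1, n2) = (2, 3) // assumed partition counts of the two parents
    // Parent's side: one parent partition feeds several children.
    for (p <- 0 until n1)
      println(s"rdd1 partition $p feeds children: " + childrenOfRdd1Partition(p, n1, n2).mkString(", "))
    // Child's side: each child needs only one rdd1 partition.
    for (c <- 0 until n1 * n2)
      println(s"child $c depends on rdd1 partition ${parentInRdd1(c, n2)}")
  }
}
```

Seen from the parent, each rdd1 partition fans out to n2 children. Seen from the child, the number of parent partitions it needs is fixed and small, which matches the wording of the source comment.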

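That got me wondering: where did the definition found online come from?

Digging through earlier versions of the source code, I found the origin: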