blocks|key|699878|text|当从静态类型结构(不依赖于Dataset参数)创建schema时，Spark使用一组相对简单的规则来确定nullable属性。|type|unstyled|depth|inlineStyleRanges|offset|length|style|CODE|entityRanges|data|699879|如果给定类型的对象可以是null，那么它的DataFrame表示是nullable。|unordered-list-item|699880|如果对象是Option[_]，那么它的DataFrame表示是nullable，None被认为是SQL+NULL。|699881|在任何其他情况下，它都将被标记为非nullable。|699882|因为Scala+String是java.lang.String，可以是null，所以生成的列可以是nullable。出于同样的原因，bar列是初始数据集中的nullable：|699883|val+data1+=+Seq[(Int,+String)]((2,+"A"),+(2,+"B"),+(1,+"C"))
val+df1+=+data1.toDF("foo",+"bar")
df1.schema("bar").nullable|code-block|syntax|javascript|699884|Boolean+=+true|699885|但foo不是(scala.Int不能是null)。|699886|df1.schema("foo").nullable|699887|Boolean+=+false|699888|如果我们将数据定义更改为：|699889|val+data2+=+Seq[(Integer,+String)]((2,+"A"),+(2,+"B"),+(1,+"C"))|699890|foo将是nullable+(Integer是java.lang.Integer，装箱整数可以是null)：|699891|data2.toDF("foo",+"bar").schema("foo").nullable|699892|Boolean+=+true|699893|还请参阅：火花-20668修改ScalaUDF以处理空性。|699894|entityMap|0|LINK|mutability|MUTABLE|url|https://issues.apache.org/jira/browse/SPARK-20668^0|D|7|P|6|1G|8|0|C|4|L|9|X|8|0|5|9|J|9|V|8|14|4|1G|4|0|H|8|0|8|6|F|G|Z|4|1D|8|1U|3|26|8|0|0|0|1|3|7|9|J|4|0|0|0|0|0|0|3|5|8|F|7|N|H|1C|4|0|0|0|5|8|0|0^^$0|@$1|2|3|4|5|6|7|1L|8|@$9|1M|A|1N|B|C]|$9|1O|A|1P|B|C]|$9|1Q|A|1R|B|C]]|D|@]|E|$]]|$1|F|3|G|5|H|7|1S|8|@$9|1T|A|1U|B|C]|$9|1V|A|1W|B|C]|$9|1X|A|1Y|B|C]]|D|@]|E|$]]|$1|I|3|J|5|H|7|1Z|8|@$9|20|A|21|B|C]|$9|22|A|23|B|C]|$9|24|A|25|B|C]|$9|26|A|27|B|C]|$9|28|A|29|B|C]]|D|@]|E|$]]|$1|K|3|L|5|H|7|2A|8|@$9|2B|A|2C|B|C]]|D|@]|E|$]]|$1|M|3|N|5|6|7|2D|8|@$9|2E|A|2F|B|C]|$9|2G|A|2H|B|C]|$9|2I|A|2J|B|C]|$9|2K|A|2L|B|C]|$9|2M|A|2N|B|C]|$9|2O|A|2P|B|C]]|D|@]|E|$]]|$1|O|3|P|5|Q|7|2Q|8|@]|D|@]|E|$R|S]]|$1|T|3|U|5|Q|7|2R|8|@]|D|@]|E|$R|S]]|$1|V|3|W|5|6|7|2S|8|@$9|2T|A|2U|B|C]|$9|2V|A|2W|B|C]|$9|2X|A|2Y|B|C]]|D|@]|E|$]]|$1|X|3|Y|5|Q|7|2Z|8|@]|D|@]|E|$R|S]]|$1|Z|3|10|5|Q|7|30|8|@]|D|@]|E|$R|S]]|$1|11|3|12|5|6|7|31|8|@]|D|@]|E|$]]|$1|13|3|14|5|Q|7|32|8|@]|D|@]|E|$R|S]]|$1|15|3|16|5|6|7|33|8|@$9|34|A|35|B|C]|$9|36|A|37|B|C]|$9|38|A|39|B|C]|$9|3A|A|3B|B|C]|$9|3C|A|3D|B|C]]|D|@]|E|$]]|$1|17|3|18|5|Q|7|3E|8|@]|D|@]|E|$R|S]]|$1|19|3|1A|5|Q|7|3F|8|@]|D|@]|E|$R|S]]|$1|1B|3|1C|5|6|7|3G|8|@]|D|@$9|3H|A|3I|1|3J]]|E|$]]|$1|1D|3|-4|5|6|7|3K|8|@]|D|@]|E|$]]]|1E|$1F|$5|1G|1H|1I|E|$1J|1K]]]]

When creating <code>Dataset</code> from statically typed structure (without depending on <code>schema</code> argument) Spark uses a relatively simple set of rules to determine <code>nullable</code> property.

<ul>
<li>If object of the given type can be <code>null</code> then its <code>DataFrame</code> representation is <code>nullable</code>.</li>
<li>If object is an <code>Option[_]</code> then then its <code>DataFrame</code> representation is <code>nullable</code> with <code>None</code> considered to be SQL <code>NULL</code>.</li>
<li>In any other case it will be marked as not <code>nullable</code>.</li>
</ul>

Since Scala <code>String</code> is <code>java.lang.String</code>, which can be <code>null</code>, generated column can is <code>nullable</code>. For the same reason <code>bar</code> column is <code>nullable</code> in the initial dataset:

<pre class="lang-scala prettyprint-override"><code>val data1 = Seq[(Int, String)]((2, "A"), (2, "B"), (1, "C"))
val df1 = data1.toDF("foo", "bar")
df1.schema("bar").nullable
</code></pre>

<pre class="lang-none prettyprint-override"><code>Boolean = true
</code></pre>

but <code>foo</code> is not (<code>scala.Int</code> cannot be <code>null</code>).

<pre class="lang-scala prettyprint-override"><code>df1.schema("foo").nullable
</code></pre>

<pre class="lang-none prettyprint-override"><code>Boolean = false
</code></pre>

If we change data definition to:

<pre class="lang-scala prettyprint-override"><code>val data2 = Seq[(Integer, String)]((2, "A"), (2, "B"), (1, "C"))
</code></pre>

<code>foo</code> will be <code>nullable</code> (<code>Integer</code> is <code>java.lang.Integer</code> and boxed integer can be <code>null</code>):

<pre class="lang-scala prettyprint-override"><code>data2.toDF("foo", "bar").schema("foo").nullable
</code></pre>

<pre class="lang-none prettyprint-override"><code>Boolean = true
</code></pre>

See also: <a href="https://issues.apache.org/jira/browse/SPARK-20668" rel="noreferrer">SPARK-20668</a> Modify ScalaUDF to handle nullability.

blocks|key|699905|text|您也可以非常快速地更改dataframe模式。像这样的事就能完成任务-|type|unstyled|depth|inlineStyleRanges|entityRanges|data|699906|def+setNullableStateForAllColumns(+df:+DataFrame,+columnMap:+Map[String,+Boolean])+:+DataFrame+=+{
++++import+org.apache.spark.sql.types.{StructField,+StructType}
++++//+get+schema
++++val+schema+=+df.schema
++++val+newSchema+=+StructType(schema.map+{
++++case+StructField(+c,+d,+n,+m)+=>
++++++StructField(+c,+d,+columnMap.getOrElse(c,+default+=+n),+m)
++++})
++++//+apply+new+schema
++++df.sqlContext.createDataFrame(+df.rdd,+newSchema+)
}|code-block|syntax|javascript|699907|entityMap^0|0|0^^$0|@$1|2|3|4|5|6|7|I|8|@]|9|@]|A|$]]|$1|B|3|C|5|D|7|J|8|@]|9|@]|A|$E|F]]|$1|G|3|-4|5|6|7|K|8|@]|9|@]|A|$]]]|H|$]]

You could change schema of dataframe very quickly as well. something like this would do the job - 

<pre><code>def setNullableStateForAllColumns( df: DataFrame, columnMap: Map[String, Boolean]) : DataFrame = {
 import org.apache.spark.sql.types.{StructField, StructType}
 // get schema
 val schema = df.schema
 val newSchema = StructType(schema.map {
 case StructField( c, d, n, m) =&gt;
 StructField( c, d, columnMap.getOrElse(c, default = n), m)
 })
 // apply new schema
 df.sqlContext.createDataFrame( df.rdd, newSchema )
}
</code></pre>

Why is <code>nullable = true</code> used after some functions are executed even though there are no NaN values in the <code>DataFrame</code>.

<pre class="lang-scala prettyprint-override"><code>val myDf = Seq((2,"A"),(2,"B"),(1,"C"))
 .toDF("foo","bar")
 .withColumn("foo", 'foo.cast("Int"))

myDf.withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2").show
</code></pre>

When <code>df.printSchema</code> is called now <code>nullable</code> will be <code>false</code> for both columns.

<pre class="lang-scala prettyprint-override"><code>val foo: (Int =&gt; String) = (t: Int) =&gt; {
 fooMap.get(t) match {
 case Some(tt) =&gt; tt
 case None =&gt; "notFound"
 }
 }

val fooMap = Map(
 1 -&gt; "small",
 2 -&gt; "big"
 )
val fooUDF = udf(foo)

myDf
 .withColumn("foo", fooUDF(col("foo")))
 .withColumn("foo_2", when($"foo" === 2 , 1).otherwise(0)).select("foo", "foo_2")
 .select("foo", "foo_2")
 .printSchema
</code></pre>

However now, <code>nullable</code> is <code>true</code> for at least one column which was <code>false</code> before. How can this be explained?

Why do columns change to nullable in Apache Spark SQL?

翻译质量差，导致语言生硬或混乱。

没有提供实际的解决方法或示例。

解答不清晰，无法理解或解决问题。

页面排版不美观，阅读体验差。

文章

问答

视频

学习中心

腾讯云实验室

直播

竞赛

腾讯云代码分析专区

腾讯iOA零信任安全管理系统专区

腾讯云架构师技术同盟交流圈

腾讯云数据库专区

腾讯云顾问专区

腾讯云原生专区

腾讯混元专区

腾讯云TCE专区

腾讯云Lighthouse专区

腾讯云HAI专区

腾讯云Edgeone专区

腾讯云存储专区

腾讯云智能专区

腾讯轻联专区 

腾讯云开发专区

TAPD专区

腾讯轻量云游戏服专区

腾讯云最具价值专家

腾讯云架构师技术同盟

腾讯云创作之星

腾讯云开发者先锋

腾讯云代码助手

云原生构建

TAPD 敏捷项目管理

Cloud Studio

SDK中心

API中心

命令行工具

涵盖代码开发、场景应用、自动测试全流程，助你从零构建专属AI助手

一站式MCP教程库，解锁AI应用新玩法

为什么在执行某些函数之后使用nullable = true，即使DataFrame中没有NaN值。val myDf = Seq((2,"A"),(2,"B"),(1,"C"))         .toDF("foo","bar")         .withColumn("foo", 'foo.cast("Int"))...

问为什么在中列更改为可空？
EN

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问为什么在中列更改为可空？EN