The reason is in fact pretty simple. When you partition by a column, each partition can only contain one value of that column. It would therefore be pointless to write the same value into every row of the file, which is why Spark does not. When the data is read back, Spark uses the information contained in the directory names (for example <code>centroid0=1</code>) to reconstruct the partitioning column, and appends it to the end of the schema. The type of the column is not stored; it is inferred when reading, hence the integer type in your case.
NB: There is no particular reason why the column is added at the end. It could have been at the beginning. I guess it is just an arbitrary implementation choice.

To avoid losing the type and the order of the columns, you could duplicate the partitioning column like this <code>df.withColumn("X", 'YOUR_COLUMN).write.partitionBy("X").parquet("...")</code>.

You will waste space though. Also, Spark uses the partitioning to optimize filters, for instance by skipping partitions that cannot match. After reading the data, remember to filter on the <code>X</code> column rather than your original column; otherwise Spark won't be able to perform that optimization.
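A minimal sketch of the duplication approach, assuming a local <code>SparkSession</code> and a made-up two-column DataFrame (the <code>score</code> column, the sample rows and the temporary output directory are all hypothetical; only <code>centroid0</code> and <code>X</code> come from the discussion above):

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DuplicatePartitionColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("duplicate-partition-column")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: centroid0 is a string column and comes first.
    val df = Seq(("0", 1.5), ("1", 2.5), ("1", 3.5)).toDF("centroid0", "score")

    // Hypothetical output location, used only for this illustration.
    val path = Files.createTempDirectory("clusters").toString

    // Partition by a copy ("X") so the original column stays inside the files.
    df.withColumn("X", col("centroid0"))
      .write
      .mode("overwrite")
      .partitionBy("X")
      .parquet(path)

    val back = spark.read.parquet(path)

    // centroid0 keeps its string type and its position;
    // X is appended at the end with an inferred integer type.
    back.printSchema()

    // Filter on the partition column X (not centroid0) so that Spark
    // can skip the directories that do not match.
    back.filter(col("X") === 1).show()

    spark.stop()
  }
}
```

The copy costs extra storage, as noted above, because <code>centroid0</code> is now written into every row in addition to appearing in the directory names.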

When you <code>write.partitionBy(...)</code>, Spark saves the partition field(s) as folder(s).
This can be beneficial for reading the data later because (with some file types, parquet included) Spark can read data from just the partitions that you use (i.e. if you read and filter for <code>centroid0==1</code>, Spark won't read the other partitions).

The effect of this is that the partition fields (<code>centroid0</code> in your case) are not written into the parquet files, only as folder names (<code>centroid0=1</code>, <code>centroid0=2</code>, etc.).

This has two side effects. The first is that the type of the partition column is inferred at run time (since its schema is not saved in the parquet files); in your case you happened to have only integer values, so it was inferred as integer.

The other side effect is that the partition field is added at the end of the schema: Spark reads the schema from the parquet files as one chunk and then appends the partition field(s) as another (again, the partition field is no longer part of the schema that is stored in the parquet files).
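Since the type is inferred at read time, one way to keep the partition column as a string is to switch that inference off. The sketch below assumes a local <code>SparkSession</code> and uses the Spark SQL setting <code>spark.sql.sources.partitionColumnTypeInference.enabled</code> (enabled by default); the <code>score</code> column, the sample rows and the temporary path are hypothetical:

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession

object KeepPartitionColumnAsString {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-type-inference")
      // With inference disabled, partition columns are read back as strings.
      .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical output location and sample data, for illustration only.
    val path = Files.createTempDirectory("clusters").toString
    Seq(("0", 1.5), ("1", 2.5)).toDF("centroid0", "score")
      .write
      .mode("overwrite")
      .partitionBy("centroid0")
      .parquet(path)

    val back = spark.read.parquet(path)

    // centroid0 is still appended after the data columns, but as a string.
    back.printSchema()

    // The original column order has to be restored explicitly.
    back.select("centroid0", "score").printSchema()

    spark.stop()
  }
}
```

The setting only affects the inferred type; the column position is unchanged, so a <code>select</code> is still needed to put the partition column back in front.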

You can actually quite easily reuse the column ordering of a case class that holds the schema of your partitioned data. You need to read the data from the root path, the one under which the partition directories are stored, so that Spark infers the values of these columns. Then simply re-order the columns using the case class schema, with a statement like:
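<pre><code>import org.apache.spark.sql.{Encoder, Encoders}

// RecordType is your case class describing a full record, partition columns included
val encoder: Encoder[RecordType] = Encoders.product[RecordType]
spark.read
 .schema(encoder.schema)
 .format(&quot;parquet&quot;)
 .option(&quot;mergeSchema&quot;, &quot;true&quot;)
 .load(myPath)
 // reorder columns: when reading partitioned data, the partitioning columns are put at the end
 .select(encoder.schema.fieldNames.head, encoder.schema.fieldNames.tail: _*)
 // pass the encoder explicitly so no implicit Encoder[RecordType] has to be in scope
 .as[RecordType](encoder)
</code></pre>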

For a given DataFrame just before being <code>save</code>'d to <code>parquet</code> here is the schema: notice that the <code>centroid0</code> is the first column and is <code>StringType</code>:

<a href="https://i.stack.imgur.com/JkQhl.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/JkQhl.png" alt="enter image description here"></a>

However when saving the file using:

<pre><code> df.write.partitionBy(dfHolder.metadata.partitionCols: _*).format("parquet").mode("overwrite").save(fpath)
</code></pre>

and with the <code>partitionCols</code> as <code>centroid0</code>:

<a href="https://i.stack.imgur.com/8eNp2.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/8eNp2.png" alt="enter image description here"></a>

then there is a (to me) surprising result:

<ul>
<li>the <code>centroid0</code> partition column has been moved to the end of the Row</li>
<li>the data type has been changed to <code>Integer</code></li>
</ul>

I confirmed the output path via <code>println</code> :

<pre><code> path=/git/block/target/scala-2.11/test-classes/data/output/blocking/out//level1/clusters
</code></pre>

And here is the schema upon reading back from the saved <code>parquet</code>:

<a href="https://i.stack.imgur.com/0oZhI.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/0oZhI.png" alt="enter image description here"></a>

Why are those two modifications to the input schema occurring - and how can they be avoided - while still maintaining the <code>centroid0</code> as a partitioning column?

Update: a preferred answer should mention why/when the partition columns were added to the end (vs the beginning) of the column list. We need an understanding of the deterministic ordering.

In addition - is there any way to cause <code>spark</code> to "change its mind" on the inferred column types? I have had to change the partitions from <code>0</code>, <code>1</code> etc to <code>c0</code>, <code>c1</code> etc in order to get the inference to map to <code>StringType</code>. Maybe that was required, but if there were some Spark setting to change the behavior, that would make for an excellent answer.

Partition column is moved to end of row when saving a file to Parquet
