To materialize the DataFrame in the cache, you should do:

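<pre><code>import org.apache.spark.sql.SaveMode

val factTraffic = spark.read.parquet(factTrafficData)
// save the data as the table f_traffic
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
// cache is lazy: this only marks the table for caching
val df_factTraffic = spark.table("f_traffic").cache
// run an action to force the cache to be populated
df_factTraffic.rdd.count
// now df_factTraffic is materialized in memory
</code></pre>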

See also <a href="https://stackoverflow.com/a/42719358/1138523">https://stackoverflow.com/a/42719358/1138523</a> 

But it's questionable whether this makes sense at all, because Parquet is a columnar file format (meaning that projection is very efficient), and if you need different columns for each query, the caching will not help you.
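To illustrate the projection point, here is a minimal sketch that reads only the columns a calculation needs straight from Parquet. The object name `ColumnPruningSketch`, the helper `readColumns` and the column names in the usage note are hypothetical and not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ColumnPruningSketch {
  // Select only the requested columns from a Parquet dataset.
  // Because Parquet is columnar, Spark can skip the other columns on disk.
  def readColumns(spark: SparkSession, path: String, columns: Seq[String]): DataFrame = {
    require(columns.nonEmpty, "at least one column is required")
    spark.read.parquet(path).select(columns.head, columns.tail: _*)
  }
}
```

For example, `ColumnPruningSketch.readColumns(spark, factTrafficData, Seq("col_a", "col_b")).explain()` prints the physical plan; the `ReadSchema` of the Parquet scan should list only the selected columns, which is why reading 6-7 of 30 columns can be cheap even without caching.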

It sounds like you're running on Databricks, so your query might be automatically benefitting from the <a href="https://databricks.com/blog/2018/01/09/databricks-cache-boosts-apache-spark-performance.html" rel="nofollow noreferrer">Databricks IO Cache</a>. From the Databricks <a href="https://docs.databricks.com/user-guide/databricks-io-cache.html" rel="nofollow noreferrer">docs</a>:

<blockquote>
 The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage using fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then executed locally, which results in significantly improved reading speed.
 
 The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, and Azure Data Lake. It does not support other storage formats such as CSV, JSON, and ORC.
</blockquote>

The Databricks IO Cache is supported on Databricks Runtime 3.3 or newer. Whether it is enabled by default depends on the instance type that you choose for the workers on your cluster: currently it is enabled automatically for Azure Ls instances and AWS i3 instances (see the <a href="https://docs.databricks.com/user-guide/databricks-io-cache.html" rel="nofollow noreferrer">AWS</a> and <a href="https://docs.azuredatabricks.net/user-guide/databricks-io-cache.html" rel="nofollow noreferrer">Azure</a> versions of the Databricks documentation for full details).
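If your instance type does not enable it by default, the cache can be switched on explicitly. A minimal sketch, assuming a Databricks notebook where `spark` is the predefined `SparkSession` and using the setting name given in the Databricks IO cache documentation (check the docs for your runtime version):

```scala
// Assumption: runs in a Databricks notebook, where `spark` already exists.
// Turns the Databricks IO cache on for the current session.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```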

If this Databricks IO cache is taking effect then explicitly using Spark's RDD cache with an untransformed base table may harm query performance because it will be storing a second redundant copy of the data (and paying a roundtrip decoding and encoding in order to do so). 

Explicit caching can still make sense if you're caching a transformed dataset, e.g. after filtering to significantly reduce the data volume, but if you only want to cache a large and untransformed base relation then I'd personally recommend relying on the Databricks IO cache and avoiding Spark's built-in RDD cache.
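As a rough sketch of the "cache a transformed dataset" case, the snippet below filters and projects the table first and caches only the reduced result. The object name `CacheFilteredSketch`, the helper `cacheRecent` and the columns `event_date`, `site_id` and `visits` are assumptions for illustration; the question does not give the real schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CacheFilteredSketch {
  // Filter and project first, then cache the much smaller result.
  def cacheRecent(spark: SparkSession, table: String, minDate: String): DataFrame = {
    val reduced = spark.table(table)
      .filter(col("event_date") >= minDate)        // hypothetical filter column
      .select("event_date", "site_id", "visits")  // hypothetical columns
      .cache()                                     // lazy: only marks for caching
    reduced.count()                                // action that populates the cache
    reduced
  }
}
```

Call `unpersist()` on the returned DataFrame once the calculations that reuse it are done, so the memory is released.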

See the full Databricks IO cache documentation for more details, including information on cache warming, monitoring, and a comparison of RDD and Databricks IO caching.

We have a fact table (30 columns) stored in Parquet files on S3. We also created a table on top of these files and cache it afterwards. The table is created using this code snippet:

<pre><code>val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
%sql CACHE TABLE f_traffic
</code></pre>

We run many different calculations on this table (files) and are looking for the best way to cache the data for faster access in subsequent calculations. The problem is that, for some reason, it's faster to read the data from Parquet and do the calculation than to access it from memory. One important note is that we do not use every column: usually around 6-7 columns per calculation, and different columns each time.

Is there a way to cache this table in memory so that we can access it faster than reading from Parquet?

Spark on Databricks - Caching Hive table
