Here's a recommendation; it seems you are running your Databricks notebook on AWS.
One way to optimize it is to use a Hive metastore, or any other catalog service, alongside it. How will this help?
When saving the data, you can use <code>bucketing</code> to order your data according to the merge keys. This metadata needs to be stored in a metastore, which is why Hive is required.
If you use bucketing, the data will be in order, which avoids excessive shuffling and should improve the performance of your job.
I am not very sure about Databricks, but if you use EMR you get the option to use the Glue catalog as the metastore, or you can run your own metastore in EMR.

20 minutes sounds pretty good in my experience ;) What is your partition scheme? Merges are slowed down in the same way that SELECTs are, so if you can eliminate lake scans by way of partition filters, that should help tremendously.
Also take a look at the shuffle partition settings in Spark, as I have found these to have a huge impact on performance.
Lastly, compacting your data will have a huge impact on merge performance.
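As a rough illustration of the partition-filter idea, here is a minimal sketch that builds a merge condition containing both the key match and a literal list of the partitions touched by the incoming delta. The function name, the column names (`k1`, `k2`, `event_date`) and the aliases `t`/`s` are hypothetical and not taken from the question; it also assumes the partition values are trusted strings that need no escaping.

```python
from typing import Iterable, Sequence


def build_merge_condition(
    key_cols: Sequence[str],
    partition_col: str,
    partition_values: Iterable[str],
    target: str = "t",
    source: str = "s",
) -> str:
    """Return a merge condition that joins on the keys and prunes partitions."""
    values = sorted(set(partition_values))
    if not values:
        raise ValueError("at least one partition value is required")
    key_match = " AND ".join(f"{target}.{c} = {source}.{c}" for c in key_cols)
    # Literal partition values give the engine a chance to skip partitions
    # that cannot contain a match.
    in_list = ", ".join(f"'{v}'" for v in values)
    return f"{key_match} AND {target}.{partition_col} IN ({in_list})"


condition = build_merge_condition(
    key_cols=["k1", "k2"],
    partition_col="event_date",
    partition_values=["2021-01-02", "2021-01-01", "2021-01-02"],
)
print(condition)

# With a live SparkSession and the delta-spark package, the string could be
# used roughly like this (path and DataFrame names are placeholders):
#   spark.conf.set("spark.sql.shuffle.partitions", "2000")  # guess; tune it
#   DeltaTable.forPath(spark, path).alias("t") \
#       .merge(updates.alias("s"), condition) \
#       .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
```

Whether this helps depends on the table actually being partitioned by that column and on the delta touching only a small share of the partitions.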

If you really want to optimize it through code, you can launch parallel tasks. This is sample code that we used to parallelize writes to S3. You can use the same logic for an ADLS location as well.
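<pre><code>from concurrent import futures

# writeS3, daterange, start_date, end_date, total_days, raw_bucket, db and
# table are defined elsewhere in the original job.
jobs = []
with futures.ThreadPoolExecutor(max_workers=total_days + 1) as e:
    print(f&quot;{raw_bucket}/{db}/{table}/&quot;)
    # Submit one write task per day in the range.
    for single_date in daterange(start_date, end_date):
        curr_date = single_date.strftime(&quot;%Y-%m-%d&quot;)
        jobs.append(e.submit(writeS3, curr_date))

    # Collect results as tasks finish; result() re-raises worker exceptions.
    for job in futures.as_completed(jobs):
        result_done = job.result()
        print(f&quot;Job Completed - {result_done}&quot;)

print(&quot;Task complete&quot;)
</code></pre>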
Reference: <a href="https://docs.python.org/3/library/concurrent.futures.html" rel="nofollow noreferrer">https://docs.python.org/3/library/concurrent.futures.html</a>
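For readers who want to try the pattern without the surrounding job, here is a self-contained sketch. The names `write_partition` and `run_parallel` are hypothetical stand-ins (the original `writeS3` is not shown in the answer), and this `daterange` helper is assumed to exclude the end date, which may differ from the original.

```python
from concurrent import futures
from datetime import date, timedelta
from typing import Iterator, List


def daterange(start: date, end: date) -> Iterator[date]:
    """Yield each date from start (inclusive) to end (exclusive)."""
    for offset in range((end - start).days):
        yield start + timedelta(days=offset)


def write_partition(curr_date: str) -> str:
    """Hypothetical stand-in for writeS3; a real version would write one day of data."""
    return curr_date


def run_parallel(start: date, end: date, max_workers: int = 8) -> List[str]:
    """Run one write task per day on a thread pool and return the finished dates."""
    completed = []
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        jobs = [
            executor.submit(write_partition, d.strftime("%Y-%m-%d"))
            for d in daterange(start, end)
        ]
        for job in futures.as_completed(jobs):
            # result() re-raises any exception raised inside the worker thread.
            completed.append(job.result())
    # Completion order is not deterministic, so sort before returning.
    return sorted(completed)


done = run_parallel(date(2021, 1, 1), date(2021, 1, 4))
print(done)
```

Threads suit this case because each task mostly waits on storage I/O; capping `max_workers` rather than using one thread per day is a design choice made here, not something the answer prescribes.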

I am having the same issue, with the same data size as well. I'm going to get rid of the merge statement and break it into two pieces.
<ol>
<li>Join where there is a match, then run an UPDATE statement.</li>
<li>Join where there is no match, then INSERT.</li>
</ol>
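The two-step split above can be sketched in plain Python on small in-memory rows, just to show the logic. The function `split_upsert` and the sample columns are hypothetical; on Spark DataFrames the matched set would correspond to an inner join on the key columns and the unmatched set to a join with `how="left_anti"`.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def split_upsert(
    history: List[Row], delta: List[Row], key_cols: Sequence[str]
) -> Tuple[List[Row], List[Row]]:
    """Split delta rows into (rows to UPDATE, rows to INSERT) by composite key."""
    existing_keys = {tuple(row[c] for c in key_cols) for row in history}
    to_update: List[Row] = []
    to_insert: List[Row] = []
    for row in delta:
        key = tuple(row[c] for c in key_cols)
        if key in existing_keys:
            to_update.append(row)  # key already present in history: UPDATE
        else:
            to_insert.append(row)  # key not present in history: INSERT
    return to_update, to_insert


history = [
    {"id": 1, "region": "eu", "value": 10},
    {"id": 2, "region": "us", "value": 20},
]
delta = [
    {"id": 2, "region": "us", "value": 25},
    {"id": 3, "region": "eu", "value": 30},
]
to_update, to_insert = split_upsert(history, delta, key_cols=["id", "region"])
print(to_update)
print(to_insert)
```

Whether two separate passes over the history beat a single merge at this data size is something to measure rather than assume.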

I am trying to implement a merge using Delta Lake OSS. My history data is around 7 billion records and the delta is around 5 million.
The merge is based on a composite key (5 columns).
I am spinning up a 10-node r5d.12xlarge cluster (~3 TB memory / ~480 cores).
The job took 35 minutes the first time, and subsequent runs are taking longer.
I tried several optimization techniques, but nothing worked, and I started to get heap memory issues after 3 runs. I see a lot of spill to disk while the data shuffles. I then tried rewriting the history with an ORDER BY on the merge keys, which improved performance: the merge completed in 20 minutes and the spill was around 2 TB. However, the data written by the merge process is not in the same order, because I have no control over the order in which it is written, so subsequent runs are taking longer.
I was not able to use Z-ordering in Delta Lake OSS, as it is only available with a subscription. I tried compaction, but that did not help either.
Please let me know if there is a better way to optimize the merge process.

Optimizing Merge in Delta Lake (Databricks Open Source)
