I have a heavyweight Spark application that keeps being retried, and the only useful logs I can find through the UI come from stdout:
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=6321165190495215882, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=7370805066606093965, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=8523609779541081889, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=8861954111730219182, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.56; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:30:52 ERROR TransportRequestHandler:293 - Error sending result RpcResponse{requestId=5535068542584258152, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /192.168.562; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
2021-08-16 17:35:34 ERROR YarnClusterScheduler:70 - Lost executor 205 on compute006: Container container_e434_1628615141783_154721_01_000245 on host: compute006 was preempted.
2021-08-16 17:35:59 ERROR YarnClusterScheduler:70 - Lost executor 203 on compute007: Container container_e434_1628615141783_154721_01_000242 on host: compute007 was preempted.
2021-08-16 17:38:50 ERROR YarnClusterScheduler:70 - Lost executor 209 on data267: Container container_e434_1628615141783_154721_01_000241 on host: data267 was preempted.
2021-08-16 17:40:56 ERROR YarnClusterScheduler:70 - Lost executor 211 on data133: Container container_e434_1628615141783_154721_01_000248 on host: data133 was preempted.
2021-08-16 17:44:01 ERROR YarnClusterScheduler:70 - Lost executor 157 on data034: Container container_e434_1628615141783_154721_01_000185 on host: data034 was preempted.
2021-08-16 17:44:26 ERROR YarnClusterScheduler:70 - Lost executor 202 on data234: Container container_e434_1628615141783_154721_01_000244 on host: data234 was preempted.
2021-08-16 18:05:34 ERROR YarnClusterScheduler:70 - Lost executor 225 on data001: Container container_e434_1628615141783_154721_01_000262 on host: data001 was preempted.
2021-08-16 18:05:49 ERROR YarnClusterScheduler:70 - Lost executor 227 on data244: Container container_e434_1628615141783_154721_01_000264 on host: data244 was preempted.
2021-08-16 18:06:16 ERROR YarnClusterScheduler:70 - Lost executor 214 on data027: Container container_e434_1628615141783_154721_01_000251 on host: data027 was preempted.
2021-08-16 18:06:23 ERROR ApplicationMaster:43 - RECEIVED SIGNAL TERM
2021-08-16 18:06:23 ERROR ApplicationMaster:70 - User application exited with status 143
2021-08-16 18:06:23 ERROR FileFormatWriter:91 - Aborting job ea540d12-ad13-4e88-95fb-d8ac7f250503.
org.apache.spark.SparkException: Job 127 cancelled because SparkContext was shut down
The Spark application often runs successfully with no errors. It sounds like 143 is the typical OOM exit code, but my memory configuration is already fairly generous:
'executor_memory': '10G',
'driver_memory': '12G',
'spark.executor.memoryOverhead': '5G',
'spark.driver.memoryOverhead': '4G',
What is the best way to get to the bottom of this?
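For reference, an exit status of 128 + N conventionally means the process was terminated by signal N, so 143 maps to SIGTERM (the signal YARN sends when it kills or preempts a container) rather than the SIGKILL (137) typically seen on an OOM kill. A quick sanity check:

```python
import signal

# Exit status 128 + N => process terminated by signal N (on Linux).
# 143 - 128 = 15 = SIGTERM: a graceful termination request, e.g. when
# YARN preempts a container, as the YarnClusterScheduler errors above show.
# 137 - 128 = 9  = SIGKILL: what the kernel OOM killer sends.
print(128 + signal.SIGTERM)  # 143
print(128 + signal.SIGKILL)  # 137
```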
Answered on 2021-08-16 21:24:52
You should use the information in the Spark UI to better understand what is happening in your application. Pay attention to spill, shuffle read size, and skew between shuffle read sizes. That information should give you a good idea of what is going on and how to fix or tune the application appropriately; for example, you may need to increase spark.sql.shuffle.partitions.
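As a hypothetical sketch of the tuning mentioned above (the script name and the value 800 are placeholders; the right partition count depends on your data volume), the shuffle parallelism can be raised at submit time so each shuffle read stays small:

```shell
# Hedged example: raise shuffle parallelism from the default of 200.
# Check the Spark UI's stage metrics (spill, shuffle read size/skew)
# after each run to decide whether to adjust further.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.sql.shuffle.partitions=800 \
  my_app.py
```

This is a configuration fragment, not a complete fix: if the container losses are driven by queue preemption, tuning shuffle partitions will not stop YARN from reclaiming executors.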
https://stackoverflow.com/questions/68808623