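I ran into the following problem: when I run a machine learning job on Spark, after a few stages all tasks get assigned to a single machine (executor), and stage execution becomes slower and slower.
Spark conf settings: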
val conf = new SparkConf().setMaster(sparkMaster).setAppName("ModelTraining").setSparkHome(sparkHome).setJars(List(jarFile))
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "LRRegistrator")
conf.set("spark.storage.memoryFraction", "0.7")
conf.set("spark.executor.memory", "8g")
conf.set("spark.cores.max", "150")
conf.set("spark.speculation", "true")
conf.set("spark.storage.blockManagerHeartBeatMs", "300000")
val sc = new SparkContext(conf)
val lines = sc.textFile("hdfs://xxx:52310"+inputPath , 3)
val trainset = lines.map(parseWeightedPoint).repartition(50).persist(StorageLevel.MEMORY_ONLY)
The warning logs:
14/09/19 10:26:23 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(45, TS-BH109, 48384, 0)
14/09/19 10:27:18 WARN TaskSetManager: Lost TID 726 (task 14.0:9)
14/09/19 10:29:03 WARN SparkDeploySchedulerBackend: Ignored task status update (737 state FAILED) from unknown executor Actor[akka.tcp://sparkExecutor@TS-BH96:33178/user/Executor#-913985102] with ID 39
14/09/19 10:29:03 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(30, TS-BH136, 28518, 0)
14/09/19 11:01:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(47, TS-BH136, 31644, 0) with no recent heart beats: 47765ms exceeds 45000ms
Any suggestions?
Posted on 2014-10-01 19:54:50
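Can you post the executor logs? Is there anything suspicious in them? In particular, you lost TID 726 (task 14.0:9). The driver log should show which executor TID 726 was assigned to. I would check the error logs on that machine to see whether anything interesting shows up there.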
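As a rough sketch of that lookup, the snippet below scans driver log lines for the assignment of a given TID. It assumes the driver logs task launches in the form "Starting task <task> as TID <tid> on executor <id>: <host> (<locality>)", which is what Spark releases from that period appear to print; check your own driver log and adjust the pattern if it differs. The object name, the helper names, and the sample lines (including the executor ID and host) are hypothetical and are not taken from the logs in the question.

```scala
import scala.util.matching.Regex

object FindExecutorForTid {
  // ASSUMPTION: the driver log contains lines such as
  //   "... INFO TaskSetManager: Starting task 14.0:9 as TID 726 on executor 7: example-host-1 (PROCESS_LOCAL)"
  // Adjust this pattern if your Spark version logs task launches differently.
  private val Assignment: Regex =
    """Starting task (\S+) as TID (\d+) on executor (\S+): (\S+)""".r.unanchored

  // Hypothetical holder for the fields extracted from one log line.
  final case class TaskAssignment(task: String, tid: Long, executorId: String, host: String)

  // Returns the first launch record for the given TID, if any line matches.
  def findAssignment(lines: Iterator[String], tid: Long): Option[TaskAssignment] =
    lines.collectFirst {
      case Assignment(task, id, executorId, host) if id.toLong == tid =>
        TaskAssignment(task, tid, executorId, host)
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample lines for illustration only.
    val sample = Seq(
      "14/09/19 10:20:00 INFO TaskSetManager: Starting task 14.0:8 as TID 725 on executor 3: example-host-2 (PROCESS_LOCAL)",
      "14/09/19 10:20:01 INFO TaskSetManager: Starting task 14.0:9 as TID 726 on executor 7: example-host-1 (PROCESS_LOCAL)"
    )
    println(findAssignment(sample.iterator, 726L))
  }
}
```

To run it against a real log, pass scala.io.Source.fromFile(path).getLines() as the first argument instead of the sample. The host it reports is the machine whose executor stderr and stdout logs are worth reading first.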
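My guess (and it is only a guess) is that your executors are crashing. When that happens, the system tries to launch a new executor, which is usually slow. In the meantime, the failed tasks may be resent to the remaining executors, which bogs the system down further.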
https://stackoverflow.com/questions/25926009