spark问题集锦

6,819次阅读
没有评论

共计 1112 个字符,预计需要花费 3 分钟才能阅读完成。

spark问题集锦

今天刚打了一个jar包放在yarn集群运行报了错误,干脆整个文章记录所有遇到的问题

1、Lost Executor Due to Heartbeat Timeout

If you see errors like the following:

2016-10-09T19:56:51,174 - WARN  [dispatcher-event-loop-5:Loggingclass@70] - Removing executor 1 with no recent heartbeats: 160532 ms exceeds timeout 120000 ms

2016-10-09T19:56:51,175 - ERROR [dispatcher-event-loop-5:Loggingclass@74] - Lost executor 1 on ip-10-44-188-82.ec2.internal: Executor heartbeat timed out after 160532 ms

2016-10-09T19:56:51,178 - WARN  [dispatcher-event-loop-5:Logging$class@70] - Lost task 22.0 in stage 1.0 (TID 166, ip-10-44-188-82.ec2.internal): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 160532 ms

This is most likely due to an OOM in the executor JVM (preventing it from maintaining the heartbeat with the application driver). However, we’ve seen cases where tasks fail, but the job still completes, so you’ll need to wait it out to see if the job recovers.

Another situation when this may occur is when a shuffle size (incoming data for a particular task) exceeds 2GB. This is hard to predict in advance because it depends on job parallelism and the number of records produced by earlier stages. The solution is to re-submit the job with increased job parallelism.

正文完
请博主喝杯咖啡吧!
post-qrcode
 
admin
版权声明:本站原创文章,由 admin 2018-04-09发表,共计1112字。
转载说明:除特殊说明外本站文章皆由CC-4.0协议发布,转载请注明出处。
评论(没有评论)
验证码