YARN JobHistory Directory Permission Problem Causes MapReduce Jobs to Fail

1. Problem Description

Hive's MapReduce jobs fail to run. The log is as follows:

0: jdbc:hive2://localhost:10000>select count(*) from student;

command(queryId=hive_20170902081616_d676f921-c62c-4fac-84b9-272663a2fca0); _Time_taken: 10.029 seconds

Error: Error while processing statement: FAILED: Execution Error,return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

0: jdbc:hive2://localhost:10000>

MapReduce jobs submitted directly also fail. The log is as follows:

[root@ip-172-31-6-148 hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples.jar pi 5 5
...
Diagnostics: Exception from container-launch.
Container id: container_1504338960864_0005_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
17/09/02 08:19:36 INFO mapreduce.Job: Counters: 0
Job Finished in 8.452 seconds
java.io.FileNotFoundException: File does not exist: hdfs://ip-172-31-6-148:8020/user/root/QuasiMonteCarlo_1504340365604_1994724640/out/reduce-out
        at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
        at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1820)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1844)
        at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
        at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
[root@ip-172-31-6-148 hadoop-mapreduce]# 

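The job's logs also cannot be viewed through the JobHistory page.

2. Problem Analysis

1. Check the YARN ResourceManager log. The container fails to launch, with the following exception: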

Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
…
Container id: container_1504341269835_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

2. Check the NodeManager log. The exception is as follows:

2017-09-02 08:37:35,317 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1504341269835_0001_01_000001 and exit code: 1
ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-09-02 08:37:35,326 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2017-09-02 08:37:35,326 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1504341269835_0001_01_000001

3. Check the JobHistory service log:

2017-09-02 08:40:31,676 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-09-02 08:40:32,880 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root (auth:PROXY) via mapred (auth:SIMPLE) cause:java.io.FileNotFoundException:
File does not exist: /user/root/.staging/job_1504341269835_0001/job_1504341269835_0001.summary
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2037)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2007)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1920)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:572)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:89)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)

2017-09-02 08:40:32,882 WARN org.apache.hadoop.mapreduce.v2.hs.KilledHistoryService: Could not process job files
java.io.FileNotFoundException: File does not exist: /user/root/.staging/job_1504341269835_0001/job_1504341269835_0001.summary
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2037)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2007)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1920)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:572)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:89)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

4. Check the HDFS NameNode log. The exception is as follows:

2017-09-02 08:37:29,445 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/root/.staging/job_1504341269835_0001/job.xml is closed by DFSClient_NONMAPREDUCE_478129775_1
2017-09-02 08:37:29,451 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 172.31.10.118:50010 is added to blk_1073744484_3660 size 106954
2017-09-02 08:37:35,265 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:35,265 INFO org.apache.hadoop.ipc.Server: IPC Server handler 29 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 172.31.5.190:46293 Call#5 Retry#0: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:40,188 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:40,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 17 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 172.31.10.118:49343 Call#5 Retry#0: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:41,200 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/fail/root_appattempt_1504341269835_0001_000002 is closed by DFSClient_NONMAPREDUCE_-860670620_215
2017-09-02 08:37:41,276 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073744476_3652 172.31.10.118:50010 172.31.9.33:50010 172.31.5.190:50010 
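
In a busy NameNode log the denials are easy to miss among the block-state messages. The following is a minimal sketch, not part of the original troubleshooting, that condenses them into one line per distinct denial. The function name `summarize_denials` and the `NAMENODE_LOG` variable are hypothetical; the pattern assumes each "Permission denied" message sits on a single line, so a log copied from a wrapped terminal needs its lines rejoined first.

```shell
#!/bin/sh
# Hypothetical helper: read a NameNode log on stdin and print
# "<count> user=..., access=..., inode=..." for each distinct denial.
summarize_denials() {
  grep -o 'Permission denied: user=[^,]*, access=[^,]*, inode="[^"]*"' \
    | sed -e 's/^Permission denied: //' \
    | sort | uniq -c | sed -e 's/^ *//'
}

# Usage (NAMENODE_LOG is an assumed variable; set it to your NameNode log file):
#   summarize_denials < "$NAMENODE_LOG"
```

Every distinct user, access type and inode combination appears once with its count, which makes a repeated denial on a single directory such as /user/history stand out.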

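Analysis:

  1. The ResourceManager log does not reveal the cause.
  2. The NodeManager log does not reveal the cause.
  3. The job history cannot be viewed because a MapReduce job first creates temporary log files under the submitting user's directory (/user/<user>/<job>) and then moves them to the /user/history directory.
  4. The HDFS NameNode log shows that the temporary log files produced by the job cannot be written to /user/history: user root is denied EXECUTE access on that directory.
  5. The root cause is that the permissions on the HDFS /user/history directory are too restrictive (mapred:supergroup, drwxrwx---), so YARN job logs cannot be recorded.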
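
To see why the NameNode rejects root, here is a minimal sketch of the owner/group/other check that HDFS applies to a directory's mode string. It is illustrative only: the function name `can_traverse` is hypothetical, it ignores ACLs and the HDFS superuser, and it assumes root is not a member of `supergroup` (which the denial in the log implies).

```shell
#!/bin/sh
# Hypothetical helper: may a user traverse (EXECUTE) a directory?
# Args: mode string, owner, group, user, space-separated list of the user's groups.
# Runs in a subshell so its variables do not leak into the caller.
can_traverse() (
  mode=$1; owner=$2; group=$3; user=$4; user_groups=$5
  if [ "$user" = "$owner" ]; then
    cols=2-4                      # owner triplet
  else
    case " $user_groups " in
      *" $group "*) cols=5-7 ;;   # group triplet
      *)            cols=8-10 ;;  # other triplet
    esac
  fi
  bits=$(printf '%s' "$mode" | cut -c"$cols")
  case $bits in
    ??x|??t) echo allowed ;;      # x, or t when the sticky bit is also set
    *)       echo denied ;;
  esac
)

# Values taken from the NameNode log: root falls into "other", which has no bits set.
can_traverse drwxrwx--- mapred supergroup root "root"   # prints: denied
# With the world-accessible mode, the same user is let through.
can_traverse drwxrwxrwx mapred hadoop root "root"       # prints: allowed
```

Only one triplet is consulted per request, so a user who is neither the owner nor in the directory's group is judged by the "other" bits alone.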

3. Solution

Change the permissions and ownership of the /user/history directory:

sudo -u hdfs hdfs dfs -chmod 777 /user/history
sudo -u hdfs hdfs dfs -chown mapred:hadoop /user/history
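
After applying the change, it can be worth confirming that it took effect before rerunning the job. The sketch below is an addition to the original steps: `check_dir_listing` is a hypothetical helper that reads the output of `hdfs dfs -ls -d` on stdin and compares the owner, group and "other" execute bit against the values passed as arguments. It assumes the usual listing layout of mode, replication, owner, group, size, date, time and path.

```shell
#!/bin/sh
# Hypothetical helper: verify owner:group and world-traversability of a directory
# from one line of `hdfs dfs -ls -d <dir>` output read on stdin.
# Args: expected owner, expected group.
check_dir_listing() {
  awk -v owner="$1" -v group="$2" '
    /^d/ {
      found = 1
      # Field 1 is the mode string; its 10th character is the "other" execute bit
      # (x, or t when the sticky bit is also set).
      ok = ($3 == owner && $4 == group && substr($1, 10, 1) ~ /^[xt]$/)
      verdict = ok ? "OK" : "MISMATCH"
      print verdict ": " $1 " " $3 ":" $4 " " $NF
      exit (ok ? 0 : 1)
    }
    END { if (!found) { print "no directory entry found"; exit 2 } }
  '
}

# Usage on a cluster node:
#   sudo -u hdfs hdfs dfs -ls -d /user/history | check_dir_listing mapred hadoop
```

The function exits with 0 when the listing matches, 1 on a mismatch and 2 when no directory entry is found, so it can also gate a follow-up test job in a script.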

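Before the permission change: (screenshot)

After the permission change, data is written normally and MapReduce jobs run successfully.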


Originally published on the WeChat official account Hadoop实操 (gh_c4c535955d0f)

Original publication date: 2017-09-08
