
Yarn JobHistory Directory Permission Issue Causes MapReduce Job Failures

Friendly reminder: for full-size, high-resolution screenshots, open this article on a mobile phone and tap an image to zoom in.

1. Problem Description

A Hive MapReduce job fails to run; the log is as follows:

0: jdbc:hive2://localhost:10000>select count(*) from student;

command(queryId=hive_20170902081616_d676f921-c62c-4fac-84b9-272663a2fca0); Time taken: 10.029 seconds

Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2)

0: jdbc:hive2://localhost:10000>
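The beeline output only reports return code 2 from MapRedTask, so the failure has to be traced on the Yarn side. As a quick check (a sketch using the standard Yarn CLI), the failed application can be located first:

# List recently failed Yarn applications to find the application id behind the Hive query.
yarn application -list -appStates FAILED
# The application id can then be used to pull the container logs (see the yarn logs example later in this article).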

Running a MapReduce job directly also fails; the log is as follows:

[root@ip-172-31-6-148 hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples.jar pi 5 5
...
Diagnostics: Exception from container-launch.
Container id: container_1504338960864_0005_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
17/09/02 08:19:36 INFO mapreduce.Job: Counters: 0
Job Finished in 8.452 seconds
java.io.FileNotFoundException: File does not exist: hdfs://ip-172-31-6-148:8020/user/root/QuasiMonteCarlo_1504340365604_1994724640/out/reduce-out
        at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1266)
        at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1258)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1258)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1820)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1844)
        at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
        at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
        at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
        at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
[root@ip-172-31-6-148 hadoop-mapreduce]# 

The job logs also cannot be viewed through the JobHistory page:
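If the JobHistory page shows nothing for the job, the History Server can also be queried directly over its REST API; a minimal check, assuming the default web port 19888 and that the JobHistory Server runs on the host from the hdfs URI above (adjust host/port to your cluster):

# Ask the MapReduce JobHistory Server REST API for the jobs it knows about.
curl -s http://ip-172-31-6-148:19888/ws/v1/history/mapreduce/jobs
# If the failed job is missing from the response, its history files were never moved
# into the JobHistory done directory, which points at a problem on the HDFS side.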

2. Problem Analysis

1. Checking the Yarn ResourceManager log shows that the Container cannot be launched; the exception is as follows:

Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
…
Container id: container_1504341269835_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
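The same container diagnostics can also be pulled with the Yarn CLI; a sketch using the application id taken from the excerpt above:

# Fetch the container logs for the failed application (requires log aggregation to be enabled).
yarn logs -applicationId application_1504341269835_0001
# This prints the stdout/stderr of each container, including the AM container that exited with code 1.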

2. Checking the NodeManager log shows the following exception:

2017-09-02 08:37:35,317 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1504341269835_0001_01_000001 and exit code: 1
ExitCodeException exitCode=1: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:601)
        at org.apache.hadoop.util.Shell.run(Shell.java:504)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:786)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2017-09-02 08:37:35,326 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2017-09-02 08:37:35,326 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1504341269835_0001_01_000001

3. Checking the JobHistory service log:

2017-09-02 08:40:31,676 INFO org.apache.hadoop.mapreduce.v2.hs.JobHistory: Starting scan to move intermediate done files
2017-09-02 08:40:32,880 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root (auth:PROXY) via mapred (auth:SIMPLE) cause:java.io.FileNotFoundException:
File does not exist: /user/root/.staging/job_1504341269835_0001/job_1504341269835_0001.summary
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2037)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2007)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1920)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:572)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:89)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)

2017-09-02 08:40:32,882 WARN org.apache.hadoop.mapreduce.v2.hs.KilledHistoryService: Could not process job files
java.io.FileNotFoundException: File does not exist: /user/root/.staging/job_1504341269835_0001/job_1504341269835_0001.summary
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2037)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2007)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1920)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:572)
        at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:89)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

4. Checking the HDFS NameNode log shows the following exception:

2017-09-02 08:37:29,445 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/root/.staging/job_1504341269835_0001/job.xml is closed by DFSClient_NONMAPREDUCE_478129775_1
2017-09-02 08:37:29,451 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 172.31.10.118:50010 is added to blk_1073744484_3660 size 106954
2017-09-02 08:37:35,265 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: P
ermission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:35,265 INFO org.apache.hadoop.ipc.Server: IPC Server handler 29 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 172.31.5.190:46293 Call#5 
Retry#0: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:40,188 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: P
ermission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:40,188 INFO org.apache.hadoop.ipc.Server: IPC Server handler 17 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 172.31.10.118:49343 Call#5
 Retry#0: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/user/history":mapred:supergroup:drwxrwx---
2017-09-02 08:37:41,200 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /tmp/hadoop-yarn/fail/root_appattempt_1504341269835_0001_000002 is closed by DFSClient_NONMAPREDUCE_-
860670620_215
2017-09-02 08:37:41,276 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073744476_3652 172.31.10.118:50010 172.31.9.33:50010 172.31.5.190:50010 

Analysis process:

  1. Checked the Yarn ResourceManager log; no cause was found.
  2. Checked the NodeManager log; no cause was found.
  3. The JobHistory logs cannot be viewed. A MapReduce job first writes its temporary history files under the submitting user's own directory (/user/<user>/<job>), and then moves them into the /user/history directory.
  4. The HDFS NameNode log shows that the temporary history files produced by the job cannot be written into the /user/history directory.
  5. The root cause is that the permissions on the HDFS /user/history directory are too restrictive (mode 770, owned by mapred:supergroup), so the submitting user cannot write the Yarn job history there (see the check below).
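Before changing anything, the directory state can be confirmed directly from HDFS; a minimal check, assuming the hdfs superuser is reachable via sudo:

# List /user to see the owner and mode of the history directory.
sudo -u hdfs hadoop fs -ls /user
# Problem state, matching the NameNode log above:
# drwxrwx---   - mapred supergroup          0 ... /user/history
# With mode 770 and owner mapred:supergroup, the submitting user (root here) has no
# EXECUTE/WRITE access on /user/history, so the job's history files cannot be moved in.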

3. Solution

Modify the permissions and ownership of the /user/history directory:

sudo -u hdfs hadoop dfs -chmod 777 /user/history
sudo -u hdfs hadoop dfs -chown mapred:hadoop /user/history
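After the change, the new mode and owner can be verified and a job re-run to confirm that history files are written; a minimal check:

# Verify the new permissions and ownership of /user/history.
sudo -u hdfs hadoop fs -ls /user | grep history
# Expected after the change:
# drwxrwxrwx   - mapred hadoop          0 ... /user/history
# Re-run the example job; it should now complete and appear in the JobHistory UI.
hadoop jar hadoop-mapreduce-examples.jar pi 5 5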

Before the permission change:

After the permission change, the history files are written normally and MapReduce jobs run successfully.


Welcome to follow Hadoop实操 to get more hands-on Hadoop content as soon as it is published; if you like it, please follow and share.

This is an original article; reposting is welcome, but please credit it as reposted from the WeChat public account Hadoop实操.

This article was shared from the WeChat public account Hadoop实操 (gh_c4c535955d0f); author: Fayson.


Originally published: 2017-09-08
