Deploying an image to one particular Agent in the Mesos cluster kept hanging at Waiting, even though the Mesos UI showed that the Agent had enough resources (4 CPU / 14 GB mem, with only 1 CPU / 256 MB mem in use). To reproduce the problem, I deployed the 2048 image to that Agent with the following Marathon JSON:
{
  "id": "/2048-test",
  "cmd": null,
  "cpus": 0.01,
  "mem": 32,
  "disk": 0,
  "instances": 1,
  "constraints": [
    ["hostname", "CLUSTER", "10.140.0.15"]
  ],
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "alexwhen/docker-2048",
      "network": "BRIDGE",
      "privileged": false,
      "parameters": [],
      "forcePullImage": false
    }
  },
  "portDefinitions": [
    {
      "port": 10008,
      "protocol": "tcp",
      "labels": {}
    }
  ]
}
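For reference, submitting this JSON through Marathon's REST API is a one-liner; a sketch assuming Marathon listens on 10.140.0.14 at its default port 8080 and the JSON is saved as 2048-test.json:
curl -X POST http://10.140.0.14:8080/v2/apps -H "Content-Type: application/json" -d @2048-test.json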
Checking the Marathon container logs with docker logs marathon_container:
...
run_jar --task_launch_timeout 900000 --zk zk://10.140.0.14:2181/marathon --event_subscriber http_callback --https_address 10.140.0.14 --http_address 10.140.0.14 --hostname 10.140.0.14 --master zk://10.140.0.14:2181/mesos --logging_level warn
...
Nothing unusual there.
Up to this point I had been assuming the problem was on the Marathon side, so I went through the Marathon docs looking for a common Troubleshooting entry.
Sure enough, there is one: An app Does Not Leave “Waiting”
This means that Marathon does not receive “Resource Offers” from Mesos that allow it to start tasks of this application. The simplest failure is that there are not sufficient resources available in the cluster or another framework hoards all these resources. You can check the Mesos UI for available resources. Note that the required resources (such as CPU, Mem, Disk) have to be all available on a single host. If you do not find the solution yourself and you create a github issue, please append the output of Mesos /state endpoint to the bug report so that we can inspect available cluster resources.
Following that hint, I went to fetch Mesos's /state information. The Mesos state API returns a JSON file with the full state of the current Mesos cluster:
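A sketch of fetching it, assuming the master runs at 10.140.0.14 on the default port 5050 (older Mesos versions expose the same data at /state.json):
curl -s http://10.140.0.14:5050/state -o mesos-state.json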
Then, after formatting it in an online JSON editor, I looked at the Agent's current resource allocation:
"resources": {
  "cpus": 4,
  "disk": 97267,
  "mem": 14016,
  "ports": "[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
},
"used_resources": {
  "cpus": 1,
  "disk": 0,
  "mem": 128,
  "ports": "[16957-16957]"
},
"offered_resources": {
  "cpus": 0,
  "disk": 0,
  "mem": 0
},
"reserved_resources": {
  "foo": {
    "cpus": 3,
    "disk": 0,
    "mem": 10000
  }
},
"unreserved_resources": {
  "cpus": 1,
  "disk": 97267,
  "mem": 4016,
  "ports": "[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, 8182-32000]"
}
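The same view is available without an online editor, using jq on the command line (assuming jq is installed; agents appear under the slaves key in the /state JSON):
curl -s http://10.140.0.14:5050/state | jq '.slaves[] | {hostname, resources, used_resources, reserved_resources, unreserved_resources}'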
From this output we can see that although only 1 CPU / 128 MB mem is in use, 3 CPU / 10000 MB mem are reserved for the role foo, which leaves no spare CPU for new offers. This is the root cause of Marathon being unable to deploy the container to this Mesos Agent.
All that is needed is to free up the reserved resources on this Agent:
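If the reservation was created dynamically, it can be released through the master's /master/unreserve operator endpoint, as described in the Mesos reservation docs; a sketch with placeholder agent ID and principal (a static reservation would instead require changing the agent's --resources flag and restarting it):
curl -i \
  -d slaveId=<agent-id> \
  -d resources='[{"name": "cpus", "type": "SCALAR", "scalar": {"value": 3}, "role": "foo", "reservation": {"principal": "<principal>"}}, {"name": "mem", "type": "SCALAR", "scalar": {"value": 10000}, "role": "foo", "reservation": {"principal": "<principal>"}}]' \
  -X POST http://10.140.0.14:5050/master/unreserve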
Takeaways from Problem 1: check the logs, and check the documentation; open-source projects like Marathon/Spark usually ship a Troubleshooting section, and the /state API is a great helper for analyzing the cluster.
Problem 2, in short: a container deployed through Marathon was again stuck showing waiting, but this time it was not a resource problem; it was a docker image problem.
A colleague developed the open-source project linkerConnector, whose main purpose is to read the Linux /proc directory and collect process information. To make deployment easier, I containerized this Golang project (the containerized usage is documented here). But deploying it to the Mesos Cluster kept failing, and Marathon kept showing waiting.
I first ran through the same checks as in Problem 1, with no findings.
Logging in to the Mesos Agent and running docker ps -a:
b13e79caca0a linkerrepository/linker_connector "/bin/sh -c '/linkerC" 17 minutes ago Created mesos-c64aa327-a803-40bb-9239-91fbd
Then docker inspect on that container:
"State": {
  "Status": "Created",
  "Running": false,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 0,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2016-08-26T08:22:40.713934966Z",
  "FinishedAt": "0001-01-01T00:00:00Z"
}
(Since I had already deleted the containers that failed, the output above was adapted from a current container, but the information matches what I saw back then.)
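When only a couple of fields matter, docker inspect's standard --format flag avoids wading through the full JSON:
docker inspect -f '{{.State.Status}} {{.State.Error}}' b13e79caca0a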
After the project was updated and the image rebuilt, this problem went away, but I worked out the cause:
The service in the container needs to read the host's /proc directory, so the host's /proc was mounted into the container with -v /proc:/proc. But the container also writes to its own /proc, and the host keeps writing to the host's /proc; since the container's /proc and the host's /proc were mounted on top of each other, reads and writes conflicted and the container kept failing to start. The fix is to mount the host's /proc to a non-/proc path inside the container, and pass a parameter telling the service in the container where to read the /proc information from.
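A minimal sketch of that fix (the --procPath flag is a hypothetical stand-in for whatever parameter linkerConnector actually accepts for the proc location):
docker run -d -v /proc:/host/proc:ro linkerrepository/linker_connector /linkerConnector --procPath=/host/proc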
Takeaway from Problem 2: docker ps / docker inspect / docker logs are the first tools to reach for when a container misbehaves.