Slurm heterogeneous job for multiple tests

Stack Overflow user

Asked 2022-10-24 07:32:57

1 answer · 37 views · 0 followers · 0 votes

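I have to run some tests on an HPC cluster, and I use Slurm as the workload manager. Since I have to run similar tests on different allocations, I decided to take advantage of Slurm's heterogeneous job support. This is my Slurm script: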

# begin of slurm_script.sh
#!/bin/bash 

#SBATCH -p my_partition
#SBATCH --exclusive
#SBATCH --time 16:00:00          # format: HH:MM:SS
#SBATCH -N 1                     # 1 node
#SBATCH --ntasks-per-node=32     # tasks out of 128
#SBATCH --gres=gpu:4             # gpus per node out of 4
#SBATCH --mem=246000             # memory per node out of 246000MB

#SBATCH hetjob

#SBATCH -p my_partition
#SBATCH --exclusive
#SBATCH --time 16:00:00          # format: HH:MM:SS
#SBATCH -N 2                     # 2 nodes
#SBATCH --ntasks-per-node=32     # tasks out of 128
#SBATCH --gres=gpu:4             # gpus per node out of 4
#SBATCH --mem=246000             # memory per node out of 246000MB

#SBATCH hetjob

#SBATCH -p my_partition
#SBATCH --exclusive
#SBATCH --time 16:00:00          # format: HH:MM:SS
#SBATCH -N 4                     # 4 nodes
#SBATCH --ntasks-per-node=32     # tasks out of 128
#SBATCH --gres=gpu:4             # gpus per node out of 4
#SBATCH --mem=246000             # memory per node out of 246000MB

#SBATCH hetjob

#SBATCH -p my_partition
#SBATCH --exclusive
#SBATCH --time 16:00:00          # format: HH:MM:SS
#SBATCH -N 8                     # 8 nodes
#SBATCH --ntasks-per-node=32     # tasks out of 128
#SBATCH --gres=gpu:4             # gpus per node out of 4
#SBATCH --mem=246000             # memory per node out of 246000MB

#SBATCH hetjob

#SBATCH -p my_partition
#SBATCH --exclusive
#SBATCH --time 16:00:00          # format: HH:MM:SS
#SBATCH -N 16                    # 16 nodes
#SBATCH --ntasks-per-node=32     # tasks out of 128
#SBATCH --gres=gpu:4             # gpus per node out of 4
#SBATCH --mem=246000             # memory per node out of 246000MB

srun --job-name=job1 --output=4cpu_%N_%j.out --het-group=0 script.sh 4

srun --job-name=job2 --output=8cpu_%N_%j.out --het-group=0 script.sh 8

srun --job-name=job3 --output=16cpu_%N_%j.out --het-group=0 script.sh 16

srun --job-name=job4 --output=32cpu_%N_%j.out --het-group=0 script.sh 32

srun --job-name=job5 --output=64cpu_%N_%j.out --het-group=1 script.sh 64

srun --job-name=job6 --output=128cpu_%N_%j.out --het-group=2 script.sh 128

srun --job-name=job7 --output=256cpu_%N_%j.out --het-group=3 script.sh 256

srun --job-name=job8 --output=512cpu_%N_%j.out --het-group=4 script.sh 512

Here, script.sh takes the number of processors as its argument, and it has the form

make cpp_program_I_need_to_run
mkdir -p my_results

mpirun -n $1 cpp_program_I_need_to_run

# other tasks

When I submit it on the cluster with sbatch slurm_script.slurm, the job crashes with exit code 8 and the following errors:

cat slurm-8482798.out 
srun: error: r242n13: tasks 0-31: Exited with exit code 8
srun: launch/slurm: _step_signal: Terminating StepId=8482798.0
srun: error: r242n13: tasks 0-31: Exited with exit code 8
srun: launch/slurm: _step_signal: Terminating StepId=8482798.1
srun: error: r242n13: tasks 0-31: Exited with exit code 8
srun: launch/slurm: _step_signal: Terminating StepId=8482798.2
srun: error: r242n13: tasks 0-31: Exited with exit code 8
srun: launch/slurm: _step_signal: Terminating StepId=8482798.3
...

and also

slurmstepd: error: Unable to create TMPDIR [/scratch_local/slurm_job.8482798]: Permission denied
slurmstepd: error: Setting TMPDIR to /tmp
slurmstepd: error: execve(): /cluster/home/userexternal/username/myfolder/script.sh: Exec format error
slurmstepd: error: execve(): /cluster/home/userexternal/username/myfolder/script.sh: Exec format error
...

and so on for many lines.

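Is there a way to make it work? The only thing I can think of is that the mpirun call in my script.sh is redundant, but I don't have many other ideas.

Thanks in advance.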

cluster-computing

slurm

hpc

Stack Overflow user

Answered 2022-10-24 10:20:38

Indeed, the mpirun command is redundant. Can you clarify what script.sh is supposed to execute?
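Separately from mpirun, the `execve(): ... Exec format error` lines in the log typically mean the kernel could not recognize the file as an executable. For a shell script that usually comes down to a missing `#!` first line, and the script.sh shown in the question starts directly with `make`. Since the step daemon calls execve() on the file rather than handing it to a shell, a quick check before submitting may help. The sketch below is only an illustration: `has_shebang` and `check_launchable` are hypothetical helper names, not part of Slurm.

```shell
#!/bin/bash

# has_shebang FILE: succeed if FILE begins with the two bytes "#!".
has_shebang() {
    [ "$(head -c 2 "$1")" = '#!' ]
}

# check_launchable FILE: report the two usual reasons a direct exec of a
# script fails (no interpreter line, no executable bit).
# Returns 0 if neither problem is found, 1 otherwise.
check_launchable() {
    local file=$1 status=0
    if ! has_shebang "$file"; then
        echo "$file: no '#!' interpreter line" >&2
        status=1
    fi
    if [ ! -x "$file" ]; then
        echo "$file: not executable (try chmod +x)" >&2
        status=1
    fi
    return "$status"
}
```

To use it, source the functions and run `check_launchable ./script.sh` before calling sbatch. If it complains, adding `#!/bin/bash` as the first line of script.sh and running `chmod +x script.sh` should clear both messages.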

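My approach would be to run make beforehand, put mkdir -p my_results after the #SBATCH directives (I assume the directory should be shared among all job components; otherwise you should use an environment variable pointing to node-local storage), and drop mpirun in favor of srun ... cpp_program_I_need_to_run.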
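A rough sketch of how the body of the batch script could look under this approach. It is a dry run that only prints the srun lines, so the mapping from task count to het group can be checked before submitting. Assumptions: the five components keep the 1, 2, 4, 8 and 16 node sizes and the 32 tasks per node from the question; `het_group_for` and `step_command` are hypothetical helpers; using `--ntasks` in place of `mpirun -n $1` is my assumption, and whether srun can launch the MPI program directly depends on how MPI is set up on your cluster.

```shell
#!/bin/bash
# (the #SBATCH directives for the five het components go here, unchanged)

TASKS_PER_NODE=32   # matches --ntasks-per-node in the question

# het_group_for NTASKS: print the index of the component sized for NTASKS,
# assuming components of 1, 2, 4, 8, 16 nodes are het groups 0..4.
het_group_for() {
    local ntasks=$1 nodes group=0
    # Round up to the number of nodes needed at TASKS_PER_NODE tasks each.
    nodes=$(( (ntasks + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))
    # Smallest group whose node count (2^group) covers that many nodes.
    while [ $(( 1 << group )) -lt "$nodes" ]; do
        group=$(( group + 1 ))
    done
    echo "$group"
}

# step_command NTASKS: print the srun line for one test.
step_command() {
    local ntasks=$1
    echo "srun --het-group=$(het_group_for "$ntasks") --ntasks=$ntasks" \
         "--output=${ntasks}cpu_%N_%j.out ./cpp_program_I_need_to_run"
}

# Results directory, created once after the directives (shared filesystem assumed).
mkdir -p my_results

for n in 4 8 16 32 64 128 256 512; do
    step_command "$n"    # dry run: prints the command instead of running it
done
```

Once the printed lines look right, the `echo` inside `step_command` can be removed so that srun is actually invoked. The binary is assumed to have been built with make before submission.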

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/74177839
