cluster management

Last published: 2022-10-21 16:36:29

https://slurm.schedmd.com/qos.html

NodeName=X11QPH CPUs=192 Boards=1 SocketsPerBoard=4 CoresPerSocket=24 ThreadsPerCore=2 State=UNKNOWN
NodeName=PowerEdge-R720 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=s1 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2  State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
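
On each NodeName line, CPUs should equal Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore. A minimal sketch that checks this for the three lines above, assuming they are pasted in as a string; parse_node_line and expected_cpus are illustrative helper names, not part of Slurm:

```python
# Sanity-check that CPUs matches the topology product on NodeName lines.
# parse_node_line and expected_cpus are illustrative helpers, not Slurm APIs.

NODE_LINES = """\
NodeName=X11QPH CPUs=192 Boards=1 SocketsPerBoard=4 CoresPerSocket=24 ThreadsPerCore=2 State=UNKNOWN
NodeName=PowerEdge-R720 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
NodeName=s1 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2  State=UNKNOWN
"""


def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' line into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split())


def expected_cpus(node):
    """Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore."""
    return (int(node["Boards"]) * int(node["SocketsPerBoard"])
            * int(node["CoresPerSocket"]) * int(node["ThreadsPerCore"]))


nodes = [parse_node_line(line) for line in NODE_LINES.splitlines() if line.strip()]
for node in nodes:
    status = "ok" if int(node["CPUs"]) == expected_cpus(node) else "MISMATCH"
    print(node["NodeName"], node["CPUs"], expected_cpus(node), status)
```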

Job state, compact form: PD (pending), R (running), CA (cancelled), CF (configuring), CG (completing), CD (completed), F (failed), TO (timeout), NF (node failure), RV (revoked) and SE (special exit state). See the JOB STATE CODES section below for more information. (Valid for jobs only)
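
When post-processing squeue output in a script, the compact codes listed above can be kept in a small lookup table. A sketch only; describe_state is a hypothetical helper, and the table covers just the codes named in the text:

```python
# Lookup table for the compact job state codes listed in the text above.
JOB_STATE_CODES = {
    "PD": "pending",
    "R": "running",
    "CA": "cancelled",
    "CF": "configuring",
    "CG": "completing",
    "CD": "completed",
    "F": "failed",
    "TO": "timeout",
    "NF": "node failure",
    "RV": "revoked",
    "SE": "special exit state",
}


def describe_state(code):
    """Return the long name for a compact state code, or 'unknown'."""
    return JOB_STATE_CODES.get(code.strip().upper(), "unknown")


for code in ("PD", "R", "CG"):
    print(code, "->", describe_state(code))
```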

scontrol reconfig
scontrol show config


Account configuration

# Add the accounts none and test and grant them the corresponding permissions
sacctmgr add account none,test Cluster=MyCluster Description="My slurm cluster" Organization="USTC"

# Add user test1 to the test account
sacctmgr -i add user test1 account=test

QOS: Quality of Service, used to set job priority
sacctmgr show qos format=name,priority
View the QOS associated with each account
sacctmgr show assoc
sacctmgr show assoc where name=hmli
sacctmgr show assoc format=cluster,user,qos
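
The association listing is easier to consume from a script in parsable form. A sketch, assuming sacctmgr is called with -n (no header) and -P (pipe-separated fields); the SAMPLE text is invented for illustration (cluster name and the QOS name normal are assumptions), and parse_assoc and fetch_assoc are hypothetical helpers:

```python
import subprocess

FIELDS = ["cluster", "user", "qos"]


def parse_assoc(text, fields=FIELDS):
    """Turn pipe-separated association rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(dict(zip(fields, line.split("|"))))
    return rows


def fetch_assoc():
    """Run sacctmgr on a machine where Slurm is installed (not called here)."""
    cmd = ["sacctmgr", "-n", "-P", "show", "assoc", "format=" + ",".join(FIELDS)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Invented sample output: an account-level row (empty user) and a user row.
SAMPLE = "mycluster||normal\nmycluster|test1|normal\n"

for row in parse_assoc(SAMPLE):
    if row["user"]:  # skip account-level associations
        print(row["user"], row["qos"].split(","))
```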

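Configuration file

Job Completion Logging

# Job Completion Logging
JobCompType=jobcomp/filetxt
# Mechanism used to record job completion; defaults to None. One of:
    # None: do not record job completion information
    # Elasticsearch: record job completion information on an Elasticsearch server
    # FileTxt: record job completion information in a plain text file
    # Lua: record job completion information using a file named jobcomp.lua
    # Script: process the raw job completion information with an arbitrary script, then record it
    # MySQL: write the completion status to a MySQL or MariaDB database
JobCompLoc=/var/log/slurm/jobcomp
# Where the database runs and how to connect to it
JobCompHost=localhost # host name of the database that stores job completion information
# JobCompPort= # port the job completion database server listens on
JobCompUser=slurm # user name used to talk to the job completion database
JobCompPass=SomePassWD # password used to talk to the job completion database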


https://cloud.tencent.com/developer/ask/sof/1037834

Job Accounting Gather

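JobAcctGatherType=jobacct_gather/linux # Slurm records the resources consumed by each job; JobAcctGatherType is one of:
    # jobacct_gather/none: no job accounting
    # jobacct_gather/cgroup: collect Linux cgroup information
    # jobacct_gather/linux: collect Linux process table information (recommended)
JobAcctGatherFrequency=30 # polling interval in seconds; 0 disables periodic sampling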

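# Job Accounting Storage
AccountingStorageType=accounting_storage/slurmdbd # together with job accounting gathering, Slurm can store accounting information in several ways; one of:
    # accounting_storage/none: do not record accounting information
    # accounting_storage/slurmdbd: write job accounting information to the Slurm DBD database
# AccountingStorageLoc: file location or database name, given as a full absolute path or as the name of the database; defaults to slurm_acct_db when slurmdbd is used

# Accounting database settings and how to connect
AccountingStorageHost=localhost # accounting database host name
# AccountingStoragePort= # port the accounting database service listens on
AccountingStorageUser=slurm # accounting database user name
AccountingStoragePass=SomePassWD # accounting database user password. For SlurmDBD this is instead an alternate socket name for the Munge daemon, providing enterprise-wide authentication
# AccountingStoreFlags= # comma-separated list. Options are:
    # job_comment: store the job comment field in the database
    # job_script: store the job script in the database
    # job_env: store the batch job's environment variables
# AccountingStorageTRES=gres/gpu # needed when configuring GPUs
# GresTypes=gpu # needed when configuring GPUs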
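
To check which of these settings are actually active (as opposed to commented out), a small parser is enough. A minimal sketch, assuming one Key=Value pair per line and # starting a comment; parse_slurm_conf is a hypothetical helper and does not handle multi-key lines such as NodeName=... or Include directives:

```python
def parse_slurm_conf(text):
    """Collect active Key=Value settings, ignoring comments and blank lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# A few lines in the style of the configuration shown above.
SAMPLE = """
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp
 JobAcctGatherType=jobacct_gather/linux # per-job resource usage
JobAcctGatherFrequency=30
 AccountingStorageType=accounting_storage/slurmdbd
 # AccountingStoragePort=
"""

conf = parse_slurm_conf(SAMPLE)
for key in sorted(conf):
    print(f"{key} = {conf[key]}")
```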

slurmdbd: error: Database settings not recommended values: innodb_buffer_pool_size innodb_lock_wait_timeout

SET GLOBAL innodb_lock_wait_timeout = 900;

https://slurm.schedmd.com/accounting.html

srun -w s1 hostname
# -w, --nodelist=hosts...     request a specific list of hosts
sacct -j 515064 --format=JobID,Start,End,Elapsed,NCPUS
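
The Elapsed and NCPUS columns can be combined into CPU-seconds. A sketch, assuming Elapsed is printed as [DD-][HH:]MM:SS; elapsed_to_seconds and cpu_seconds are hypothetical helper names:

```python
def elapsed_to_seconds(elapsed):
    """Convert an Elapsed string such as '1-02:03:04' or '00:01:30' to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in elapsed.split(":")]
    while len(parts) < 3:  # pad missing hours (or hours and minutes) with zeros
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_seconds(elapsed, ncpus):
    """Wall-clock seconds multiplied by the number of allocated CPUs."""
    return elapsed_to_seconds(elapsed) * int(ncpus)


print(elapsed_to_seconds("00:01:30"))
print(cpu_seconds("00:01:30", 4))
```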


https://en.wikipedia.org/wiki/Comparison_of_cluster_software


slurmd: error: If munged is up, restart with --num-threads=10
slurmd: error: Munge encode failed: Failed to access "/run/munge/munge.socket.2": No such file or directory
slurmd: error: slurm_send_node_msg: auth_g_create: MESSAGE_NODE_REGISTRATION_STATUS has authentication error
slurmd: error: Unable to register: Protocol authentication error
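
These errors usually mean that slurmd cannot reach the munged socket. A quick check of the path named in the message can narrow it down; this is a sketch, and socket_status is a hypothetical helper (it only inspects the file, it does not test authentication):

```python
import os
import stat

# Path copied from the error message above.
MUNGE_SOCKET = "/run/munge/munge.socket.2"


def socket_status(path=MUNGE_SOCKET):
    """Report whether the given path exists and is a Unix socket."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    except PermissionError:
        return "permission denied"
    return "socket present" if stat.S_ISSOCK(mode) else "not a socket"


print(MUNGE_SOCKET, "->", socket_status())
```

If the result is missing, munged is probably not running on this node or uses a different socket path.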


https://blog.csdn.net/muyuu/article/details/119780385

http://hmli.ustc.edu.cn/doc/linux/slurm-install/slurm-install.html
https://blog.wanghaiqing.com/article/9353fb21-cc9a-4acc-a8d2-2c2f7de4a9d6/
gqqnbig
https://www.modb.pro/db/327877

sinfo -e -p <partition_name> -o"%9P %3c %.5D %6t" -t idle,mix

/usr/sbin/slurmd --conf-server admin1:6817

Resources

https://hpc.nmsu.edu/discovery/slurm/slurm-commands/
https://hpc.nmsu.edu/discovery/software/modules/
https://modules.readthedocs.io/en/latest/INSTALL.html#installation-instructions
https://ucdavis-bioinformatics-training.github.io/2017-June-RNA-Seq-Workshop/monday/cluster.html
https://bicmr.pku.edu.cn/~wenzw/pages/examples.html
https://docs.hpc.sjtu.edu.cn/system/index.html
https://docs.slurm.cn/users/
https://hpc.sicau.edu.cn/syzn/slurm.htm
https://doc.sist.aaaab3n.moe/job/slurm.html#job-slurm--page-root

Installing NFS

https://blog.csdn.net/m0_59474046/article/details/123802030
https://blog.csdn.net/weixin_44377280/article/details/106965314
sudo apt-get install nfs-kernel-server
sudo apt-get install nfs-common

sudo systemctl status nfs-kernel-server.service

scontrol show node
scontrol show slurm reports

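After /etc/exports is modified, the NFS server process must be restarted for the change to take effect. Alternatively, run the exportfs command.

The options of exportfs and what they do:

-a: export all directories listed in /etc/exports

-v: print each directory as it is exported or unexported

-r: re-export all directories listed in /etc/exports

-u: unexport the specified directory; combined with -a, unexport all directories listed in /etc/exports

-i: allow exporting directories not listed in /etc/exports, or exporting with options other than those given in /etc/exports

-f: specify another file to use instead of /etc/exports

-o: specify the options for the exported directory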

https://blog.csdn.net/weixin_29304021/article/details/116621254

Testing resource limits

Use the C++ code from the article 《浅谈Linux Cgroups机制》 (A Brief Look at the Linux Cgroups Mechanism).

#include <unistd.h>
#include <stdio.h>
#include <cstring>
#include <thread>

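// Spin forever so that this thread keeps one CPU core fully busy.
void test_cpu() {
    printf("thread: test_cpu start\n");
    // volatile and unsigned: the loop is not optimized away and wraparound is well defined
    volatile unsigned int total = 0;
    while (1) {
        ++total;
    }
}

// Allocate and touch 10 MB per second for 20 seconds; the blocks are
// deliberately never freed so that resident memory keeps growing.
void test_mem() {
    printf("thread: test_mem start\n");
    int step = 20;
    int size = 10 * 1024 * 1024; // 10 MB
    for (int i = 0; i < step; ++i) {
        char* tmp = new char[size];
        memset(tmp, i, size); // write to the pages so they become resident
        sleep(1);
    }
    printf("thread: test_mem done\n");
}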

int main(int argc, char** argv) {
    std::thread t1(test_cpu);
    std::thread t2(test_mem);
    t1.join();
    t2.join();
    return 0;
}

Compile and run

$ g++ -o test test.cc --std=c++11 -lpthread
$ ./test

htop

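CPU usage is 100% and memory slowly grows to about 400 MB. (If htop shows three test entries, press H to switch to process mode and only one will be shown.) Running srun ./test gives the same memory usage.

Limiting memory

Now enforce a memory limit. Write the following to /etc/slurm-llnl/cgroup.conf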

CgroupAutomount=yes
MaxRAMPercent=0.1
ConstrainRAMSpace=yes

In /etc/slurm-llnl/slurm.conf, add or change

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

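Restart slurmctld and slurmd and make sure neither reports any errors. Now run srun ./test again: RES is capped at 12 MB while VIRT is very large, which shows that the memory limit works. (Virtual memory, however, was not limited.)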
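
The RES and VIRT figures shown by htop can also be read from inside a job on Linux, which is convenient when the test runs under srun. A sketch, assuming /proc/self/status is available; read_vm_stats is a hypothetical helper, VmRSS corresponds to RES and VmSize to VIRT:

```python
import os


def read_vm_stats(path="/proc/self/status"):
    """Return VmRSS and VmSize in kB as reported by the kernel."""
    stats = {}
    with open(path) as fh:
        for line in fh:
            key, _, rest = line.partition(":")
            if key in ("VmRSS", "VmSize"):
                stats[key] = int(rest.split()[0])  # first token is the value in kB
    return stats


if os.path.exists("/proc/self/status"):
    stats = read_vm_stats()
    print("RES  (VmRSS) kB:", stats.get("VmRSS"))
    print("VIRT (VmSize) kB:", stats.get("VmSize"))
```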

Limiting CPUs

Create the file test-cpu.py and run it. test-cpu.py prints the number of CPUs and creates that many threads.

#!/usr/bin/env python3

import threading, multiprocessing
import time

print(multiprocessing.cpu_count())

def loop():
    x = 0
    while True:
        x = x ^ 1

for i in range(multiprocessing.cpu_count()):
    t = threading.Thread(target=loop)
    print(f'create a thread {i}...', flush=True)

    t.start()

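ps -eLF | grep test-cpu | sort -n -k9

Column 9 shows the CPU each test-cpu thread is running on; test-cpu.py turns out to be spread over several CPUs.

Now apply a Slurm limit: run srun -c4 ./test-cpu.py and then run ps again. This time test-cpu.py runs on 4 CPUs.

This shows that when you request CPUs from Slurm, Slurm allocates only that many. Although multiprocessing.cpu_count() still reports the real number of CPUs, the job cannot use all of them.

You can also use scontrol show job to check whether the -c setting took effect.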
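
A script can ask how many CPUs it is actually allowed to run on instead of trusting multiprocessing.cpu_count(). A sketch for Linux, assuming Slurm restricts the job through CPU affinity (for example via the task/cgroup plugin configured earlier); usable_cpu_count is a hypothetical helper:

```python
import multiprocessing
import os


def usable_cpu_count():
    """Number of CPUs this process may run on, falling back to the total."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return multiprocessing.cpu_count()


print("total CPUs: ", multiprocessing.cpu_count())
print("usable CPUs:", usable_cpu_count())
# Set by Slurm when -c/--cpus-per-task is given; absent outside such a job.
print("SLURM_CPUS_PER_TASK:", os.environ.get("SLURM_CPUS_PER_TASK", "(not set)"))
```

Run under srun -c4, the usable count would be expected to be 4 while the total stays unchanged, so sizing a thread pool from usable_cpu_count() avoids oversubscribing the allocation.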

Requesting GPUs

Run tf-GPU-test.py and request 4 GPUs.

srun --gres=gpu:4 python tf-GPU-test.py

import time

if __name__ == '__main__':
	# Needs roughly 300 MB of memory
	arr = [None] * 10000000

	time.sleep(10)
	for i in range(len(arr)):
		arr[i] = i


	time.sleep(10)
	print('done')

https://stackoverflow.com/questions/52421171/slurm-exceeded-job-memory-limit-with-python-multiprocessing

$ srun --mem=100G hostname
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

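Our machine does not have 100G of memory, so the job cannot run. Changing the command to srun --mem=10M hostname lets it run. This also shows that --mem does not limit the memory used by the task itself.

Problems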
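
A --mem value can be compared with the node's memory before submitting. A sketch, assuming the rejection comes from the request exceeding the memory configured for the node; --mem defaults to megabytes and accepts the suffixes K, M, G and T. NODE_MEMORY_MB is an invented figure, and to_megabytes and fits_on_node are hypothetical helpers:

```python
# Multipliers that convert each --mem suffix to megabytes.
_FACTORS_MB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}

# Hypothetical node memory in MB; replace with the value for your node.
NODE_MEMORY_MB = 16000


def to_megabytes(spec):
    """Convert a --mem value such as '100G', '10M' or '2048' to megabytes."""
    spec = spec.strip().upper()
    if spec and spec[-1] in _FACTORS_MB:
        return int(spec[:-1]) * _FACTORS_MB[spec[-1]]
    return int(spec)  # no suffix: already megabytes


def fits_on_node(spec, node_mb=NODE_MEMORY_MB):
    """True if the request does not exceed the node's memory."""
    return to_megabytes(spec) <= node_mb


for request in ("100G", "10M"):
    print(request, to_megabytes(request), "MB, fits:", fits_on_node(request))
```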


configure: WARNING: *** mysql_config not found. Evidently no MySQL development libs installed on system.

 libmysqlclient-dev
