Basic usage of Slurm
Last published: 2023-11-19 20:47:57
Terminology
- Resources
- GRES: Generic RESource, a generic resource such as a GPU.
- TRES: Trackable RESources, resources whose usage can be tracked.
- QOS: Quality of Service; in practice, the job priority level.
- association: the link between a user and an account. If a user's association is not in the database, that user cannot run jobs; this can be used to block access through invalid accounts.
- Partition: a queue/partition. Used to group compute nodes and to manage job parallelism, job run time, allowed users, etc., so that resources can be allocated sensibly.
Node states
State | Meaning
---|---
idle | The node is idle and can accept jobs
allocated | All cores on the node are allocated; it cannot take new jobs until running ones release resources
mix | Some cores are in use; the node can still accept jobs
drain | The node has been taken out of service
drng | The node is being taken out of service, but still has jobs running
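These states can be checked at any time with sinfo; a minimal sketch (the format string only picks a few common columns and can be adjusted):

# partitions, availability, time limit, node count, node state, node names
sinfo -o "%P %a %l %D %t %N"
# list drained/down nodes together with the reason they were taken offline
sinfo -R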
squeue
adev0: squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
65646 batch chem mike R 24:19 2 adev[7-8]
65647 batch bio joan R 0:09 1 adev14
65648 batch math phil PD 0:00 6 (Resources)
Use the squeue command to see which jobs exist on the system. The ST column gives the job state: two jobs are running (R for running) and one is pending (PD for pending). The TIME column shows how long a job has been running, in the format days-hours:minutes:seconds. The NODELIST(REASON) field shows where the job is running, or, if it is still pending, the reason. Typical reasons for pending are Resources (waiting for resources to become available) and Priority (queued behind a higher-priority job).
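squeue also takes filters to narrow the listing; a few common ones (a minimal sketch):

# only my own jobs
squeue -u $USER
# only pending jobs
squeue -t PENDING
# only jobs in the batch partition
squeue -p batch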
日常操作
sudo su - root
# bring node2 back into service after it was drained or down
scontrol update NodeName=node2 State=resume
# cancel every job whose squeue line matches smk-gene (here, the job name)
squeue | grep smk-gene | awk '{print $1}' | xargs scancel
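The counterpart of State=resume is draining a node before maintenance; a minimal sketch (the node name and reason are placeholders):

# stop scheduling new jobs on node2; running jobs finish first (state drng, then drain)
scontrol update NodeName=node2 State=drain Reason="planned maintenance"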
scontrol show node node1
NodeName=node1 Arch=x86_64 CoresPerSocket=26
CPUAlloc=0 CPUEfctv=208 CPUTot=208 CPULoad=0.88
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node1 NodeHostName=node1 Version=22.05.3
OS=Linux 5.4.0-26-generic #30-Ubuntu SMP Mon Apr 20 16:58:30 UTC 2020
RealMemory=512000 AllocMem=0 FreeMem=510881 Sockets=4 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=ubuntu,low,med,high
BootTime=2023-03-16T10:48:53 SlurmdStartTime=2023-03-16T10:49:52
LastBusyTime=2023-03-16T10:51:36
CfgTRES=cpu=208,mem=500G,billing=208
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
The corresponding node definition in slurm.conf:
NodeName=node1 CPUs=208 RealMemory=512000 Boards=1 SocketsPerBoard=4 CoresPerSocket=26 ThreadsPerCore=2 State=UNKNOWN
node: free -h
total used free shared buff/cache available
Mem: 503Gi 3.2Gi 498Gi 9.0Mi 1.4Gi 497Gi
Swap: 9.3Gi 0B 9.3Gi
squeue --format="%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %R"
To display more characters of the job name (the full name), simply increase the width in %.30j.
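If you always want this layout, squeue also reads the format string from the SQUEUE_FORMAT environment variable, so it can be set once per shell session (a sketch, assuming a bash-like shell):

export SQUEUE_FORMAT="%.18i %.9P %.30j %.8u %.8T %.10M %.9l %.6D %R"
# squeue now uses the format above without passing --format every time
squeue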
Resource management
srun -w node4 --cpus-per-task=50 --mem-per-cpu=2G sleep 1000
-w (--nodelist) means: run on compute node node4; --cpus-per-task=50 allocates 50 CPUs (threads) to each task; --mem-per-cpu=2G allocates 2 G of memory per CPU.
Use scontrol to inspect this job:
scontrol show job 578444
JobId=578444 JobName=sleep
UserId=zyd(1001) GroupId=zyd(1001) MCS_label=N/A
Priority=1 Nice=0 Account=test QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:16 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-03-16T17:56:13 EligibleTime=2023-03-16T17:56:13
AccrueTime=Unknown
StartTime=2023-03-16T17:56:13 EndTime=Unknown Deadline=N/A
PreemptEligibleTime=2023-03-16T17:56:13 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-16T17:56:13 Scheduler=Main
Partition=ubuntu AllocNode:Sid=master:167597
ReqNodeList=node4 ExcNodeList=(null)
NodeList=node4
BatchHost=node4
NumNodes=1 NumCPUs=50 NumTasks=1 CPUs/Task=50 ReqB:S:C:T=0:0:*:*
TRES=cpu=50,mem=100G,node=1,billing=50
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=50 MinMemoryCPU=2G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/data/metagenomics/pml_nextflow
Power=
As you can see, TRES=cpu=50,mem=100G means 50 trackable CPUs; since we requested 2 G of memory per CPU, mem works out to 100 G (50 × 2 G).
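If accounting via slurmdbd is enabled, the same allocation can also be read back from the accounting database with sacct (a sketch; the job id is the one shown above):

sacct -j 578444 --format=JobID,JobName,AllocCPUS,ReqMem,AllocTRES%40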
srun -w node4 --mem=2G --cpus-per-task=50 sleep 1000
Now change --mem-per-cpu to --mem:
JobId=578445 JobName=sleep
UserId=zyd(1001) GroupId=zyd(1001) MCS_label=N/A
Priority=1 Nice=0 Account=test QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:01:03 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2023-03-16T18:11:38 EligibleTime=2023-03-16T18:11:38
AccrueTime=Unknown
StartTime=2023-03-16T18:11:38 EndTime=Unknown Deadline=N/A
PreemptEligibleTime=2023-03-16T18:11:38 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-16T18:11:38 Scheduler=Main
Partition=ubuntu AllocNode:Sid=master:167597
ReqNodeList=node4 ExcNodeList=(null)
NodeList=node4
BatchHost=node4
NumNodes=1 NumCPUs=50 NumTasks=1 CPUs/Task=50 ReqB:S:C:T=0:0:*:*
TRES=cpu=50,mem=2G,node=1,billing=50
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=50 MinMemoryNode=2G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=sleep
WorkDir=/data/metagenomics/pml_nextflow
Power=
As you can see, in TRES=cpu=50,mem=2G the mem value is now 2 G: --mem requests memory per node rather than per CPU, so the job gets 2 G in total.
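The same resource requests are more commonly written in a batch script; a minimal sbatch sketch under the same assumptions (node4, 50 CPUs per task, 2 G per CPU):

#!/bin/bash
#SBATCH --job-name=sleep
#SBATCH --nodelist=node4
#SBATCH --cpus-per-task=50
#SBATCH --mem-per-cpu=2G
sleep 1000

Save it as, say, sleep.sh, submit it with sbatch sleep.sh, and the allocation can then be inspected with scontrol show job as above.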
Account configuration
# Add the none and test accounts and grant them the corresponding permissions
sacctmgr add account none,test Cluster=MyCluster Description="My slurm cluster" Organization="USTC"
# Add user test1 under the test account
sacctmgr -i add user test1 account=test
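To confirm what was created, sacctmgr can list the accounts and the user's associations (a sketch, using the same where syntax as the commands below):

# list all accounts
sacctmgr show account
# show user test1 together with its associations
sacctmgr show user where name=test1 withassoc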
QOS: Quality of Service, i.e. job priority
sacctmgr show qos format=name,priority
View the QOS associated with accounts (associations):
sacctmgr show assoc
sacctmgr show assoc where name=hmli
sacctmgr show assoc format=cluster,user,qos
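Creating a QOS and attaching it to a user's association is also done through sacctmgr; a hedged sketch (the QOS name high and the priority value are examples, not settings from this cluster):

# create a new QOS and raise its priority factor
sacctmgr add qos high
sacctmgr modify qos high set priority=10
# let user test1 submit with that QOS
sacctmgr modify user where name=test1 set qos+=high
# submit a job against it
srun --qos=high --cpus-per-task=4 sleep 60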
For reference, the full option list from srun --help:
Usage: srun [OPTIONS(0)... [executable(0) [args(0)...]]] [ : [OPTIONS(N)...]] executable(N) [args(N)...]
Parallel run options:
-A, --account=name charge job to specified account
--acctg-freq=<datatype>=<interval> accounting and profiling sampling
intervals. Supported datatypes:
task=<interval> energy=<interval>
network=<interval> filesystem=<interval>
--bb=<spec> burst buffer specifications
--bbf=<file_name> burst buffer specification file
--bcast=<dest_path> Copy executable file to compute nodes
--bcast-exclude=<paths> Shared object directory paths to exclude
-b, --begin=time defer job until HH:MM MM/DD/YY
-c, --cpus-per-task=ncpus number of cpus required per task
--comment=name arbitrary comment
--compress[=library] data compression library used with --bcast
--container Path to OCI container bundle
--cpu-freq=min[-max[:gov]] requested cpu frequency (and governor)
-d, --dependency=type:jobid[:time] defer job until condition on jobid is satisfied
--deadline=time remove the job if no ending possible before
this deadline (start > (deadline - time[-min]))
--delay-boot=mins delay boot for desired node features
-D, --chdir=path change remote current working directory
--export=env_vars|NONE environment variables passed to launcher with
optional values or NONE (pass no variables)
-e, --error=err location of stderr redirection
--epilog=program run "program" after launching job step
-E, --preserve-env env vars for node and task counts override
command-line flags
--gres=list required generic resources
--gres-flags=opts flags related to GRES management
-H, --hold submit job in held state
-i, --input=in location of stdin redirection
-I, --immediate[=secs] exit if resources not available in "secs"
--jobid=id run under already allocated job
-J, --job-name=jobname name of job
-k, --no-kill do not kill job on node failure
-K, --kill-on-bad-exit kill the job if any task terminates with a
non-zero exit code
-l, --label prepend task number to lines of stdout/err
-L, --licenses=names required license, comma separated
-M, --clusters=names Comma separated list of clusters to issue
commands to. Default is current cluster.
Name of 'all' will submit to run on all clusters.
NOTE: SlurmDBD must up.
-m, --distribution=type distribution method for processes to nodes
(type = block|cyclic|arbitrary)
--mail-type=type notify on state change: BEGIN, END, FAIL or ALL
--mail-user=user who to send email notification for job state
changes
--mcs-label=mcs mcs label if mcs plugin mcs/group is used
--mpi=type type of MPI being used
--multi-prog if set the program name specified is the
configuration specification for multiple programs
-n, --ntasks=ntasks number of tasks to run
--nice[=value] decrease scheduling priority by value
--ntasks-per-node=n number of tasks to invoke on each node
-N, --nodes=N number of nodes on which to run (N = min[-max])
-o, --output=out location of stdout redirection
-O, --overcommit overcommit resources
--overlap Allow other steps to overlap this step
--het-group=value hetjob component allocation(s) in which to launch
application
-p, --partition=partition partition requested
--power=flags power management options
--priority=value set the priority of the job to value
--prolog=program run "program" before launching job step
--profile=value enable acct_gather_profile for detailed data
value is all or none or any combination of
energy, lustre, network or task
--propagate[=rlimits] propagate all [or specific list of] rlimits
--pty run task zero in pseudo terminal
--quit-on-interrupt quit on single Ctrl-C
-q, --qos=qos quality of service
-Q, --quiet quiet mode (suppress informational messages)
--reboot reboot block before starting job
-r, --relative=n run job step relative to node n of allocation
-s, --oversubscribe over-subscribe resources with other jobs
-S, --core-spec=cores count of reserved cores
--send-libs[=yes|no] autodetect and broadcast shared objects
--signal=[R:]num[@time] send signal when time limit within time seconds
--slurmd-debug=level slurmd debug level
--spread-job spread job across as many nodes as possible
--switches=max-switches{@max-time-to-wait}
Optimum switches and max time to wait for optimum
--task-epilog=program run "program" after launching task
--task-prolog=program run "program" before launching task
--thread-spec=threads count of reserved threads
-T, --threads=threads set srun launch fanout
-t, --time=minutes time limit
--time-min=minutes minimum time limit (if distinct)
-u, --unbuffered do not line-buffer stdout/err
--use-min-nodes if a range of node counts is given, prefer the
smaller count
-v, --verbose verbose mode (multiple -v's increase verbosity)
-W, --wait=sec seconds to wait after first task exits
before killing job
--wckey=wckey wckey to run job under
-X, --disable-status Disable Ctrl-C status feature
Constraint options:
--cluster-constraint=list specify a list of cluster-constraints
--contiguous demand a contiguous range of nodes
-C, --constraint=list specify a list of constraints
--mem=MB minimum amount of real memory
--mincpus=n minimum number of logical processors (threads)
per node
--reservation=name allocate resources from named reservation
--tmp=MB minimum amount of temporary disk
-w, --nodelist=hosts... request a specific list of hosts
-x, --exclude=hosts... exclude a specific list of hosts
-Z, --no-allocate don't allocate nodes (must supply -w)
Consumable resources related options:
--exact use only the resources requested for the step
(by default, all non-gres resources on each node
in the allocation will be used in the step)
--exclusive[=user] for job allocation, this allocates nodes in
in exclusive mode
for job steps, this is equivalent to --exact
--exclusive[=mcs] allocate nodes in exclusive mode when
cpu consumable resource is enabled
and mcs plugin is enabled (--exact implied)
or don't share CPUs for job steps
--mem-per-cpu=MB maximum amount of real memory per allocated
cpu required by the job.
--mem >= --mem-per-cpu if --mem is specified.
--resv-ports reserve communication ports
Affinity/Multi-core options: (when the task/affinity plugin is enabled)
For the following 4 options, you are
specifying the minimum resources available for
the node(s) allocated to the job.
--sockets-per-node=S number of sockets per node to allocate
--cores-per-socket=C number of cores per socket to allocate
--threads-per-core=T number of threads per core to allocate
-B --extra-node-info=S[:C[:T]] combine request of sockets per node,
cores per socket and threads per core.
Specify an asterisk (*) as a placeholder,
a minimum value, or a min-max range.
--ntasks-per-core=n number of tasks to invoke on each core
--ntasks-per-socket=n number of tasks to invoke on each socket
--cpu-bind= Bind tasks to CPUs
(see "--cpu-bind=help" for options)
--hint= Bind tasks according to application hints
(see "--hint=help" for options)
--mem-bind= Bind memory to locality domains (ldom)
(see "--mem-bind=help" for options)
GPU scheduling options:
--cpus-per-gpu=n number of CPUs required per allocated GPU
-G, --gpus=n count of GPUs required for the job
--gpu-bind=... task to gpu binding options
--gpu-freq=... frequency and voltage of GPUs
--gpus-per-node=n number of GPUs required per allocated node
--gpus-per-socket=n number of GPUs required per allocated socket
--gpus-per-task=n number of GPUs required per spawned task
--mem-per-gpu=n real memory required per allocated GPU
Help options:
-h, --help show this help message
--usage display brief usage message
Other options:
-V, --version output version information and exit
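Putting a few of these options together, a typical interactive session request looks like this (a sketch; the partition name ubuntu is taken from the node listing earlier):

# open an interactive shell on one node with 4 CPUs and 8 G of memory
srun --partition=ubuntu --nodes=1 --ntasks=1 --cpus-per-task=4 --mem=8G --pty bash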