ARCTIC resources use the Slurm workload manager to manage the cluster. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Please refer to the Slurm Quick Start guide for more information on its architecture and a general discussion.
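For example, before submitting any work you can inspect the cluster and the job queue with standard Slurm commands (the exact partitions and nodes shown will depend on the ARCTIC configuration):
sinfo # list the partitions and the state of their nodes
squeue -u $USER # list your pending and running jobs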
ARCTIC resources use a non-exclusive allocation policy. This allows other users to use the computing resources you are using, provided sufficient resources are available. One key point is that once you request resources and Slurm allocates them, those resources are not available to other users. Resources are requested with #SBATCH directives. Create a text file, called the job submission script, and add the following lines to it.
In the example below, you are requesting that all work be executed on a single node (i.e., no inter-node communication), as a single task using 4 CPU cores. The task may use up to 600 MB of memory and has a maximum runtime of 5 minutes. If the task ends earlier, the resources are freed automatically on a successful exit. However, if the job runs longer than 5 minutes, the workload manager terminates it prematurely, because the resource allocation only permits a 5-minute runtime.
#!/bin/bash
#SBATCH --nodes=1 # Run all processes on a single node
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --mem=600mb # Total memory limit
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
It is possible to get email notifications on specific job events such as start, end, or failure. Add the following lines to your job submission script.
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@gsu.edu # Where to send mail
ARCTIC resources are organized into different clusters, and each cluster is further divided into different partitions. Each user may belong to one or more projects, and each project may have different permissions. It is important to specify which project and partition to use at the time of job submission. You can find your project by logging in to elpis. Please refer to the partition documentation for more information about the partitions.
#SBATCH --account=name # Project name (RS00000, ECON0001, MAT0001 ...)
#SBATCH --partition=partition # partition requested (qBF, qTRD , qECON ...)
The Slurm workload manager executes jobs asynchronously in batch mode, meaning there is no interaction with a job once you submit it to the workload manager. The workload manager controls the job until its termination, and you will not see any output or errors from the job in your Linux shell. However, it is possible to capture standard output and standard error to files and examine the job's progress later. The following two lines specify the file locations for standard output and standard error.
#SBATCH --output=<filename pattern> # name of the output file; see the following section for pattern symbols
#SBATCH --error=<filename pattern> # name of the error file; see the following section for pattern symbols
Additional filename patterns: sbatch allows a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter (e.g., %j).
%A Job array's master job allocation number.
%a Job array ID (index) number.
%j Job ID of the running job.
%N Short hostname. This will create a separate IO file per node.
%s Step ID of the running job.
%u User name.
%x Job name.
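For example, the following directives (a hypothetical naming scheme, not a requirement) combine the job name and job ID so that every run writes to its own pair of files:
#SBATCH --output=%x_%j.out # e.g. myjob_12345.out
#SBATCH --error=%x_%j.err # e.g. myjob_12345.err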
ARCTIC resources manage software as modules. Please refer to the Lmod documentation for more information on how to deploy your own modules in your home directory.
Load the necessary software into your workspace using the batch submission file. In the following example, we load Python 3.6 into the workspace.
module load Python3.6
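If you are unsure which modules or versions are available, Lmod provides standard discovery commands (the module names shown are examples and may differ on ARCTIC):
module avail # list modules visible in the current environment
module spider Python # search for all available Python versions
module list # show the currently loaded modules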
Now you are ready to run your workload. Before you use batch processing, it is recommended to use the interactive web interface to make sure your logic works without errors. Then simply add the command you would like to run at the end of the batch submission script.
python first_script.py
If your workload requires a GPU, request one with the --gres directive.
#SBATCH --gres=gpu:1 # 1 GPU requested
If you need a specific GPU in the cluster, you can specify it directly.
#SBATCH --gres=gpu:V100:1 # 1 V100 GPU (32 GB VRAM)
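To verify which GPU was actually allocated, you can add a diagnostic command to the job script (assuming the NVIDIA driver utilities are installed on the GPU nodes):
nvidia-smi # print the allocated GPU model, driver version, and memory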
ARCTIC uses different storage subsystems to store data. The main medium-term storage is based on SUSE Enterprise Storage servers. It provides a flexible, reliable, cost-efficient, and intelligent storage solution powered by Ceph. The Ceph storage is presented to users through iRODS volumes. iRODS provides a centralized data management solution. Please visit the iRODS web page for more information.
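If the iRODS iCommands client is available in your environment, data can also be browsed and transferred from the command line (the file names below are examples only):
ils # list the contents of your current iRODS collection
iput results.txt # upload a local file to iRODS
iget test_python.py # download a file from iRODS to the current directory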
ARCTIC resources also provide a high-performance scratch space based on the BeeGFS parallel file system. This space is only intended to be used as temporary file storage for currently running jobs and should not be used as permanent storage. All files stored in this location are cleaned after job completion.
However, this adds an extra step to the job submission process: you have to move files from iRODS into the scratch space before the job begins, perform the computation, and then, at the end of the computation, transfer the files back to iRODS for long-term storage. Please refer to the iRODS documentation for more details.
Finally, putting it all together, the job submission looks as follows: first the Python workload (test_python.py), then the job submission script (jobsubmit.sh).
# test_python.py
"""
This Python script creates a file named results.txt in the directory where test_python.py is run and writes the text "This is python output" into it.
"""
# write the output text to results.txt; the file is closed automatically
with open("results.txt", mode="w") as f:
    f.write("This is python output")
#!/bin/bash
#SBATCH --nodes=1 # Run all processes on a single node
#SBATCH --ntasks=1 # Run a single task
#SBATCH --cpus-per-task=4 # Number of CPU cores per task
#SBATCH --mem=600mb # Total memory limit
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=email@gsu.edu # Where to send mail
#SBATCH --account=name # Project name (RS00000, ECON0001, MAT0001 ...)
#SBATCH --partition=partition # partition requested (qBF, qTRD , qECON ...)
#SBATCH --output=output_%j.txt # name of the output file; %j is replaced by the job ID
#SBATCH --error=error_%j.txt # name of the error file; %j is replaced by the job ID
# create the scratch working directory and move into it
mkdir -p $SCRATCH
cd $SCRATCH
# copying test_python.py from irods projects directory to /scratch directory
cp $IRODS_PROJECT/UNIV1S16/test_python.py $SCRATCH
module purge
module load python
python3 test_python.py
# copying output(results.txt) to the irods projects directory
cp results.txt $IRODS_PROJECT/UNIV1S16/
# clean up the temporary files in the scratch space
rm -rf $SCRATCH
Now we are ready to submit the workload to the cluster. The sbatch command is used to submit the prepared job submission file to the cluster.
sbatch jobsubmit.sh
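After submission, sbatch prints the assigned job ID, and the job can be monitored with standard Slurm commands (replace <jobid> with that ID):
squeue -j <jobid> # check the current state of this job
sacct -j <jobid> # show accounting information for the job
scancel <jobid> # cancel the job if it is no longer needed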