About: Wooki is a high-performance computing cluster containing more than 2400 CPU cores plus GPUs. It was ‘rebuilt’ in January 2021 using compute nodes from an older iteration of the cluster, and newer hardware has since been added. This page is a quick guide to using Wooki.
Location: The cluster is currently available only within the uOttawa network at: wooki.chem.uolocal. You will need to use an SSH program such as MobaXterm or PuTTY to connect to it. For off-campus access you need to use the University’s VPN.
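For example, from a terminal (or the command line of MobaXterm) you can connect and copy files with commands along these lines, where ‘username’ is a placeholder for your Wooki account name:

ssh username@wooki.chem.uolocal                       # log in to the head node
scp results.tar.gz username@wooki.chem.uolocal:~/     # copy a file into your /home (file name is illustrative)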
Obtaining Access to the Cluster
If you would like an account on the new cluster, please email twoo@uottawa.ca with the following information:
1) your name
2) what group you are part of (e.g. the Chica lab)
3) if you had an account on old Wooki, your old Wooki username
4) what software packages you wish to use
5) the credit card number you want us to charge
Interacting with the Head Node:
The head node is intended for managing jobs and files, not for heavy computing, since it runs the queue and many other services, including the interactive sessions of all users. It is, however, a very powerful machine, and for tasks that take less than 15 minutes and use only one or two CPUs, it is fine to run them interactively. Anything longer than that is subject to being killed without notice.
Any longer or “heavier” job must be submitted to the compute hosts via the queuing system (see below), which manages the available resources and assigns them to waiting jobs.
Data Storage
Wooki has two distinct data storage areas, /home and /share_scratch:
- /share_scratch/username is the user’s temporary working space. This is network storage run on a dedicated file server with no limits on use. You are expected to run the majority of your work from here; it is the most efficient place to work and keeps the head node responsive. This file system is network mounted on every compute node. NOTE: /share_scratch is not backed up, although it is run from a redundant disk array, so several hard drives would need to fail at the same time before data is lost. You can store data long term on /share_scratch.
- /home/username should be used for permanent storage, lighter processing of files, and transfers to and from the cluster. This file system is physically on the head node of Wooki, but it is network mounted on all compute nodes as well. Use this space for data that you will process interactively on the head node, as this reduces network traffic. If you store the results of your calculations on /home, please take care to exclude large intermediate files, such as wave function files, that you typically don’t need or can easily regenerate. /home is incrementally backed up twice a week. Files that are deleted on /home are not deleted on the backup, but files that are modified on /home are also modified on the backup.
- Note that files on /home and /share_scratch are compressed as they are written, so there is no need to compress your files.
- Each compute node also has a local drive mounted at /local_scratch. If your compute jobs are disk-IO heavy, you should consider running them on /local_scratch. However, your script needs to copy the files back to /home or /share_scratch after the job is complete (see the sketch below).
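As a rough sketch of the /local_scratch workflow (the directory and file names below are placeholders, not a required layout), the relevant part of a job script might look like this:

# inside a Slurm job script: stage files on the node's local disk
WORKDIR=/local_scratch/$USER/$SLURM_JOB_ID
mkdir -p $WORKDIR
cp INCAR POSCAR $WORKDIR        # placeholder input files
cd $WORKDIR
# ... run the calculation here ...
# copy the results back to network storage before the job ends
cp -r $WORKDIR /share_scratch/$USER/myjob_results   # placeholder destination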
Software
The CentOS 8.2 Linux distribution is installed on all nodes, using the OpenHPC clustering software. The compute nodes run CentOS in diskless mode: Linux is loaded from the head node and stored in memory at boot.
We use the Linux environment modules system to load software. Use the following command to see what is available:
module avail
For some of the packages, like VASP, scripts to submit to the queueing system have been written. They are typically called ‘packagename-submit’; e.g., for VASP the submit script is ‘vasp-submit’. Use vasp-submit -h to get details on how to use it and the options available.
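For example (the exact module names may differ from what is shown here; check ‘module avail’ first):

module avail          # list the software modules that are installed
module load vasp      # load the VASP environment (module name assumed)
vasp-submit -h        # show the options of the prewritten submit script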
Python
Python 3.6 is the default version that CentOS uses; it is invoked using ‘python3’, not ‘python’. Anaconda Python environments have also been installed with Python 2.7 and 3.8. These can be loaded with ‘module load python/{version}’, e.g. ‘module load python/2.7’.
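For example, to switch to one of the Anaconda environments (check ‘module avail’ for the exact module names currently installed):

module load python/3.8    # load the Anaconda Python 3.8 environment
python --version          # confirm which interpreter is now on your PATH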
The Queuing System
We are using the ‘Slurm’ queueing system. This is the same queueing system that is used on the DRAC/Compute Canada systems.
Partitions available (Slurm uses the term ‘partition’ instead of ‘queue’):
- General – this is a general CPU partition. Most jobs should be submitted to this queue.
- gpu – currently one node (gpu_1) has a GPU for machine learning. Use this queue to access it (see below for instructions).
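The partition is chosen at submission time; as a sketch (the script names are placeholders):

sbatch --partition=General job.sh                  # send a CPU job to the General partition
sbatch --partition=gpu --gres=gpu:1 gpu_job.sh     # send a job to the GPU node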
Useful commands (see the individual help or man pages for more details):
- squeue – prints jobs in the queue
- scancel – kills jobs
- qstat – an alias we wrote that prints out more useful information than the default squeue command. Use the ‘-a’ option to see all jobs instead of just your own.
- sinfo – prints out info on the queues/partitions
- sbatch – submits a script to the queue. See below for more information about job submission.
- nodeinfo – prints out detailed information about the nodes
- wookistat – gives an overview of current cluster usage
- cluster_stat – gives a different overview of the cluster usage
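Typical usage looks like the following (the job ID is a placeholder):

squeue -u $USER    # show only your own jobs
qstat -a           # the local alias, showing all users’ jobs
sinfo              # state of the queues/partitions
scancel 123456     # kill job 123456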
Currently, the ‘General’ queue/partition has a maximum run time of 14 days for jobs; if you don’t specify a run time, the default is about 8 hours.
It is important to realize that run times are enforced: a job will be killed by the queuing system once it goes past its requested run time. Times are enforced to allow for more efficient scheduling, so it is a good habit to specify a realistic run time with every submission, since the defaults of a given submission script may be too long or too short.
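Run times can be given either in the job script or on the sbatch command line using the standard Slurm time formats; the values below are only illustrative:

#SBATCH --time=2-00:00         # 2 days, inside a job script (D-HH:MM)
sbatch --time=36:00:00 job.sh  # 36 hours, given on the command line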
If you have a job that desperately needs to have its time extended please contact Woo.
Please also specify accurate memory requirements for your job. Don’t ask for 4 GB of memory if the job uses less than 1 GB. The problem with requesting more memory than you need is that it can use up all the available memory on a node, which means the queue will not assign any more jobs to that node even though many CPU cores are still available.
You can check the memory usage of a completed job using the ‘seff’ command: seff <jobid>
Note that the job has to be finished for this to report anything other than 0.
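As a sketch (the memory value and job ID are placeholders):

#SBATCH --mem=2G        # request 2 GB for the job, inside the job script
sbatch --mem=2G job.sh  # or give the request on the command line
seff 123456             # after the job finishes, see how much memory it actually used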
Prewritten Submission Scripts:
Submission scripts for some packages have been written, similar to those on old Wooki. When you module load a package, like vasp, you can use the corresponding submission script, like ‘vasp-submit’ or ‘gaussian-submit’. They generally work such that the first argument is the job name or input name, and the optional second argument is the number of CPUs. For example:
vasp-submit myjob 4 -t 1-5:30 --memory 6G
The above will submit a VASP job called ‘myjob’ with 4 CPUs, request a minimum of 6G of memory, and run for 1 day, 5 hours, and 30 minutes. Many of these scripts will copy your job to the node’s local disk in the directory /local_scratch/{your username}/{slurm job#}/. When the job is complete, the script will copy the contents of that run directory back into the directory where you submitted the job. Note: if your job is cancelled or terminated by the queue because it ran out of time, the contents of the /local_scratch directory above will not be copied back. However, that directory will still be available on the node it ran on (you can ssh to the node). The location and node of the job should be written at the beginning of the slurm-{job #}.out file.
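If you need to recover such files by hand, you can log in to the compute node and copy them back, roughly as follows (the node name and job number are placeholders; the real ones are written at the top of the slurm-{job #}.out file):

ssh cn042                                   # placeholder node name
mkdir -p /share_scratch/$USER/recovered
cp -r /local_scratch/$USER/123456/* /share_scratch/$USER/recovered/
exit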
Use the ‘-h’ or ‘--help’ option to see all the options available for a particular submit script.
IMPORTANT NOTE: Default time limits for the submit scripts are typically short. So you may have to change these. Please try to give realistic job times to make the queueing more efficient.
Custom Submission Scripts
You can also write your own submission scripts. Below is a sample shell script that can be submitted to the queue with the command ‘sbatch script_name’. If your job uses a lot of IO, think about using the ‘/local_scratch’ of the compute nodes. The queue keywords at the beginning that start with ‘#SBATCH’ can also be given as arguments to the sbatch call, e.g. sbatch --time=500 script_name. See the Slurm manual for all options.
#!/bin/bash
#SBATCH --partition=General
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --job-name=REPEAT
#SBATCH --time=120

# The above are keywords that the queueing system reads.
# This script will look in ALL directories in the current folder and
# run REPEAT in it.

# the LOCPOT to cube script runs in python 2
module load python/2.7

REPEAT_EXE=/opt/ohpc/pub/repeat/bin/repeat.x
VASP2CUBE=/opt/ohpc/pub/repeat/bin/vasp_to_cube.py
REPEAT_INPUT=/opt/ohpc/pub/repeat/bin/REPEAT_param.inp

export OMP_NUM_THREADS=3

for file in `ls | shuf`
do
    # check if item is a directory
    if [ -d $file ] ; then
        # if the 'queued' file exists then skip
        if [ -f $file/queued ] ; then
            echo $file 'skipping because queued file found'
        else
            cd $file
            touch queued
            cp $REPEAT_INPUT REPEAT_param.inp
            $VASP2CUBE          # this converts the LOCPOT to cube file
            # Run the repeat code
            # important: use the srun command when executing parallel tasks.
            srun $REPEAT_EXE > repeat.output
            echo "REPEAT completed"
            # clean up
            rm queued *.dat mof.cube REPEAT_param.inp
            cd ..
        fi
    fi
done
echo 'done'
Example for GPU jobs:
#!/bin/bash
#SBATCH --gres=gpu:1        # Number of GPUs (per node)
#SBATCH --partition=gpu
#SBATCH --time=0-03:00      # time (DD-HH:MM)
#SBATCH --nodes=1           # Number of nodes to run on
#SBATCH --cpus-per-task=1   # Number of cpus
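The header above only requests the resources; the body of the script still has to load your software and run the program. A minimal sketch, assuming a Python-based machine learning job (the module name and script are placeholders):

module load python/3.8     # placeholder; load whichever environment your code needs
srun python train.py       # placeholder training script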