New Wooki Cluster

About: The new Wooki cluster was built in January 2021 to replace the old Wooki cluster. The head node is completely new, and most of old Wooki's compute nodes have been migrated to the new cluster.

Digital Location: Currently the cluster is available only from within the uOttawa network at: chewie.woolab.uolocal. You will need to connect with an SSH program such as MobaXterm or PuTTY.
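For example, from a terminal (MobaXterm and PuTTY offer equivalent graphical login dialogs), a connection would look something like the following, where 'jsmith' is only a placeholder for your own cluster username:

# replace 'jsmith' with your own username
ssh jsmith@chewie.woolab.uolocal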

Obtaining Access to the Cluster

If you would like an account on the new cluster, please email twoo@uottawa.ca with the following information:
1) your name
2) what group you are part of (e.g. the Chica lab)
3) if you had an account on old Wooki, your old Wooki username
4) what software packages you wish to use.
5) the credit card number you want us to charge

Hardware

Head node:
32-core AMD EPYC CPU, 2.2 GHz base clock speed, 128 GB memory
HPE ProLiant DL385 with 24 2.5" SSD drive bays
base file system: 500 GB NVMe M.2 SSD
home file system: 4 TB mirrored SSD (running ZFS)

Compute nodes:
nodes e1 to e8 (e for EPYC)
24-core AMD EPYC CPU, 2.8 GHz base clock speed, 64-128 GB memory
HPE ProLiant DL325
/local_scratch file system: 250 GB NVMe

Network:
1 Gb Ethernet switch to all nodes.
10 Gb Ethernet link to the head node.

Interacting with the Head Node:

The head node is intended for managing jobs and files, not for heavy computing, since it runs the queue and many other services, including the interactive sessions of all users. It is, however, a very powerful machine, and tasks that need only one or two CPUs and less than 15 minutes to complete may be run interactively. Anything longer than that is subject to being killed without notice.

Any longer or “heavier” job must be submitted to the compute hosts via the queuing system (see below), which manages the available resources and assigns them to waiting jobs.

DATA STORAGE

New Wooki has two distinct data storage areas, /home and /share_scratch:

    • /share_scratch/username is the user's working space. This is fast network storage with no limits on use. You are expected to run the majority of your work from here. It is the most efficient place to work and keeps the cluster and frontend responsive. This space is not backed up.

    • /home/username should be used for permanent storage, for less intensive processing of files, and for transfers to and from the cluster. This space is rigorously backed up. Once your calculations are complete, you should move your results here, taking care to exclude large intermediate and scratch files (an example is given at the end of this section). You should also use this space for data that you will process interactively on the frontend, as it reduces network traffic.

If you are looking for your files from old Wooki, they are available at:
/share_scratch/Old_wooki_home/username
/share_scratch/Old_wooki_scratch/username
If you want your files from old Wooki’s home on new Wooki’s home, you need to copy them manually.  Please note that many scripts (and your .bashrc) from old Wooki may not work properly on new Wooki.

    • Each compute node also has a local HDD or SSD mounted at: /local_scratch.
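As an example of the intended workflow, once a calculation in /share_scratch is finished you might copy the results to /home while skipping the large intermediate files. The directory and file names below are placeholders only; adjust them for your own jobs and software package:

# copy finished results to permanent storage, excluding large intermediate files
# ('myjob', WAVECAR and CHGCAR are only examples)
rsync -av --exclude='WAVECAR' --exclude='CHGCAR' /share_scratch/$USER/myjob/ /home/$USER/myjob/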

Software

The CentOS 8.2 Linux distribution is installed on all nodes, using the OpenHPC clustering software. The compute nodes run CentOS in diskless mode: Linux is loaded from the head node and stored in memory at boot.

We are using the Linux environment modules system to load software. Use the following command to see what is available:

module avail
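For example, to load a package and check what is currently loaded (the 'vasp' name below is only an illustration; use a name that actually appears in the 'module avail' output):

module avail          # list available software
module load vasp      # load a package (example name)
module list           # show currently loaded modules
module unload vasp    # remove it again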

For some packages, like VASP, scripts that submit to the queueing system have been written and work very similarly to those on old Wooki, e.g. 'vasp-submit'.

Python

Python 3.6 is the default version that CentOS uses; it is invoked using 'python3', not 'python'. Anaconda Python environments have also been installed for Python 2.7 and 3.8. These can be loaded with 'module load python/{version}', e.g. 'module load python/2.7'.
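For example (the version reported by the Anaconda environment may differ slightly from what the comments suggest):

python3 --version          # the system Python 3.6
module load python/3.8     # load the Anaconda Python 3.8 environment
python --version           # should now report the Anaconda interpreter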

The Queuing System

We are using the ‘Slurm’ queueing system. This is the same queueing system that is used on the Compute Canada systems, but different from what was used on the old Wooki cluster. Most of the Slurm commands start with ‘s’, like ‘squeue’ instead of ‘qstat’, but some aliases have been created for familiar old Wooki commands, like ‘qstat’. One important difference is that Slurm uses the term ‘partitions’ instead of queues.

NOTE: Run times on jobs are enforced: a job will be killed by the queuing system once it goes past its requested run time. Times are enforced to allow for more efficient scheduling, so please specify accurate job times when submitting; the defaults for a given submission script may be short (an example of setting a time limit follows). Additionally, the queue currently has a 7 day maximum run time for any job. If you have a job that desperately needs its time extended, please contact Woo.
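For example, a time limit can be requested at submission time (the script name 'job.sh' is a placeholder):

# request 2 days and 12 hours for job.sh (Slurm time format is days-hours:minutes:seconds)
sbatch --time=2-12:00:00 job.sh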

Useful commands (see individual help or man pages for more details):

    • squeue – prints jobs in the queue
    • scancel – kills jobs
    • qstat – alias for squeue that prints more useful information than the default squeue command. Use the '-a' option to see all jobs.
    • sinfo – prints info on the queues/partitions
    • sbatch – submits a script to the queue, similar to qsub
    • nodeinfo – prints detailed information about the nodes
    • wookistat – gives an overview of current cluster usage

Currently, the ‘General’ queue has a maximum 7 day time limit for jobs.
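For example, a typical check on your own jobs might look like this (the job number given to scancel is of course just an illustration):

squeue -u $USER    # show only your own jobs
qstat -a           # all jobs, via the old-Wooki-style alias
sinfo              # state of the partitions and nodes
scancel 12345      # cancel job 12345 (example job number)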

Prewritten Submission Scripts

Submission scripts for some packages have been written, similar to those on old Wooki. When you module load a package, like vasp, you can use the corresponding submit script, like 'vasp-submit' or 'gaussian-submit'. They generally work such that the first argument is the job name or input name, and the second, optional argument is the number of CPUs.

vasp-submit  myjob 4 -t 1-5:30 --memory 6G

The above will submit a VASP job called 'myjob' with 4 CPUs, request a minimum of 6G of memory, and set a time limit of 1 day, 5 hours and 30 minutes. Many of these scripts will copy your job to the node's local disk in the directory /local_scratch/{your username}/{slurm job#}/. When the job is complete, the contents of that run directory are copied back into the directory from which you submitted the job. Note: if your job is cancelled or terminated by the queue because it ran out of time, the contents of the /local_scratch directory will not be copied back. However, that directory will still be available on the node the job ran on, and one can ssh to the node to retrieve it (see the example below). The location and node of the job should be written at the beginning of the slurm-{job #}.out file.
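For example, if job 12345 ran out of time while running on node e3 (both values are hypothetical; read the real ones from the top of the slurm-12345.out file), the leftover files could be recovered with something like:

ssh e3                                                    # log in to the node the job ran on
cp -r /local_scratch/$USER/12345 /share_scratch/$USER/    # copy the run directory back to shared storage
exit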

Use the '-h' or '--help' option to see all the options available for a particular submit script.

Custom Submission Scripts

You can also write your own submission scripts. Below is a sample shell script that can be submitted to the queue with the command 'sbatch script_name'. If your job uses a lot of I/O, think about using the '/local_scratch' of the compute nodes. The queue keywords at the beginning that start with "#SBATCH" can also be given as arguments in the sbatch call, e.g. 'sbatch --time=500 script_name'. See the Slurm manual for all options.

#!/bin/bash
#SBATCH --partition=General
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --job-name=REPEAT
#SBATCH --time=120

# The above are keywords that the queueing system reads.

# This script will look in ALL directories in the current folder and
# run REPEAT in it.

# the locpot to cube script runs in python 2
module load python/2.7

REPEAT_EXE=/opt/ohpc/pub/repeat/bin/repeat.x
VASP2CUBE=/opt/ohpc/pub/repeat/bin/vasp_to_cube.py
REPEAT_INPUT=/opt/ohpc/pub/repeat/bin/REPEAT_param.inp
export OMP_NUM_THREADS=3

for file in `ls | shuf`
do
    # check if item is a directory
    if [ -d "$file" ] ; then

        # if the 'queued' file exists then skip
        if [ -f "$file/queued" ] ; then
            echo "$file: skipping because queued file found"
        else
            cd "$file"

            touch queued
            cp $REPEAT_INPUT REPEAT_param.inp
            $VASP2CUBE   #this converts the LOCPOT to cube file

            # Run the repeat code
            # important use the srun command when executing parallel tasks.
            srun $REPEAT_EXE > repeat.output
            echo "REPEAT completed"

            #clean up
            rm queued *.dat mof.cube REPEAT_param.inp

            cd ..
        fi
    fi
done

echo 'done '
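Assuming the script above is saved as repeat_all.sh (the name is just a placeholder), it could be submitted and checked with:

sbatch repeat_all.sh               # submit with the #SBATCH settings in the script
sbatch --time=500 repeat_all.sh    # same script, overriding the time limit on the command line
squeue -u $USER                    # watch the job in the queue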