
New Wooki Cluster

About: New Wooki is the computer cluster built in January 2021 to replace the old Wooki cluster. The head node is completely new, and most of old Wooki's compute nodes have been migrated to the new cluster.

Location: Currently the cluster is available only within the uOttawa network at: wooki.chem.uolocal. You will need to use an SSH program such as MobaXterm or PuTTY. For off-campus access you need to use the University's VPN.
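For example, from a terminal-based SSH client the connection looks something like the following (MobaXterm and PuTTY expose the same settings through their GUIs; 'your_username' is a placeholder for your Wooki account name):

# connect to the head node from the uOttawa network or over the VPN
ssh your_username@wooki.chem.uolocal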

Obtaining Access to the Cluster

If you would like an account on the new cluster, please email twoo@uottawa.ca with the following information:

1) your name
2) the group you are part of (e.g., the Chica lab)
3) if you had an account on old Wooki, your old Wooki username
4) the software packages you wish to use
5) the credit card number you want us to charge

Interacting with the Head Node:

The head node is intended for managing jobs and files, not for heavy computing, since it runs the queue and many other services, including the interactive sessions of all users. It is, however, a very powerful machine, and tasks that need only one or two CPUs and less than 15 minutes to complete can be run interactively. Anything longer than that is subject to being killed without notice.

Any longer or “heavier” job must be submitted to the compute hosts via the queuing system (see below), which manages the available resources and assigns them to waiting jobs.

DATA STORAGE

New Wooki has two distinct data storage areas, home and share_scratch:

    • /share_scratch/username is the user's working space. This is fast network storage with no limits on use. You are expected to run the majority of your work from here; it is the most efficient place to work and keeps the cluster and frontend responsive. This space is not backed up.

    • /home/username should be used for permanent storage, lighter processing of files, and transfers to and from the cluster. This space is regularly backed up. Once your calculations are complete, move your results here, taking care to exclude the large intermediate and scratch files (see the copy example at the end of this section). Use this space for data that you will process interactively on the frontend, as this reduces network traffic.

If you are looking for your files from Old Wooki, they are available on:

/Backup/Old_Wooki

If you want your files from old Wooki’s home on new Wooki’s home, you need to copy them manually.  Please note that many scripts (and your .bashrc) from old Wooki may not work properly on new Wooki.

    • Each compute node also has a local HDD or SSD mounted at /local_scratch.
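As an example, once a run is finished you could move its results from share_scratch to home while skipping the large intermediate files, along these lines (the job directory name and the excluded patterns below are placeholders; adjust them to your own outputs):

# copy the finished job to backed-up storage, excluding large scratch files (patterns are examples only)
rsync -av --exclude='*.tmp' --exclude='WAVECAR' \
    /share_scratch/$USER/my_job/ /home/$USER/my_job/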

 

Software

The CentOS 8.2 Linux distribution is installed on all nodes, using the OpenHPC clustering software. The compute nodes run CentOS in diskless mode, such that Linux is loaded from the head node and stored in memory at boot.

We are using the Linux environment modules system to manage software. Use the following command to see what is available:

module avail
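Other useful module commands (standard environment-modules commands; the package name 'vasp' below is just an example):

module load vasp      # add a package to your environment
module list           # show the modules you currently have loaded
module unload vasp    # remove a package from your environment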

For some of the packages, like VASP, scripts for submitting to the queueing system have been written that work very similarly to those on old Wooki, e.g. 'vasp-submit'.

Python

Python 3.6 is the default version that CentOS provides; it is invoked using 'python3', not 'python'. The Anaconda Python environments have been installed with Python 2.7 and 3.8. These can be loaded with 'module load python/{version}', e.g. 'module load python/2.7'.
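For example, to switch to one of the Anaconda environments:

module load python/3.8    # load the Anaconda-provided Python 3.8
python --version          # 'python' should now point to the loaded interpreter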

The Queuing System

We are using the ‘Slurm’ queueing system. This is the same queueing system that is used on the Compute Canada systems, but different from what was used on the old Wooki cluster. Most of the Slurm commands start with ‘s’, like ‘squeue’ instead of ‘qstat’, but some aliases have been created for old Wooki commands, like ‘qstat’. One important difference is that Slurm uses the term ‘partitions’ instead of ‘queues’.

NOTE: It is important to realize that run times on jobs are enforced: the queuing system will kill a job once it goes past its requested run time. Times are enforced to allow more efficient scheduling, so please specify realistic job times when submitting; the default for a given submission script may be short. The queue currently has a 14-day maximum run time, and if you do not specify a run time the default is about 8 hours. If a running job desperately needs to have its time extended, please contact Woo.
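For example, to request a 2-day limit when submitting a script of your own (the script name is a placeholder; the prewritten submit scripts take the same information through their '-t' option, as shown below):

sbatch --time=2-00:00 my_job.sh    # request 2 days, in D-HH:MM format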

Queues/Partitions available:

    1. General – this is a general CPU partition. Most jobs should be submitted to this queue.
    2. gpu – currently one node (gpu_1) has a GPU for machine learning. Use this queue to access it (see below for instructions).

Useful commands (see individual help or man pages for more details):

    • squeue – prints jobs in the queue.
    • scancel – kills jobs.
    • qstat – alias for squeue that prints more useful information than the default squeue command. Use the '-a' option to see all jobs.
    • sinfo – prints info on the queues/partitions.
    • sbatch – submits a script to the queue, similar to qsub.
    • nodeinfo – prints detailed information about the nodes.
    • wookistat – gives an overview of current cluster usage.
    • cluster_stat – gives a different overview of the cluster usage.
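A few typical invocations (the job ID below is a placeholder):

squeue -u $USER    # show only your own jobs
qstat -a           # show all jobs, via the squeue alias described above
scancel 12345      # cancel job number 12345
sinfo              # summary of the partitions and node states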

Currently, the ‘General’ queue/partition has a maximum 14 day time limit for jobs.
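You can check the configured limits yourself with sinfo's formatting options, for example:

sinfo -o "%P %l"    # print each partition and its maximum time limit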

Prewritten Submission Scripts

Submission scripts for some packages have been written, similar to those on old Wooki. When you module load a package, like vasp, you can use the corresponding submit script, like 'vasp-submit' or 'gaussian-submit'. They generally work such that the first argument is the job name or input name, and the second, optional argument is the number of CPUs. For example:

vasp-submit  myjob 4 -t 1-5:30 --memory 6G

The above will submit a VASP job called 'myjob' on 4 CPUs, request a minimum of 6G of memory, and set a maximum run time of 1 day, 5 hours, and 30 minutes. Many of these scripts will copy your job to the node's local disk in the directory /local_scratch/{your username}/{slurm job#}/. When the job is complete, the contents of that run directory are copied back into the directory from which you submitted the job. Note: if your job is cancelled or terminated by the queue because it ran out of time, the contents of the /local_scratch directory will not be copied back. However, that directory will still be available on the node the job ran on (one can ssh to the node). The location and node of the job are written at the beginning of the slurm-{job #}.out file.
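If you do need to recover files left in /local_scratch, something along these lines should work (the node name and job number below are placeholders; the real values are printed at the top of the slurm-{job #}.out file):

ssh node_name                                            # log in to the node the job ran on
cp -r /local_scratch/$USER/12345 /share_scratch/$USER/   # 12345 = the Slurm job number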

Use the '-h' or '--help' option to see all the options available for a particular submit script.

IMPORTANT NOTE: Default time limits for the submit scripts are typically short, so you may have to change them. Please give realistic job times to make the queueing more efficient.

Custom Submission Scripts

You can also write your own submission scripts. Below is a sample shell script that can be submitted to the queue with the command 'sbatch script_name'. If your job does a lot of I/O, consider using the '/local_scratch' space on the compute nodes. The queue keywords at the beginning that start with '#SBATCH' can also be given as arguments in the sbatch call, e.g. 'sbatch --time=500 script_name'. See the Slurm manual for all options.

#!/bin/bash
#SBATCH --partition=General
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --job-name=REPEAT
#SBATCH --time=120

# The above are keywords that the queueing system reads.

# This script will look in ALL directories in the current folder and
# run REPEAT in it.

# the locpot to cube script runs in python 2
module load python/2.7

REPEAT_EXE=/opt/ohpc/pub/repeat/bin/repeat.x
VASP2CUBE=/opt/ohpc/pub/repeat/bin/vasp_to_cube.py
REPEAT_INPUT=/opt/ohpc/pub/repeat/bin/REPEAT_param.inp
export OMP_NUM_THREADS=3

for file in $(ls | shuf)
do
    # check if item is a directory
    if [ -d "$file" ] ; then

        # if the 'queued' file exists then skip
        if [ -f "$file/queued" ] ; then
            echo "$file: skipping because queued file found"
        else
            cd "$file"

            touch queued
            cp $REPEAT_INPUT REPEAT_param.inp
            $VASP2CUBE   #this converts the LOCPOT to cube file

            # Run the repeat code
            # important use the srun command when executing parallel tasks.
            srun $REPEAT_EXE > repeat.output
            echo "REPEAT completed"

            #clean up
            rm queued *.dat mof.cube REPEAT_param.inp

            cd ..
        fi
    fi
done

echo 'done '
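Assuming the script above is saved as repeat_all.sh (the name is arbitrary), it would be submitted and monitored like this:

sbatch repeat_all.sh    # prints the assigned job number
squeue -u $USER         # check on its progress in the queue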

Example for GPU jobs:

#!/bin/bash
#SBATCH --gres=gpu:1          # Number of GPUs (per node)
#SBATCH --partition=gpu
#SBATCH --time=0-03:00        # time (DD-HH:MM)
#SBATCH --nodes=1             # Number of nodes to run on
#SBATCH --cpus-per-task=1     # Number of cpus
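The header above only requests the resources; the body of the script then loads whatever software you need and launches the work on the GPU. A minimal sketch (the module and script names here are placeholders, not specific Wooki package names):

module load python/3.8            # load your machine learning environment

nvidia-smi                        # optional: confirm the allocated GPU is visible
srun python train_model.py        # launch the job; train_model.py is a placeholder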
