PSC Usage
Step-by-step guide to a successful ssh login
- See the step-by-step guide with screenshots (linked in the Misc. resources table).
- Create an account for ACCESS.
    - This account is used for both PSC and Delta.
    - When you create the account, choose "Register with an existing identity". Do not choose "Register without an existing identity".
- Send the username to the allocation managers (e.g., Xuankai) to add you to our group.
    - After this step is done, you should be able to see the list of resources at https://allocations.access-ci.org. To see the list of resources, log in with the identity provider "Carnegie Mellon University".
- Initialise your PSC password (used for ssh login).
    - Go to https://www.psc.edu/resources/bridges-2/user-guide-2-2/ and click "PSC Password Change Utility".
    - It may take a few hours for your username and email to be recognised, even if they are correct.
- Access via ssh.
    - `ssh [username]@[resource_dir]`, e.g., `jjung1@bridges2.psc.edu`
    - Use the password you initialised in the previous step.
Important
- The home directory has limited space. Please do most of your work in ocean storage (`cd ${PROJECT}`).
- When you publish a paper, please acknowledge the PSC and ACCESS. This will benefit us when we apply for PSC credits next time.
    - Example: Experiments of this work used the Bridges2 system at PSC through allocations CIS210014 and IRI120008P from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.
    - Add the following references:

```
@ARTICLE{xsede,
  author   = {J. Towns and T. Cockerill and M. Dahan and I. Foster and K. Gaither and A. Grimshaw and V. Hazlewood and S. Lathrop and D. Lifka and G. D. Peterson and R. Roskies and J. R. Scott and N. Wilkins-Diehr},
  journal  = {Computing in Science \& Engineering},
  title    = {XSEDE: Accelerating Scientific Discovery},
  year     = {2014},
  volume   = {16},
  number   = {5},
  pages    = {62-74},
  keywords = {Knowledge discovery;Scientific computing;Digital systems;Materials engineering;Supercomputers},
  doi      = {10.1109/MCSE.2014.80},
  url      = {doi.ieeecomputersociety.org/10.1109/MCSE.2014.80},
  ISSN     = {1521-9615},
  month    = {Sept.-Oct.}
}

@inproceedings{nystrom2015bridges,
  title     = {Bridges: a uniquely flexible HPC resource for new communities and data analytics},
  author    = {Nystrom, Nicholas A and Levine, Michael J and Roskies, Ralph Z and Scott, J Ray},
  booktitle = {Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure},
  pages     = {1--8},
  year      = {2015}
}
```
Summary of PSC usage and the partitions
- Both PSC allocations (GPU and Regular Memory) have a limited number of service units (SUs).
- `sinfo` lists all the available partitions in PSC and their status (see the example after the table below).
| PSC (Bridges-2) | |
|---|---|
| Partitions | GPU, GPU-shared, RM-small, RM, RM-512, RM-shared |
| GPU resources | GPU, GPU-shared (recommended) |
| CPU resources | RM-small, RM, RM-512, RM-shared (recommended) |
| Default | `RM`, which requests a whole 128-CPU node (**be careful of this case**) |
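For a quick look at partition availability from a login node, the standard `sinfo` invocations below can be used; the format fields in the last command are standard Slurm specifiers, not PSC-specific options.

```
# List all partitions and their current state
sinfo

# Only the GPU-shared partition
sinfo -p GPU-shared

# Custom columns: partition, availability, time limit, node count, node state
sinfo -o "%P %a %l %D %t"
```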
Misc. resources
| PSC (Bridges-2) | |
|---|---|
| User Guide | https://www.psc.edu/resources/bridges-2/user-guide-2/ |
| Connect from browser | https://ondemand.bridges2.psc.edu/ |
| ESPnet installation guide | https://espnet.github.io/espnet/installation.html |
| Step-by-step guide with pictures | https://granite-echidna-ff2.notion.site/Access-for-PSC-07c3d4c05b54426895e3ddc87276e4b5 |
GPU Partitions
- In the GPU / GPU-shared partitions, each node consists of 8 V100 GPU devices.
- There are two types of GPU nodes: `v100-16` and `v100-32`, with 16GB and 32GB of GPU memory respectively.
- Submit jobs to the GPU-shared partition. (Recommended)
    - Use `-p GPU-shared --gpus=type:n` in `sbatch` or `srun`. Here `type` can be `v100-16` or `v100-32`, and `n` can range from 1 to 4 (1 is recommended).
- Submit jobs to the GPU partition.
    - Please use it only when necessary.
    - Use `-p GPU` in `sbatch` or `srun`. It requests a whole GPU node (8 GPUs) for each job.
    - In this case, 8 SUs are deducted from our team's GPU allocation for every hour the job runs.
- Good practice: users are strongly recommended to use the GPU-shared partition and allocate only 1 GPU per job (see the example sbatch script after this list).
    - When allocating multiple GPUs, you usually need to wait much longer.
    - Single-GPU jobs are more efficient than multi-GPU ones; the latter incur communication overhead.
    - Multi-GPU jobs are allowed only when (1) the user is already familiar with the cluster and (2) the project really needs that resource. Beginners are strongly discouraged from trying this option.
    - Please avoid using the GPU partition by mistake: it leaves the allocated GPUs idle and wastes a lot of our resources.
    - We will periodically check each user's usage and send a reminder when necessary.
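As a reference for the recommended pattern, here is a minimal single-GPU batch-script sketch; the job name, time limit, GPU type, and training command are placeholders to adapt.

```
#!/usr/bin/env bash
#SBATCH -p GPU-shared          # shared GPU partition (recommended)
#SBATCH --gpus=v100-16:1       # 1 GPU; switch to v100-32 if you need 32GB of GPU memory
#SBATCH -t 24:00:00            # walltime (PSC caps each job at 2 days)
#SBATCH -J my_gpu_job          # placeholder job name

# Placeholder payload: replace with your actual training command.
python train.py
```

Save it as, e.g., `my_gpu_job.sh` (hypothetical name) and submit with `sbatch my_gpu_job.sh`; with 1 GPU this draws roughly 1 SU from the GPU allocation per hour.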
RM Partitions (For CPU Jobs)
- In the RM and RM-512 partitions, each node consists of 128 cores.
- Nodes in the RM and RM-shared partitions have 128GB of memory, while nodes in the RM-512 partition have 512GB.
- Using an entire node for an hour deducts 128 SUs from our team's Regular Memory allocation.
- Submit jobs to the RM-shared partition. (Suggested)
    - Use `-p RM-shared --ntasks-per-node=n --mem=2000M` in `sbatch` or `srun`. Here `n` can range from 1 to 64.
    - Usually, jobs only require a few CPU cores.
- Submit jobs to the RM partition.
    - Please use it only when necessary.
    - It requests all 128 CPU cores of a node.
    - Use `-p RM --ntasks-per-node=n` in `sbatch` or `srun`. Here `n` can range from 1 to 64.
- Good practice: users are strongly recommended to use the RM-shared partition (see the example sbatch script after this list).
    - You can adjust the memory allocation of each CPU job, but bear in mind that CPU jobs are also charged per 2000M of memory allocated. E.g., a CPU core with 4000M of memory running for 1 hour is charged 2 SUs.
    - Usage of the RM or RM-512 partitions is allowed only when (1) the user is already familiar with the cluster and (2) the project really needs that resource. Beginners are strongly discouraged from trying this option.
    - Please avoid using the RM or RM-512 partitions by mistake: it leaves the allocated CPUs idle and wastes a lot of our resources.
    - We will periodically check each user's usage and send a reminder when necessary.
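Similarly, a minimal CPU-only batch-script sketch for the RM-shared partition; the core count, memory, time limit, and command are placeholders.

```
#!/usr/bin/env bash
#SBATCH -p RM-shared           # shared CPU partition (suggested)
#SBATCH --ntasks-per-node=4    # 4 CPU cores (1-64 allowed on RM-shared)
#SBATCH --mem=8000M            # 2000M per core keeps the charge at 1 SU per core-hour
#SBATCH -t 12:00:00            # walltime (placeholder)

# Placeholder payload: replace with your actual CPU job.
python prepare_data.py
```

At 2000M per core this corresponds to 1 SU per core-hour, i.e. 4 SUs per hour for this sketch.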
Other usage
- Data copying / file transfer
    - We suggest using `data.bridges2.psc.edu` as the machine name and doing file transfers with `rsync`, `sftp`, `scp`, etc. (see the PSC "Transferring Files" documentation). Below is an example with `scp`; an `rsync` sketch follows after it.

```
# This requires the enrollment of Two-Factor Authentication (TFA)
scp -P 2222 myfile XSEDE-username@data.bridges2.psc.edu:/path/to/file
```
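For large or repeated transfers, an `rsync` variant of the same transfer can be more convenient, since interrupted copies can resume. This is a sketch mirroring the `scp` example above; the local and destination paths are placeholders.

```
# Same TFA requirement and port as the scp example; -avP = archive, verbose, keep partial files and show progress
rsync -avP -e "ssh -p 2222" ./local_dir/ XSEDE-username@data.bridges2.psc.edu:/path/to/dest/
```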
- Submitting jobs with a dependency
    - This can be used to submit a job that should start only after specific jobs finish. In many cases, training a model can take a few days, but PSC restricts each job to at most 2 days of runtime. In this case, we can chain long jobs using dependencies (a sketch for chaining more segments follows below). For example, if you have already started a job with ID 000001 and want a follow-up job to run right after it, submit:

```
sbatch --time 2-0:00:00 --dependency=afterany:000001 run.sh
```
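If a run needs more than two 2-day segments, a small loop can chain the submissions. `--parsable` is a standard `sbatch` flag that prints only the job ID; `run.sh` and the segment count are placeholders.

```
# Submit the first 2-day segment and capture its job ID
jid=$(sbatch --parsable --time 2-0:00:00 run.sh)

# Queue three more segments, each starting after the previous one ends (for any reason)
for i in 1 2 3; do
    jid=$(sbatch --parsable --time 2-0:00:00 --dependency=afterany:${jid} run.sh)
done
echo "Last queued job: ${jid}"
```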
- Common arguments in sbatch / srun
```
-p, --partition=partition               partition requested
-J, --job-name=jobname                  name of job
-t, --time=time                         time limit
--gres=rsrc_name[:rsrc_type]:rsrc_num   required generic resources
-c, --cpus-per-task=ncpus               number of cpus required per task
-d, --dependency=type:jobid[:time]      defer job until condition on jobid is satisfied
-e, --error=err                         file for batch script's standard error
-o, --output=out                        file for batch script's standard output
--mem-per-cpu=MB                        maximum amount of real memory per allocated cpu
--ntasks-per-node=n                     number of tasks to invoke on each node
--reservation=name                      allocate resources from named reservation, e.g. GPUcis210027
```
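To illustrate how these options combine, here is a hedged one-line example; the job name, script name, log paths, and time limit are placeholders (and the `logs/` directory must exist before submission).

```
# 1 GPU on GPU-shared, 24h limit, custom job name, log files named after the job ID
sbatch -J an4_train -t 24:00:00 -p GPU-shared --gpus=v100-16:1 \
    -o logs/%j.out -e logs/%j.err run.sh
```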
- View tools
- slurm commands
```
# view jobs in the queue
squeue -u ${username}

# view detailed job info
scontrol show jobid -d ${jobid}

# view job history and billing info, e.g. since time 04/22/2022 12am.
sacct -u ${username} -S 2022-04-22T00:00:00 --format=JobID,jobname,user,elapsed,nnodes,alloccpus,state,partition,nodelist,AllocTRES%50,CPUTime
```
- PSC provides `slurm-tool`

```
# Show or watch job queue:
slurm-tool [watch] queue      show own jobs
slurm-tool [watch] q <user>   show user's jobs
slurm-tool [watch] quick      show quick overview of own jobs
slurm-tool [watch] shorter    sort and compact entire queue by job size
slurm-tool [watch] short      sort and compact entire queue by priority
slurm-tool [watch] full       show everything
slurm-tool [w] [q|qq|ss|s|f]  shorthands for above!
slurm-tool qos                show job service classes
slurm-tool top [queue|all]    show summary of active users

# Show detailed information about jobs:
slurm-tool prio [all|short]   show priority components
slurm-tool j|job <jobid>      show everything else
slurm-tool steps <jobid>      show memory usage of running srun job steps

# Show usage and fair-share values from accounting database:
slurm-tool h|history <time>   show jobs finished since, e.g. "1day" (default)
slurm-tool shares

# Show nodes and resources in the cluster:
slurm-tool p|partitions       all partitions
slurm-tool n|nodes            all cluster nodes
slurm-tool c|cpus             total cpu cores in use
slurm-tool cpus <partition>   cores available to partition, allocated and free
slurm-tool cpus jobs          cores/memory reserved by running jobs
slurm-tool cpus queue         cores/memory required by pending jobs
slurm-tool features           List features and GRES

# Other:
slurm-tool v|version          Print versions of slurm-tool tool and slurm itself
```
- Simple helper script for basic usage
```
# ~/.sbatch_helper.py
# You can easily use it by adding the below line in .bashrc:
#   alias my_sbatch="python3 ~/.sbatch_helper.py"

machine = int(input("CPU (0) GPU-16GB (1) GPU-32GB (2): "))
if machine == 0:
    cpus = int(input("# of CPUs (Max 128): "))
    cpus = f"--ntasks-per-node={cpus}"
    gpus = ""
else:
    gpus = int(input("# of GPUs (Max 8): "))
    gpus = "-N 1 --gpus=v100-" + ["", "16", "32"][machine] + f":{gpus}"
    cpus = "--cpus-per-gpu 5"
hours = int(input("# of hours (Max 48): "))
machine_name = ["RM", "GPU", "GPU"][machine]
print(f"sbatch -p {machine_name}-shared {gpus} {cpus} -t {hours}:00:00")
```
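For reference, a sample interaction with the helper (using the `my_sbatch` alias from the comment above); the printed `sbatch` line follows directly from the script for these inputs.

```
$ my_sbatch
CPU (0) GPU-16GB (1) GPU-32GB (2): 1
# of GPUs (Max 8): 1
# of hours (Max 48): 24
sbatch -p GPU-shared -N 1 --gpus=v100-16:1 --cpus-per-gpu 5 -t 24:00:00
```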
ESPnet installation steps
- Miniconda installation
```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
- Load modules and set environment (deprecated; you no longer need to do this)
- Kaldi download/installation:
```
# If you only use ESPnet2, simply cloning kaldi is sufficient:
cd /ocean/projects/cis210027p/<user>
git clone https://github.com/kaldi-asr/kaldi

# If you use ESPnet version 1 (not very common), you need to compile kaldi:
cd /ocean/projects/cis210027p/<user>
git clone https://github.com/kaldi-asr/kaldi
cd kaldi/tools
make -j 8
./extras/install_irstlm.sh
cd ../src/
./configure --use-cuda=no
make -j clean depend; make -j 8
```
- ESPnet installation:
```
cd /ocean/projects/cis210027p/<user>
git clone https://github.com/espnet/espnet
cd espnet/tools/
ln -s /ocean/projects/cis210027p/<user>/kaldi .
. ./setup_anaconda.sh <MINICONDA_ROOT> espnet 3.9
# The CUDA and Torch versions below are recommended, but you can change them based on your needs.
make -j 8 CUDA_VERSION=11.7 TH_VERSION=1.13.1
```
- Verify torch installation:
```
srun --pty -p GPU-shared -N 1 --gpus=v100-16:1 /bin/bash -l
nvidia-smi
cd /ocean/projects/cis210027p/<user>/espnet/tools
. ./activate_python.sh
python -c "import torch; print(torch.cuda.is_available())"
exit
```
ESPnet usage tutorial
- Full documentation: https://espnet.github.io/espnet/tutorial.html
Running an example `an4` recipe with `run.sh` using the slurm backend
- Go to the directory:

```
cd egs2/an4/asr1
```

- Modify `cmd.sh` and `conf/slurm.conf` to be ready for slurm job management:

```
# cmd.sh, around line 31
cmd_backend='slurm'
# cmd_backend='local'
```

```
# conf/slurm.conf
# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
default time=48:00:00
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
option gpu=0 -p RM-shared
option gpu=* -p GPU-shared --gres=gpu:$0 -c $0  # Recommend allocating more CPUs than, or equal to, the number of GPUs
# note: the --max-jobs-run option is supported as a special case
# by slurm.pl and you don't have to handle it in the config file.
```
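As an optional sanity check after editing (assuming you are still in `egs2/an4/asr1`), you can confirm the backend switch and the partition mapping with plain `grep`:

```
grep "cmd_backend=" cmd.sh           # should show cmd_backend='slurm'
grep "option gpu=" conf/slurm.conf   # should show the RM-shared / GPU-shared mapping
```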
- Run the experiment:

```
./run.sh --stage 1 --stop-stage 13
```
Running an example `an4` recipe step by step
- Execute the data preparation stages: `-1` and `0`
    - Stage `-1` downloads and un-tars the `an4` dataset, with 948 training and 130 test utterances.
    - Stage `0` prepares the dataset by creating the `data/train` and `data/test` directories (the names of these directories can vary for other datasets). Each of these directories contains 4 files: `wav.scp`, `text`, `utt2spk` and `spk2utt`. The mappings from a unique utterance-ID to the utterance's audio filepath, transcript and speaker-ID are stored in `wav.scp`, `text` and `utt2spk` respectively. The inverse mapping from a speaker-ID to that speaker's utterance-IDs is stored in `spk2utt`. (See the illustration after this block.)

```
cd /ocean/projects/cis210027p/<user>/espnet
cd egs/an4/asr1
./run.sh --stage -1 --stop_stage 0
```
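To make the four files concrete, here is a hedged illustration of their per-line formats; the IDs in the comments are schematic placeholders, not actual `an4` entries.

```
head -1 data/train/wav.scp   # <utt-id> /path/to/audio.wav
head -1 data/train/text      # <utt-id> TRANSCRIPT OF THE UTTERANCE
head -1 data/train/utt2spk   # <utt-id> <spk-id>
head -1 data/train/spk2utt   # <spk-id> <utt-id-1> <utt-id-2> ...
```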
- Execute the feature extraction stage: `1`
    - Stage `1` extracts 80-dimensional log-mel features and 3-dimensional pitch features from 25ms frames shifted every 10ms for each audio sample.
    - The parameter `nj` in this stage's code is the number of CPUs used to extract the features in parallel.
    - Since this dataset doesn't provide a validation split, the 948 training utterances are split into 848 training and 100 dev samples.
    - A mapping from utterance-IDs to feature filepaths is written to the `data/*/feats.scp` files, and the extracted features are stored in the `dump/*/deltafalse/feats.{1-$nj}.ark` files.

```
./run.sh --stage 1 --stop_stage 1
```
- Prepare the dictionary and data.json in stage `2`
    - A mapping from each character and the special tokens to a unique token-ID is stored as a dictionary at `data/lang_1char/train_nodev_units.txt`.
    - For each utterance-ID, the pair of extracted features and mapped tokens (using the above dictionary) is stored as a JSON file in the respective `dump` directories.

```
./run.sh --stage 2 --stop_stage 2
```
- Train the RNN-LM and ASR models in stages `3` and `4`
    - Trained LM models are stored by default in the `exp/train_rnnlm_pytorch_lm_word100/` directory.
    - Trained ASR models are stored by default in the `exp/train_nodev_pytorch_train_mtlalpha1.0/results/` directory.
    - Request GPU resources for training:

```
sbatch -t 2-00:00:00 -p GPU-shared -N 1 --gpus=v100-16:1 --mem=16G train_model.sh
```

    - Preview of the `train_model.sh` file to train the RNN-based language model (RNN-LM) in stage `3`:

```
#!/usr/bin/env bash
cd /ocean/projects/cis210027p/<user>/espnet/egs/an4/asr1
./run.sh --stage 3 --stop_stage 3
```

    - Preview of the `train_model.sh` file to train the ASR model in stage `4`:

```
#!/usr/bin/env bash
cd /ocean/projects/cis210027p/<user>/espnet/egs/an4/asr1
./run.sh --stage 4 --stop_stage 4
```
- Request computational resources and perform decoding in stage `5`
    - The parameter `nj` in this stage's code is the number of CPUs used to decode each recognition set (the validation and test splits) in parallel, so we must request 2 x `nj` CPUs in total.
    - Note that for each CPU, we can request a maximum memory of 2000M.
    - Request RM-shared node resources for decoding:

```
sbatch -t 12:00:00 -p RM-shared -N 1 --cpus-per-task=16 --mem=32000M decode_model.sh
```

    - Preview of the `decode_model.sh` file to decode the ASR model in stage `5`:

```
#!/usr/bin/env bash
cd /ocean/projects/cis210027p/<user>/espnet/egs/an4/asr1
./run.sh --stage 5 --stop_stage 5
```
- Misc.
    - An interrupted training run can be resumed by passing the `--resume` parameter with the path to the most recent snapshot in the `exp/train_nodev_pytorch_train_mtlalpha1.0/results/` directory (see the example after this list).
    - A unique tag for managing your experiments can be set with the `--lmtag` and `--tag` parameters.
    - Multi-GPU training can be done by specifying `--ngpu` in the training stages.
    - For interactive debugging purposes, the `srun` command can be used to request GPU-shared or RM-shared nodes as below. Note that PSC has limited service units (SUs), so use `srun`-based debugging only for the required duration.

```
srun --pty -p GPU-shared -N 1 --gpus=v100-16:1 /bin/bash -l
srun --pty -p RM-shared -N 1 --cpus-per-task=16 --mem=32000M /bin/bash -l
```
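Below is a hedged sketch of resuming ASR training (stage 4) with a custom tag. The snapshot filename and the tag are illustrative; pick the most recent snapshot actually present in your `results/` directory.

```
cd /ocean/projects/cis210027p/<user>/espnet/egs/an4/asr1
./run.sh --stage 4 --stop_stage 4 \
    --tag my_resumed_run \
    --resume exp/train_nodev_pytorch_train_mtlalpha1.0/results/snapshot.ep.10
```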