---
title: "d. Run PyTorch Data Parallel training on ParallelCluster"
date: 2020-09-04T15:58:58Z
weight: 30
tags: ["training", "data parallel", "ML", "sbatch", "slurm", "multi node", "multi gpu"]
---

In this step you will use the **PyTorch** _DistributedDataParallel API_ to train a _Natural Language Understanding_ model using the **Fairseq** framework. You will create a **SLURM** batch script to run the data parallel job across multiple GPU nodes and configure the **PyTorch** API to distribute tasks across the GPUs in each node.

#### Create distributed training job scripts

Create the `train.sh` script, which wraps `fairseq-train`, in the _/lustre_ shared folder with the following commands:

```bash
cd /lustre
cat > train.sh << EOF
#!/bin/bash

# set up the data and checkpoint locations
DATABIN=/lustre/data/wikitext-103
OUTDIR=/lustre/data/out && mkdir -p \$OUTDIR
SAVEDIR=/lustre/checkpoints

# set up environment variables for Torch DistributedDataParallel
WORLD_SIZE_JOB=\$SLURM_NTASKS
RANK_NODE=\$SLURM_NODEID
PROC_PER_NODE=8
MASTER_ADDR_JOB=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT_JOB="12234"
DDP_BACKEND=c10d

# set up NCCL to use EFA
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_DEBUG=INFO
export NCCL_TREE_THRESHOLD=0
export NCCL_SOCKET_IFNAME=eth0
export LD_LIBRARY_PATH=$HOME/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:\$LD_LIBRARY_PATH

# set up fairseq-train additional arguments
BUCKET_CAP_MB=200
TOTAL_UPDATE=500
MAX_SENTENCES=8
UPDATE_FREQ=1

# call fairseq-train through torchrun
torchrun \
    --nproc_per_node=\$PROC_PER_NODE \
    --nnodes=\$WORLD_SIZE_JOB \
    --node_rank=\$RANK_NODE \
    --master_addr=\${MASTER_ADDR_JOB} \
    --master_port=\${MASTER_PORT_JOB} \
    \$(which fairseq-train) \
    \$DATABIN \
    --log-format json --log-interval 25 \
    --seed 1 \
    --fp16 --memory-efficient-fp16 \
    --criterion masked_lm \
    --optimizer adam \
    --lr-scheduler polynomial_decay \
    --task masked_lm \
    --num-workers 2 \
    --max-sentences \$MAX_SENTENCES \
    --ddp-backend \$DDP_BACKEND \
    --bucket-cap-mb \$BUCKET_CAP_MB \
    --fast-stat-sync \
    --arch roberta_large \
    --max-epoch 2 \
    --max-update \$TOTAL_UPDATE \
    --clip-norm 1.0 \
    --update-freq \$UPDATE_FREQ \
    --lr 0.0006 \
    --save-dir \$SAVEDIR \
    --sample-break-mode complete \
    --tokens-per-sample 512 \
    --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 \
    --warmup-updates 24000 \
    --total-num-update \$TOTAL_UPDATE \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --weight-decay 0.01 | tee \$OUTDIR/train.\$RANK_NODE.\$WORLD_SIZE_JOB.log
EOF
chmod +x train.sh
```

The script starts by setting up the paths for the training data, output, and checkpoints. All paths reference _/lustre_ and are visible to all compute nodes. Next, the script sets environment variables for the **PyTorch** _DistributedDataParallel API_ based on the **SLURM** environment. It also sets the environment variables required for [NCCL](https://developer.nvidia.com/nccl) to work with [AWS EFA](https://aws.amazon.com/hpc/efa/). The last section invokes `torchrun`, which launches the training processes based on the environment variables set above. For more information on the _DistributedDataParallel API_, refer to the [documentation here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
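You do not need to write any _DistributedDataParallel_ code yourself for this lab, because **Fairseq** sets it up internally. If you are curious how the pieces fit together, the minimal sketch below shows how a script launched by `torchrun` typically initializes the API: `torchrun` turns the `--nnodes`, `--node_rank`, and `--nproc_per_node` arguments into per-process `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables, and the training script reads them to bind each process to one GPU. The file name `ddp_check.py`, the toy `Linear` model, and the tensor sizes are illustrative placeholders, not part of the workshop.

```python
# ddp_check.py - minimal DistributedDataParallel sketch (hypothetical file,
# not part of the workshop). It only illustrates how a torchrun-launched
# process joins the process group and wraps a model in DDP.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process it
    # spawns, derived from --nnodes, --node_rank, and --nproc_per_node.
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # NCCL is the backend used for multi-GPU training; it picks up the EFA
    # settings exported in train.sh (FI_PROVIDER, NCCL_* variables).
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # A toy model stands in for the RoBERTa model that fairseq builds.
    model = torch.nn.Linear(512, 512).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # One dummy forward/backward pass: DDP averages gradients across all
    # world_size processes during backward().
    out = ddp_model(torch.randn(8, 512, device=local_rank))
    out.sum().backward()

    if rank == 0:
        print(f"DDP OK: {world_size} processes, this is global rank {rank}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Such a script could be launched with the same `torchrun` arguments used in _train.sh_, substituting `ddp_check.py` for `$(which fairseq-train)`.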
Create the **SLURM** batch script with the following commands:

```bash
cd /lustre
cat > job.slurm << EOF
#!/bin/bash
#SBATCH --wait-all-nodes=1
#SBATCH --gres=gpu:8
#SBATCH --nodes=2
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH -o out_%j.out

srun /lustre/train.sh
EOF
```

This script requests 2 nodes with the argument `--nodes=2`, the same as the maximum number of nodes defined in the cluster configuration file. The argument `--gres=gpu:8` requests 8 GPUs per node, the number available in each compute instance. With the _train.sh_ and _job.slurm_ scripts, you are ready to execute the multi-node, multi-GPU distributed training job.

#### Run the distributed data parallel training job

Use the `sbatch job.slurm` command to launch replicas of the _train.sh_ script across the different nodes:

```bash
cd /lustre
sbatch job.slurm
```

This command adds a **SLURM** job to the queue and logs its output to an `out_<job id>.out` file (the `%j` in the batch script is replaced by the job id). Check the status of the job with `squeue -ls` and `sinfo -ls`. If fewer than the requested 2 nodes are available, ParallelCluster will spin up new instances; this process takes a few minutes. Once the job starts, you can follow its progress by tailing the output file with `tail -f out_<job id>.out`. As training starts, you will see output such as the following:

![Training Output](/images/ml/training.png)

The model trains at approximately 66,000 words per second (_wps_) with a batch size of 128 and around 55,000 words per batch (_wpb_). In this configuration, training one epoch takes approximately 9 minutes to complete. At the end of the epoch, a model checkpoint file is created in _/lustre/checkpoints_ and can be used to continue training or for further fine-tuning; an optional checkpoint-inspection sketch is included at the end of this page.

**Fairseq** saves the model state at certain points during training in the _/lustre/checkpoints_ directory. If you wish to persist the model checkpoints for later use, copy them to the S3 bucket with the following command - make sure to replace `<your bucket id>` with your bucket's id:

```bash
cd /lustre
aws s3 cp checkpoints s3://mlbucket-<your bucket id>/checkpoints --recursive
```

Next, cleanup!
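Optionally, before you clean up, you can peek inside the checkpoint you just saved to confirm it is usable. **Fairseq** checkpoints are ordinary `torch.save` files, so a few lines of plain PyTorch are enough to inspect one, and resubmitting the training job with the same `--save-dir` will normally resume from the latest checkpoint. The sketch below is an illustration only: `checkpoint_last.pt` is fairseq's usual name for the most recent checkpoint, and the exact dictionary keys depend on the fairseq and PyTorch versions installed on the cluster.

```python
# inspect_checkpoint.py - hedged sketch for peeking inside a fairseq checkpoint.
# Run it on the head node with the same Python environment used for training.
import torch

# checkpoint_last.pt is fairseq's usual name for the latest checkpoint;
# adjust the path if your run produced differently named files.
ckpt_path = "/lustre/checkpoints/checkpoint_last.pt"

# map_location="cpu" avoids needing a GPU just to inspect the file.
# On newer PyTorch releases you may also need to pass weights_only=False,
# since fairseq checkpoints contain more than bare tensors.
ckpt = torch.load(ckpt_path, map_location="cpu")

# The checkpoint is a plain Python dict; the exact keys (model weights,
# optimizer state, training configuration, ...) vary between fairseq versions.
print("top-level keys:", list(ckpt.keys()))

# If a 'model' entry is present, it holds the state_dict of the trained model.
if "model" in ckpt:
    n_params = sum(t.numel() for t in ckpt["model"].values())
    print(f"model parameters: {n_params:,}")
```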