# Megatron on AWS UltraCluster

This tutorial walks through the end-to-end process of configuring a cluster with AWS ParallelCluster and a customized Deep Learning AMI, preprocessing a large dataset on CPU-optimized Amazon EC2 instances, and training a GPT-2 language model on an Amazon EC2 UltraCluster.

Familiarity with AWS Virtual Private Cloud (VPC) concepts such as subnets and Availability Zones, the AWS CLI, and bash scripting is recommended.

## Contents:

* [Contents](<#contents>)
* [ParallelCluster Management Setup](<#parallelcluster-management-setup>)
  * [Local Environment](<#local-environment>)
  * [AWS](<#aws>)
* [Building a Custom AMI](<#building-a-custom-ami>)
* [Configure and Deploy a Cluster](<#configure-and-deploy-a-cluster>)
* [Preprocessing the Training Dataset with CPU Instances](<#preprocessing-the-training-dataset-with-cpu-instances>)
* [Model Parallel Training on a p4d.24xlarge UltraCluster](<#model-parallel-training-on-a-p4d24xlarge-ultracluster>)
  * [Monitoring Training with TensorBoard](<#monitoring-training-with-tensorboard>)

## Prerequisites

If you don't have Python 3 and pip installed, follow the instructions on the [pip installation](<https://pip.pypa.io/en/latest/installing/>) page before running the commands below.

## ParallelCluster Management Setup

#### Local Environment

To deploy a cluster with AWS ParallelCluster you'll need to install the `aws-parallelcluster` CLI.
From the root of this project, run the following to create and activate a virtual environment with ParallelCluster installed:

```bash
python3 -m venv .megatron_env
source .megatron_env/bin/activate
python3 -m pip install awscli aws-parallelcluster==2.10.1
pcluster version

# Set AWS Region
export AWS_REGION="us-west-2"
```

To run this sample repo, you will need AWS credentials configured for the AWS CLI. Refer to the [Getting Started documentation](<https://docs.aws.amazon.com/parallelcluster/latest/ug/install.html>) for more information on setting up the ParallelCluster CLI.
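As an optional sanity check, you can configure credentials (if you have not already) and confirm that the CLI can authenticate by asking STS for the active identity:

```bash
# Configure credentials if not already done
aws configure

# Verify that the CLI can authenticate; prints your account and identity ARN
aws sts get-caller-identity --region ${AWS_REGION}
```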

If you don't have HashiCorp Packer installed, follow the instructions on the [HashiCorp Packer getting started](<https://learn.hashicorp.com/tutorials/packer/getting-started-install?in=packer/getting-started>) page before running the commands below.
Packer is required to build custom AMIs.
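A quick way to confirm Packer is installed and on your `PATH`:

```bash
packer version
```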

#### AWS

This sample repo assumes an existing VPC with private and public subnets. For details on how to provision such infrastructure, see [this tutorial](<https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-public-private-vpc.html>).
Private subnets are a requirement for running p4d.24xlarge instances with 4 EFA cards.
Note that the private subnet must have a NAT Gateway, since AWS ParallelCluster and the Megatron-LM lab require internet access; you can verify this after setting the subnet IDs below.

Use the following commands to list VPCs, Subnets and Availability Zones:

```bash
aws ec2 describe-subnets --query 'Subnets[].{VPC:VpcId,SUBNET:SubnetId,AZ:AvailabilityZone}' --region ${AWS_REGION}
```

Take note of the IDs needed to configure the cluster environment and set the following environment variables:

```bash
VPC_ID=<value>
PUBLIC_SUBNET_ID=<value>
PRIVATE_SUBNET_ID=<value>
```
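As an optional check that the private subnet routes outbound traffic through a NAT Gateway, you can inspect its route table. Note that if the subnet uses the VPC's main route table implicitly (no explicit association), this command may return an empty list:

```bash
# Expect a route whose target is a nat-... gateway ID
aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=${PRIVATE_SUBNET_ID}" \
    --query 'RouteTables[].Routes[].{Destination:DestinationCidrBlock,NatGateway:NatGatewayId}' \
    --region ${AWS_REGION}
```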

This sample also requires an S3 bucket and an EC2 key pair. You can use the following AWS CLI commands to create new ones:

```bash
# Create an EC2 key pair
SSH_KEY_NAME="megatron-lab-key"

aws ec2 create-key-pair --key-name ${SSH_KEY_NAME} \
    --query KeyMaterial \
    --region ${AWS_REGION} \
    --output text > ~/.ssh/${SSH_KEY_NAME}

# Create an S3 bucket name with a random suffix
BUCKET_POSTFIX=$(python3 -S -c "import uuid; print(str(uuid.uuid4().hex)[:10])")
BUCKET_NAME="megatron-lab-${BUCKET_POSTFIX}"

aws s3 mb s3://${BUCKET_NAME} --region ${AWS_REGION}
```
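SSH refuses private keys that are readable by other users, so restrict permissions on the downloaded key file before using it:

```bash
chmod 600 ~/.ssh/${SSH_KEY_NAME}
```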

## Building a Custom AMI

[Build a custom AMI](<https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html>) to avoid long provisioning times associated with using [post installation scripts](<https://docs.aws.amazon.com/parallelcluster/latest/ug/cluster-definition.html#post-install>) for Megatron-LM dependencies.

The base AMI for customization is an AWS Deep Learning AMI (DLAMI).
It already provides the software required to run distributed training of large machine learning models, including NVIDIA drivers and CUDA, the EFA plugin, and the major deep learning frameworks such as PyTorch and TensorFlow, managed in Conda environments.
The Conda package manager can also manage the Megatron-LM dependencies.

To retrieve the AMI ID of the Deep Learning AMI Version 38.0 based on Amazon Linux 2 in the region of deployment, use the following command:

```bash
# Retrieve Deep Learning AMI ID
export DEEP_LEARNING_AMI_ID=`aws ec2 describe-images --owners amazon \
    --query 'Images[*].{ImageId:ImageId,CreationDate:CreationDate}' \
    --filters "Name=name,Values='Deep Learning AMI (Amazon Linux 2) Version 38.0'" \
    --region ${AWS_REGION} \
    | jq -r 'sort_by(.CreationDate)[-1] | .ImageId'`
```
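Optionally confirm that an AMI ID was resolved before moving on:

```bash
# Should print a single AMI ID, e.g. ami-0123456789abcdef0
echo ${DEEP_LEARNING_AMI_ID}
```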

Before building the custom AMI, modify the argument values between `<...>` in the base configuration file for AWS ParallelCluster, located at `./configs/base-config-build-ami.ini`.

The build calls the script [custom\_dlami\_user\_data.sh](<./scripts/custom_dlami_user_data.sh>), which installs Megatron-LM and its dependencies, including NVIDIA APEX.
The instance type used for the build is `p4d.24xlarge` (`-i p4d.24xlarge`), since NVIDIA APEX is compiled for the host's platform during installation.

The instructions below set the variables in the AWS ParallelCluster configuration file used to build the custom AMI, i.e. `./configs/base-config-build-ami.ini`.

```bash
git clone https://github.com/pixelb/crudini

# Install dependencies
pip3 install iniparse

# Change the cluster configuration file
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "aws" aws_region_name "${AWS_REGION}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "vpc megatron" vpc_id "${VPC_ID}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "vpc megatron" master_subnet_id "${PUBLIC_SUBNET_ID}"
python3 crudini/crudini --set ./configs/base-config-build-ami.ini "cluster base-config-build-ami" key_name "${SSH_KEY_NAME}"
```
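If you want to double-check the values written to the file, `crudini --get` reads them back (shown here for two of the keys as an example):

```bash
# Read back a couple of the values set above
python3 crudini/crudini --get ./configs/base-config-build-ami.ini "vpc megatron" vpc_id
python3 crudini/crudini --get ./configs/base-config-build-ami.ini "cluster base-config-build-ami" key_name
```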

To build the custom AMI from the root of the sample project, use the `pcluster createami` command as shown in the script [create\_custom\_ami.sh](<./scripts/create_custom_ami.sh>).

```bash
./scripts/create_custom_ami.sh
```
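For reference, the wrapper script roughly corresponds to a `pcluster createami` invocation along the following lines. This is a sketch only: the flag names assume ParallelCluster 2.10, and the exact arguments used by the sample are defined in `create_custom_ami.sh`, so consult that script and `pcluster createami --help` for the authoritative command.

```bash
# Sketch: customize the base DLAMI on a p4d.24xlarge build instance
pcluster createami \
    -ai ${DEEP_LEARNING_AMI_ID} \
    -os alinux2 \
    -i p4d.24xlarge \
    -c ./configs/base-config-build-ami.ini \
    -r ${AWS_REGION}
```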

After the build completes, the AMI ID is printed on screen:

```bash
Custom AMI ami-xxxxxxxxxxxxx created with name megatron-on-pcluster-aws-parallelcluster-2.10.1-amzn2-hvm-x86_64-202101071208

To use it, add the following variable to the AWS ParallelCluster config file, under the [cluster ...] section
custom_ami = ami-xxxxxxxxxxxxx
```

Set the following variable; it will be used to configure the AWS ParallelCluster cluster for running Megatron-LM:

```bash
CUSTOM_AMI=<your value>
```

## Configure and Deploy a Cluster

Use the [configs/multi-queue-config.ini](<./configs/multi-queue-config.ini>) configuration file to stand up the cluster.
You can do that manually by changing the VPC, COMPUTE and FSX sections or use the following commands to change the configuration file using `crudini`:

```bash
# Change the cluster configuration file
python3 crudini/crudini --set ./configs/multi-queue-config.ini "aws" aws_region_name "${AWS_REGION}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" vpc_id "${VPC_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" master_subnet_id "${PUBLIC_SUBNET_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "vpc megatron" compute_subnet_id "${PRIVATE_SUBNET_ID}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" key_name "${SSH_KEY_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" s3_read_write_resource "arn:aws:s3:::${BUCKET_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "cluster multi-queue-us-west-2" custom_ami "${CUSTOM_AMI}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "fsx sharedfsx" import_path "s3://${BUCKET_NAME}"
python3 crudini/crudini --set ./configs/multi-queue-config.ini "fsx sharedfsx" export_path "s3://${BUCKET_NAME}"
```

You are now ready to create the cluster with:

```bash
pcluster create megatron-on-pcluster -c configs/multi-queue-config.ini
```

Once deployment completes, the head node's public and private IPs are printed on screen:

```bash
Creating stack named: parallelcluster-megatron-on-pcluster
Status: parallelcluster-megatron-on-pcluster - CREATE_COMPLETE
MasterPublicIP: yyy.yyy.yy.yyy
ClusterUser: ec2-user
MasterPrivateIP: xxx.xxx.xx.xxx
```

Access the cluster head node using the CLI command:

```bash
pcluster ssh megatron-on-pcluster -i ~/.ssh/${SSH_KEY_NAME}
```

## Preprocessing the Training Dataset with CPU Instances

Once connected to the cluster head node, set up a data folder in the _/lustre_ directory and download the latest English Wikipedia data dump from Wikimedia.
This process follows the original [Megatron-LM documentation](<https://github.com/NVIDIA/Megatron-LM#datasets>):

```bash
export WIKI_DIR=/lustre/data/wiki
mkdir -p $WIKI_DIR && cd $WIKI_DIR

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

Download the vocab and merge table files for the desired model. This example uses the GPT-2 model:

```bash
export DATA_DIR=/lustre/data
export GPT2_DATA=${DATA_DIR}/gpt2

mkdir -p ${GPT2_DATA} && cd ${GPT2_DATA}

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt

mkdir -p ${GPT2_DATA}/checkpoint
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O ${GPT2_DATA}/checkpoint/megatron_lm_345m_v0.0.zip
```

Once the data are available, provision a CPU node using Slurm: `salloc --nodes 1 -p cpu`.

All data preprocessing work will proceed from the CPU machine. You can check the provisioning status of your new machine using the `squeue` command. Once the status, _ST_, changes to running, _R_, access the CPU machine terminal through ssh with: `ssh cpu-dy-c5n18xlarge-1`.
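Putting those steps together, the flow from the head node looks like this:

```bash
# Request one node from the cpu partition; blocks until the allocation is granted
salloc --nodes 1 -p cpu

# Watch the job state; wait for ST to show R (running)
squeue

# SSH to the newly provisioned CPU node
ssh cpu-dy-c5n18xlarge-1
```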

Extract the downloaded data using WikiExtractor:

```bash
git clone https://github.com/attardi/wikiextractor.git /lustre/wikiextractor
cd /lustre/wikiextractor
python -m wikiextractor.WikiExtractor --json /lustre/data/wiki/enwiki-latest-pages-articles.xml.bz2 --output /lustre/data/wiki/text/ -q --processes 70 2>&1 | tee wikiextract.out &
```

WikiExtractor first preprocesses the templates of all pages sequentially, then runs a Map/Reduce process to extract the pages and convert them to the loose JSON format required by Megatron-LM.
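Once extraction is underway, you can peek at one of the output shards to see the format: each line is a standalone JSON object with a `text` field (plus `id`, `url`, and `title`), which is the loose JSON layout Megatron-LM's preprocessing expects. The shard path below is illustrative; the subdirectory names (`AA`, `AB`, ...) are assigned by WikiExtractor.

```bash
# Print the first record of the first shard (example path)
head -n 1 /lustre/data/wiki/text/AA/wiki_00
# Example line: {"id": "12", "url": "...", "title": "Anarchism", "text": "Anarchism is ..."}
```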

Once the extraction completes, we merge the text files with:

```bash
conda activate pytorch_latest_p37
cd /lustre/data/wiki
find /lustre/data/wiki/text/ -name 'wiki*' | parallel -m -j 70 "cat {} >> mergedfile.json"
```

The resulting `mergedfile.json` is about 16GB on disk. Use it to create the binary data format for Megatron GPT-2.
**NOTE**: Refer to [this solution](<https://github.com/NVIDIA/Megatron-LM/issues/62>) if an `IndexError: list index out of range` occurs.

To create the binary data, type the following command:

```bash
python /home/ec2-user/megatron/tools/preprocess_data.py \
    --input /lustre/data/wiki/mergedfile.json \
    --output-prefix my-gpt2 \
    --vocab /lustre/data/gpt2/gpt2-vocab.json \
    --dataset-impl mmap \
    --tokenizer-type GPT2BPETokenizer \
    --merge-file /lustre/data/gpt2/gpt2-merges.txt \
    --append-eod \
    --workers 70
```
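When preprocessing finishes, you should find an indexed binary dataset next to the merged file: the prefix comes from `--output-prefix` and the `text_document` suffix from the JSON key being processed. A quick check, assuming the command above was run from `/lustre/data/wiki`:

```bash
# Expect two files: my-gpt2_text_document.bin (token data) and my-gpt2_text_document.idx (index)
ls -lh /lustre/data/wiki/my-gpt2_text_document.*
```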

Once all preprocessing is done we can persist the data from FSx back to S3 using a [Data Repository Task](<https://docs.aws.amazon.com/fsx/latest/LustreGuide/export-data-repo-task.html>) from the terminal used to spin up the cluster.
This guarantees that data gets persisted even if the cluster is terminated.

```bash
# Retrieve the FSx for Lustre file system Id
export FSX_ID=$(aws fsx describe-file-systems --query "FileSystems[?LustreConfiguration.DataRepositoryConfiguration.ExportPath=='s3://${BUCKET_NAME}'].FileSystemId" --output text --region ${AWS_REGION})
# Create data repository task
aws fsx create-data-repository-task \
    --file-system-id $FSX_ID \
    --type EXPORT_TO_REPOSITORY \
    --paths data \
    --report Enabled=true,Scope=FAILED_FILES_ONLY,Format=REPORT_CSV_20191124,Path=s3://${BUCKET_NAME}/reports \
    --region ${AWS_REGION}
```
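The export runs asynchronously; as an optional check, you can poll the task's lifecycle state until it reports `SUCCEEDED`:

```bash
# List export tasks for the file system and their current state
aws fsx describe-data-repository-tasks \
    --filters Name=file-system-id,Values=${FSX_ID} \
    --query 'DataRepositoryTasks[].{Id:TaskId,Status:Lifecycle}' \
    --region ${AWS_REGION}
```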

Return to the original terminal by running `exit` twice: once to close the `ssh` session on the CPU node, and once to release the `salloc` Slurm allocation.

## Model Parallel Training on a p4d.24xlarge UltraCluster

In this section you will train the 8-billion-parameter version of the Megatron-LM GPT-2 model across 64 GPUs (8 p4d.24xlarge instances). Log back into the cluster head node using `pcluster ssh ...` if you are not already on the machine.

Start by creating a training script according to the [original documentation](<https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt_distributed.sh>).
To train using `slurm` on 8 nodes, modify the distributed world configuration section according to the script [scripts/train\_8B\_gpt2.sh](<./scripts/train_8B_gpt2.sh>).
Make sure to include the CUDA, EFA and NCCL environment variables to enable NCCL to communicate between GPUs through AWS EFA using GPU Remote Direct Memory Access.

Create a file named `/lustre/scripts/train_8B_gpt2.sh` and copy the content below into it.

```bash
#!/bin/bash
# scripts/train_8B_gpt2.sh

# Shared data paths from the FSx for Lustre mount point /lustre
DATA_PATH=/lustre/data/wiki/my-gpt2_text_document
CHECKPOINT_PATH=/lustre/data/gpt2/checkpoint
VOCAB_FILE=/lustre/data/gpt2/gpt2-vocab.json
MERGES_FILE=/lustre/data/gpt2/gpt2-merges.txt

# Distributed World configuration
MP_SIZE=8
GPUS_PER_NODE=8
DDP_IMPL=torch
MASTER_ADDR=$SLURM_SUBMIT_HOST
MASTER_PORT=6000
NNODES=$SLURM_NTASKS
NODE_RANK=$SLURM_NODEID
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

# CUDA, EFA and NCCL configs
export LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_ALGO=ring
export NCCL_DEBUG=INFO
export RDMAV_FORK_SAFE=1

# Distributed args for Pytorch DDP
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Training:
/home/ec2-user/anaconda3/envs/pytorch_latest_p37/bin/python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    /home/ec2-user/megatron/pretrain_gpt2.py \
    --model-parallel-size $MP_SIZE \
    --DDP-impl $DDP_IMPL \
    --num-layers 42 \
    --hidden-size 4096 \
    --num-attention-heads 32 \
    --batch-size 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --train-iters 1000 \
    --lr-decay-iters 320000 \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --vocab-file $VOCAB_FILE \
    --merge-file $MERGES_FILE \
    --data-impl mmap \
    --split 949,50,1 \
    --distributed-backend nccl \
    --lr 0.00015 \
    --lr-decay-style cosine \
    --min-lr 1.0e-5 \
    --weight-decay 1e-2 \
    --clip-grad 1.0 \
    --warmup .01 \
    --checkpoint-activations \
    --distribute-checkpointed-activations \
    --log-interval 50 \
    --save-interval 1000 \
    --eval-interval 1000 \
    --eval-iters 10 \
    --num-workers 2 \
    --fp16 \
    --tensorboard-dir /lustre/logs/gpt2_param8B_nodes16_bs16_sjob${SLURM_JOB_ID}

set +x
```

```bash
chmod +x /lustre/scripts/train_8B_gpt2.sh
```

To drive the `sbatch` execution of the training script, wrap it in a `job.sh` script stored on a path shared across all nodes, such as `/lustre/scripts`:

```bash
mkdir -p /lustre/scripts
cat > /lustre/scripts/job.sh << EOF
#!/bin/bash
#SBATCH --wait-all-nodes=1
#SBATCH -p gpu
#SBATCH -n 8
#SBATCH -N 8
#SBATCH -o out_%j.out
srun /lustre/scripts/train_8B_gpt2.sh
EOF

sbatch /lustre/scripts/job.sh
```

The `sbatch` command above submits the training job from the head node of the cluster. The output from the run is recorded in an `out_<job_id>.out` file in the current folder.
If your job fails with `slurmstepd: error: execve(): /lustre/scripts/train_8B_gpt2.sh: Permission denied`, change the permissions of your scripts with `chmod +x /lustre/scripts/*.sh`.
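While the job runs, you can keep an eye on it from the head node with standard Slurm tooling and by tailing the output file (the job ID is reported by `sbatch` when the job is submitted):

```bash
# Check that the GPU nodes are allocated and the job is in the R state
squeue

# Follow the training log; replace <job_id> with the ID reported by sbatch
tail -f out_<job_id>.out
```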

When inspecting the NCCL logs in the `.out` file, expect to find entries showing that EFA was selected as the OFI provider, such as the one below:

```bash
gpu-dy-p4d24xlarge-10:33337:33337 [0] NCCL INFO NET/OFI Selected Provider is efa
```

### Monitoring Training with TensorBoard

The Megatron-LM framework writes TensorBoard logs to the `--tensorboard-dir` specified in the training script. The custom AMI built for the cluster has TensorBoard installed in the `pytorch_latest_p37` environment used for training.
Use the following command to start TensorBoard in the background and expose it on a specific port:

```bash
python -m tensorboard.main --port=8080 --logdir /lustre/logs --host 0.0.0.0 2>&1 | tee ~/tensorboard.logs & disown
```

Using the following `ssh` tunnel configuration when connecting to the head node, you can access TensorBoard at `localhost:8080`:

```bash
pcluster ssh megatron-on-pcluster -i ~/.ssh/${SSH_KEY_NAME} -L 8080:localhost:8080
```