# BERT

TensorFlow 2.1 implementation of pretraining and finetuning scripts for BERT.

The original paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)

Pretraining consists of two phases, using mixed-batch training:

* Phase 1: 14064 steps with a total batch size of 32K at a maximum sequence length of 128, across 8 p3dn.24xlarge nodes.
* Phase 2: 6248 steps with a total batch size of 8K at a maximum sequence length of 512, across 8 p3dn.24xlarge nodes.
* Finally, we finetune on SQuAD v1.1 for 3649 steps with a total batch size of 48 on a single p3dn.24xlarge node.

The SQuAD F1 score combines word-level precision and recall over the predicted answer and ranges from 0 to 100. For fewer nodes, we apply gradient accumulation to reach the same global batch size per step, so the F1 scores (marked "same" in the table) match the 8-node run.

| Model | p3dn Nodes | Phase 1 | Phase 2 | Finetuning | Total Training Time | SQuAD v1.1 F1 | SQuAD v2.0 F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-base | 1 | 32 hrs | 15 hrs | 15 mins | 47 hrs | same | same |
| BERT-base | 2 | 17 hrs | 8 hrs | 15 mins | 25 hrs | same | same |
| BERT-base | 4 | 10 hrs | 4 hrs | 15 mins | 14 hrs | same | same |
| BERT-base | 8 | 5 hrs 33 mins | 2 hrs 53 mins | 15 mins | 8 hrs 41 mins | 87.68 | 76.14 |

### How To Launch Training

All commands should be run from the `models/nlp` directory.

1. Create an FSx volume.

2. Download the datasets onto FSx. The simplest way to start is with English Wikipedia.

3. Create an Amazon Elastic Container Registry (ECR) repository. Then build a Docker image from `models/nlp/Dockerfile` and push it to ECR.

```bash
export ACCOUNT_ID=
export REPO=
export IMAGE=${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/${REPO}:py37_tf211
docker build -t ${IMAGE} .
$(aws ecr get-login --no-include-email)
docker push ${IMAGE}
```

4. Define environment variables to point to the FSx volume. Variables that take multiple values (such as the security group IDs) accept a comma-separated string.

```bash
export SAGEMAKER_ROLE=arn:aws:iam::${ACCOUNT_ID}:role/service-role/AmazonSageMaker-ExecutionRole-20200101T123
export SAGEMAKER_IMAGE_NAME=${IMAGE}
export SAGEMAKER_FSX_ID=fs-123
export SAGEMAKER_SUBNET_IDS=subnet-123
export SAGEMAKER_SECURITY_GROUP_IDS=sg-123,sg-456
```

5. Define BERT-specific run names.

```bash
export PHASE1_RUN_NAME=bertphase1
export PHASE2_RUN_NAME=bertphase2
export PHASE1_STEPS=14064
export PHASE2_STEPS=6248
# The data should be in TFRecords inside /fsx/${TRAIN_DIR}
export TRAIN_DIR=bert_data/train
export VAL_DIR=bert_data/val
export LOG_DIR=logs/bert
export CHECKPOINT_DIR=checkpoints/bert
```

6. Launch the SageMaker Phase 1 training. The per-GPU batch size, GPU count, and gradient accumulation steps combine to give the 32K global batch size (see the sanity check after this step).

```bash
python -m albert.launch_sagemaker \
  --source_dir=. \
  --entry_point=albert/run_pretraining.py \
  --sm_job_name=bert-pretrain-phase1 \
  --instance_type=ml.p3dn.24xlarge \
  --instance_count=8 \
  --train_dir=${TRAIN_DIR} \
  --val_dir=${VAL_DIR} \
  --log_dir=${LOG_DIR} \
  --checkpoint_dir=${CHECKPOINT_DIR} \
  --load_from=scratch \
  --model_type=bert \
  --model_size=base \
  --per_gpu_batch_size=128 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --optimizer=lamb \
  --learning_rate=0.005 \
  --end_learning_rate=0.0003 \
  --hidden_dropout_prob=0.1 \
  --attention_probs_dropout_prob=0.1 \
  --gradient_accumulation_steps=4 \
  --learning_rate_decay_power=0.5 \
  --warmup_steps=2812 \
  --total_steps=${PHASE1_STEPS} \
  --log_frequency=100 \
  --squad_frequency=0 \
  --run_name=${PHASE1_RUN_NAME} \
  --name=mybertphase1
```
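As a sanity check on the Phase 1 launch, the global batch size per optimizer step is the product of the per-GPU batch size, the GPUs per node (8 V100s on a p3dn.24xlarge), the node count, and the gradient accumulation steps. The snippet below is illustrative arithmetic only, with the values copied from the command above; it is not part of the launch script.

```bash
# Illustrative arithmetic: verify the Phase 1 flags reproduce the ~32K global batch size.
PER_GPU_BATCH_SIZE=128   # --per_gpu_batch_size
GPUS_PER_NODE=8          # a p3dn.24xlarge has 8 V100 GPUs
INSTANCE_COUNT=8         # --instance_count
GRAD_ACCUM_STEPS=4       # --gradient_accumulation_steps

GLOBAL_BATCH_SIZE=$((PER_GPU_BATCH_SIZE * GPUS_PER_NODE * INSTANCE_COUNT * GRAD_ACCUM_STEPS))
echo "Global batch size: ${GLOBAL_BATCH_SIZE}"   # prints 32768, i.e. the 32K used in Phase 1
```

The same arithmetic with the Phase 2 flags (32 per GPU, sequence length 512) gives 32 * 8 * 8 * 4 = 8192, i.e. the 8K batch size listed above.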
7. Launch the SageMaker Phase 2 training. It resumes from the Phase 1 checkpoint (without the optimizer state) and switches to a maximum sequence length of 512.

```bash
python -m albert.launch_sagemaker \
  --source_dir=. \
  --entry_point=albert/run_pretraining.py \
  --sm_job_name=bert-pretrain-phase2 \
  --instance_type=ml.p3dn.24xlarge \
  --instance_count=8 \
  --train_dir=${TRAIN_DIR} \
  --val_dir=${VAL_DIR} \
  --log_dir=${LOG_DIR} \
  --checkpoint_dir=${CHECKPOINT_DIR} \
  --load_from=checkpoint \
  --load_optimizer_state=false \
  --model_type=bert \
  --model_size=base \
  --per_gpu_batch_size=32 \
  --max_seq_length=512 \
  --max_predictions_per_seq=80 \
  --optimizer=lamb \
  --learning_rate=0.004 \
  --end_learning_rate=0.0003 \
  --hidden_dropout_prob=0.1 \
  --attention_probs_dropout_prob=0.1 \
  --gradient_accumulation_steps=4 \
  --learning_rate_decay_power=0.5 \
  --warmup_steps=625 \
  --total_steps=${PHASE2_STEPS} \
  --log_frequency=100 \
  --squad_frequency=0 \
  --run_name=${PHASE2_RUN_NAME} \
  --checkpoint_path=${CHECKPOINT_DIR}/${PHASE1_RUN_NAME}-step${PHASE1_STEPS} \
  --name=mybertphase2
```

8. Launch a SageMaker finetuning job.

For SQuAD v1.1:

```bash
python -m albert.launch_sagemaker \
  --source_dir=. \
  --entry_point=albert/run_squad.py \
  --sm_job_name=bert-squadv1 \
  --instance_type=ml.p3dn.24xlarge \
  --instance_count=1 \
  --load_from=checkpoint \
  --checkpoint_path=${CHECKPOINT_DIR}/${PHASE2_RUN_NAME}-step${PHASE2_STEPS} \
  --model_type=bert \
  --per_gpu_batch_size=6 \
  --model_size=base \
  --squad_version=squadv1 \
  --learning_rate=5e-5 \
  --warmup_steps=365 \
  --total_steps=3649 \
  --validation_frequency=10000 \
  --evaluate_frequency=10000 \
  --skip_xla=true
```

For SQuAD v2.0:

```bash
python -m albert.launch_sagemaker \
  --source_dir=. \
  --entry_point=albert/run_squad.py \
  --sm_job_name=bert-squadv2 \
  --instance_type=ml.p3dn.24xlarge \
  --instance_count=1 \
  --load_from=checkpoint \
  --checkpoint_path=${CHECKPOINT_DIR}/${PHASE2_RUN_NAME}-step${PHASE2_STEPS} \
  --model_type=bert \
  --per_gpu_batch_size=6 \
  --model_size=base \
  --squad_version=squadv2 \
  --learning_rate=10.0e-5 \
  --warmup_steps=814 \
  --total_steps=8144 \
  --validation_frequency=10000 \
  --evaluate_frequency=100000 \
  --skip_xla=true
```

9. Enter the Docker container to debug and edit code.

```bash
docker run -it --privileged -v=/fsx:/fsx --gpus=all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm ${IMAGE} /bin/bash
```
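The results table notes that runs with fewer nodes use gradient accumulation to keep the same global batch size per step. Below is a minimal sketch of that scaling, assuming the per-GPU batch size and the 8 GPUs per p3dn.24xlarge node stay fixed; the shell variables are illustrative names, not flags of the launch script.

```bash
# Sketch: choose --gradient_accumulation_steps for a smaller cluster so the
# global batch size per step matches the 8-node configuration.
TARGET_GLOBAL_BATCH=32768   # Phase 1 target (use 8192 for Phase 2)
PER_GPU_BATCH_SIZE=128      # --per_gpu_batch_size used in Phase 1
GPUS_PER_NODE=8             # V100s per p3dn.24xlarge
INSTANCE_COUNT=2            # e.g. the 2-node row of the table

GRAD_ACCUM_STEPS=$((TARGET_GLOBAL_BATCH / (PER_GPU_BATCH_SIZE * GPUS_PER_NODE * INSTANCE_COUNT)))
echo "Pass --instance_count=${INSTANCE_COUNT} --gradient_accumulation_steps=${GRAD_ACCUM_STEPS}"
# 1 node -> 32, 2 nodes -> 16, 4 nodes -> 8, 8 nodes -> 4 (the value used above)
```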