# ALBERT (A Lite BERT)

TensorFlow 2.1 implementation of pretraining and finetuning scripts for ALBERT, a state-of-the-art language model.

The original paper: [ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations](https://arxiv.org/pdf/1909.11942.pdf)

### Overview

Language models help AWS customers improve search results, text classification, question answering, and customer service routing. BERT and its successive improvements are incredibly powerful, yet complex and costly to pretrain. Here we demonstrate how to train a faster, smaller, more accurate BERT-based model called [ALBERT](https://arxiv.org/abs/1909.11942) on Amazon SageMaker with an FSx filesystem and TensorFlow 2 models from [huggingface/transformers](https://github.com/huggingface/transformers). We can pretrain ALBERT for a fraction of the cost of BERT and achieve better accuracy on SQuAD.

![SageMaker -> EC2 -> FSx infrastructure diagram](https://user-images.githubusercontent.com/4564897/81020280-b207a100-8e25-11ea-8b57-38f0a09a7fb2.png)

### Results

All training is done with sequence length 512.

| Model | Nodes | Global batch size | Batch size per GPU | Gradient accumulation steps | Iterations/sec | Steps | Time-to-train | SQuADv2 F1/EM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| albert-base | 1 | 4096 | 32 | 16 | 0.24 | 125000 | 144 hours | same |
| albert-base | 2 | 4096 | 32 | 8 | 0.46 | 125000 | 75 hours | same |
| albert-base | 4 | 4096 | 32 | 4 | 0.90 | 125000 | 38 hours | same |
| albert-base | 8 | 4096 | 32 | 2 | 1.73 | 125000 | 20 hours | 78.4/75.2 |

### How To Launch Training

All commands should be run from the `models/nlp` directory.

1. Create an FSx volume.

2. Download the datasets onto FSx. The simplest way to start is with English Wikipedia. The structure should be as follows:

```
/fsx
    /deep-learning-models
    /logs
        /albert
            /tensorboard
    /checkpoints
        /albert
    /albert_data
        /train
        /val
```

3. Create an Amazon Elastic Container Registry (ECR) repository. Then build a Docker image from `models/nlp/Dockerfile` and push it to ECR.

```bash
export ACCOUNT_ID=
export REPO=
export IMAGE=${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com/${REPO}:py37_tf211

docker build -t ${IMAGE} .
# AWS-CLI v1
$(aws ecr get-login --no-include-email)
# AWS-CLI v2
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin ${ACCOUNT_ID}.dkr.ecr.us-east-1.amazonaws.com
docker push ${IMAGE}
```

4. Define environment variables to point to the FSx volume and SageMaker resources. For parameters that accept a list, such as security group IDs, use a comma-separated string.

```bash
export SAGEMAKER_ROLE=arn:aws:iam::${ACCOUNT_ID}:role/service-role/AmazonSageMaker-ExecutionRole-20200101T123
export SAGEMAKER_IMAGE_NAME=${IMAGE}
export SAGEMAKER_FSX_ID=fs-123
export SAGEMAKER_FSX_MOUNT_NAME=fsx
export SAGEMAKER_SUBNET_IDS=subnet-123
export SAGEMAKER_SECURITY_GROUP_IDS=sg-123,sg-456
```

5. Define environment variables for the data, log, and checkpoint directories. These paths are relative to the FSx mount, so if your training data were in `/fsx/albert_data/train`, you would use:

```bash
export TRAIN_DIR=albert_data/train
export VAL_DIR=albert_data/val
export LOG_DIR=logs/albert
export CHECKPOINT_DIR=checkpoints/albert
```

6. Launch the SageMaker pretraining job.

```bash
python -m albert.launch_sagemaker \
  --source_dir=. \
  --entry_point=albert/run_pretraining.py \
  --sm_job_name=albert-pretrain \
  --instance_type=ml.p3dn.24xlarge \
  --instance_count=1 \
  --train_dir=${TRAIN_DIR} \
  --val_dir=${VAL_DIR} \
  --log_dir=${LOG_DIR} \
  --checkpoint_dir=${CHECKPOINT_DIR} \
  --load_from=scratch \
  --model_type=albert \
  --model_size=base \
  --per_gpu_batch_size=32 \
  --gradient_accumulation_steps=2 \
  --warmup_steps=3125 \
  --total_steps=125000 \
  --learning_rate=0.00176 \
  --optimizer=lamb \
  --log_frequency=10 \
  --name=myfirstjob
```

7. Launch a SageMaker finetuning job on SQuAD.

```bash
python -m albert.launch_sagemaker \
  --source_dir=. \
  --entry_point=albert/run_squad.py \
  --sm_job_name=albert-squad \
  --instance_type=ml.p3dn.24xlarge \
  --instance_count=1 \
  --train_dir=${TRAIN_DIR} \
  --val_dir=${VAL_DIR} \
  --log_dir=${LOG_DIR} \
  --checkpoint_dir=${CHECKPOINT_DIR} \
  --load_from=scratch \
  --model_type=albert \
  --model_size=base \
  --per_gpu_batch_size=6 \
  --total_steps=8144 \
  --warmup_steps=814 \
  --learning_rate=3e-5 \
  --squad_version=squadv2
```

8. Enter the Docker container to debug and edit code.

```bash
docker run -it --privileged -v=/fsx:/fsx --gpus=all --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 --rm ${IMAGE} /bin/bash
```

### Command-Line Parameters

See `common/arguments.py` for a complete list. Here are the main ones.

Loading from checkpoint:

- `model_type`: One of "albert", "bert", or "electra".
- `model_size`: One of "small", "base", or "large".
- `load_from`: One of "scratch", "checkpoint", or "huggingface". If "checkpoint", then `checkpoint_path` is required.
- `checkpoint_path`: For example, "/fsx/checkpoints/albert/2020...step125000". Do not include the ".ckpt" extension.
- `load_optimizer_state`: One of "true" or "false".
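
As a usage example of the checkpoint flags above, here is a minimal sketch of resuming pretraining from an existing checkpoint instead of starting from scratch. The job name, run name, and checkpoint path below are illustrative placeholders, not values produced by this repository; substitute the prefix of your own checkpoint files. The same `load_from`/`checkpoint_path` flags can also be passed to `albert/run_squad.py` to finetune from a pretrained checkpoint rather than `--load_from=scratch`.

```bash
# Hypothetical example: resume pretraining from a saved checkpoint.
# Replace checkpoint_path with the prefix of your own checkpoint files
# (no .ckpt extension), as described in "Command-Line Parameters".
python -m albert.launch_sagemaker \
  --source_dir=. \
  --entry_point=albert/run_pretraining.py \
  --sm_job_name=albert-pretrain-resume \
  --instance_type=ml.p3dn.24xlarge \
  --instance_count=1 \
  --train_dir=${TRAIN_DIR} \
  --val_dir=${VAL_DIR} \
  --log_dir=${LOG_DIR} \
  --checkpoint_dir=${CHECKPOINT_DIR} \
  --load_from=checkpoint \
  --checkpoint_path=/fsx/checkpoints/albert/myfirstjob-step5000 \
  --load_optimizer_state=true \
  --model_type=albert \
  --model_size=base \
  --per_gpu_batch_size=32 \
  --gradient_accumulation_steps=2 \
  --warmup_steps=3125 \
  --total_steps=125000 \
  --learning_rate=0.00176 \
  --optimizer=lamb \
  --log_frequency=10 \
  --name=myresumedjob
```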
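
The FSx layout in step 2 reserves `logs/albert/tensorboard` for training logs. Assuming the training scripts write TensorBoard event files under that directory and TensorBoard is available in the image (both are assumptions; check `common/arguments.py` and the Dockerfile), a run can be monitored from inside the debug container in step 8:

```bash
# Assumption: event files are written under /fsx/logs/albert/tensorboard.
tensorboard --logdir=/fsx/logs/albert/tensorboard --host=0.0.0.0 --port=6006
```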