# DeepSpeed BERT 1.5B

## 0. Prerequisites

## 0.1. Configure AWS account access (Optional)

```
aws configure --profile ${AWS_PROFILE:-default}
```

## 0.2. Set up EKS cluster with DL1 instances

### 0.2.1. Clone aws-do-eks

```
git clone https://github.com/aws-samples/aws-do-eks.git
```

### 0.2.2. Configure cluster

In `eks.conf`, set the configuration format to yaml and point it to `eks-dl1.yaml`:

```
export CONFIG=yaml
export EKS_YAML=./eks-dl1.yaml
```

The file `eks-dl1.yaml` describes the node groups that will be created for the cluster and the EC2 key pair that will be used for the instances. [Create an EC2 key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html) or use one that you already have.

```yaml
publicKeyName: DL1_Key
```

### 0.2.3. Build and run aws-do-eks container

```
./build.sh
./run.sh
./status.sh
./exec.sh
```

### 0.2.4. Create cluster

Within the aws-do-eks container, run:

```
echo "AWS_PROFILE=$AWS_PROFILE"
./eks-create.sh
```

This operation can take up to ~30 minutes.

### 0.2.5. Verify cluster setup

```
kubectl get nodes
```

### 0.2.6. Create shared EFS persistent volume

```
cd /eks/deployment/csi/efs/
./efs-create.sh
./deploy.sh
kubectl apply -f ./efs-pvc.yaml
```

### 0.2.7. Deploy plugins and operators

#### 0.2.7.1. Deploy Habana device plugin

```
cd /eks/deployment/habana-device-plugin
./deploy.sh
```

#### 0.2.7.2. Deploy EFA device plugin

```
cd /eks/deployment/efa-device-plugin
./deploy.sh
```

#### 0.2.7.3. Deploy Kubeflow mpi-operator

```
cd /eks/deployment/kubeflow/mpi-operator
./deploy.sh
```

### 0.2.8. Configure distributed training

```
cd /eks/deployment/distributed-training/pytorch/habana/deepspeed-bert
```

In the file `deepspeed-bert.yaml.template`, set the desired number of workers:

```yaml
Worker:
  replicas: 2
```

The number of workers is the number of instances (nodes) on which you want to run the training.

#### Adjust training hyperparameters (Optional)

You can change the DeepSpeed parameters (such as `train_batch_size` and `train_micro_batch_size_per_gpu`) in the file `scripts/deepspeed_config_bert_1.5b.json`.

Other training parameters can be adjusted in the launch script `scripts/launch_train.sh`:

```
MAX_SEQ_LENGTH=128
MAX_STEPS=155000
LR=0.0015
```

## 1. Build and push deep learning container

```
./1-1-container-build.sh
./1-2-container-push.sh
```

## 2. Download data

Before running the training, you first have to download and pre-process the dataset:

```
./2-1-data-download.sh
./2-2-data-status.sh
./2-3-data-log.sh
```

Downloading and pre-processing the data takes a long time (it can take more than 24 hours).

## 3. Distributed training

### 3.1. Scale up DL1 nodes

Once the data is downloaded and pre-processed, you can prepare the environment for the training task. Scale up to the same number of nodes as the number of workers you set in the `deepspeed-bert.yaml.template` file:

```
eksctl scale nodegroup --cluster=do-eks --nodes=2 --name=dl1
```

Wait until the nodes become available. Continuously check whether they are ready:

```
watch kubectl get nodes
```

Example output:

```
NAME                            STATUS   ROLES    AGE     VERSION
ip-192-168-70-74.ec2.internal   Ready    <none>   1m22s   v1.21.12-eks-5308cf7
ip-192-168-83-89.ec2.internal   Ready    <none>   1m23s   v1.21.12-eks-5308cf7
```

### 3.2. Run training

After the nodes are ready, you can run the training task:

```
./3-1-training-launch.sh
watch ./3-2-training-status.sh
```

When the pods are running, you can check the logs:

```
./3-3-training-logs.sh
```
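The helper scripts above wrap `kubectl`. If you prefer to inspect a pod directly, a minimal sketch (the pod name below is a hypothetical example; use `kubectl get pods` to find the actual names in your cluster):

```
# List the pods created for the training job
kubectl get pods

# Tail the log of one pod directly; replace the hypothetical name below
# with an actual pod name from the listing above
kubectl logs -f deepspeed-bert-launcher
```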
### 3.3. Explore

Optionally, exec into the launcher pod:

```
./3-4-training-exec.sh
```

Optionally, explore TensorBoard:

```
kubectl apply -f ./tensorboard.yaml
```

### 3.4. Stop training

```
./3-5-training-delete.sh
```

### 3.5. Scale down DL1 nodes

```
eksctl scale nodegroup --cluster=do-eks --nodes=0 --name=dl1
```

## 4. Clean up

### 4.1. Delete shared EFS volume

```
kubectl delete -f ./efs-get-data.yaml
cd /eks/deployment/csi/efs
kubectl delete -f ./efs-pvc.yaml
./delete.sh
```

### 4.2. Delete cluster

```
cd /eks
./eks-delete.sh
```
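Optionally, verify that the cleanup completed. A minimal sketch, assuming your kubeconfig context and AWS profile still point at the account and region where the cluster ran:

```
# Before deleting the cluster, confirm the shared EFS volume claim is gone
kubectl get pvc

# After eks-delete.sh finishes, the cluster should no longer be listed
eksctl get cluster
```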