# Distributed Training using TensorFlow and Horovod on Amazon EKS with Synthetic Data

This document explains how to perform distributed training on [Amazon EKS](https://aws.amazon.com/eks/) using TensorFlow and [Horovod](https://github.com/uber/horovod) with a synthetic ImageNet dataset.

## Prerequisite

1. Create an [EKS cluster using GPU with Kubeflow](../../eks-gpu.md)

## Install MPI Operator

1. Install the mpi package:

   ```
   cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
   ks pkg install kubeflow/mpi-job
   ```

1. Install the mpi operator:

   ```
   export MPI_OPERATOR=mpi-operator
   ks generate mpi-operator ${MPI_OPERATOR}
   ks param set ${MPI_OPERATOR} image mpioperator/mpi-operator:0.1.0
   ks apply default -c ${MPI_OPERATOR}
   ```

1. Verify the installation:

   ```
   $ kubectl get crd
   NAME                   CREATED AT
   ...
   mpijobs.kubeflow.org   2019-02-12T22:12:32Z
   ...

   $ kubectl -n kubeflow get deployment ${MPI_OPERATOR}
   NAME           DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
   mpi-operator   1         1         1            1           1m
   ```

## Launch MPI Training Job

1. Generate the job prototype:

   ```
   export JOB_NAME=tf-resnet50-horovod-job
   ks generate mpi-job-custom ${JOB_NAME}
   ```

1. You can leverage [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/) to build a Docker image for Horovod using the Dockerfile from `training/distributed_training/Dockerfile`. You will need to log in to the [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/) repository by running `$(aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884)`, and then build the image with `docker image build -t ${YOUR_DOCKER_HUB_ID}/eks-kubeflow-horovod:latest .`. Alternatively, you can use the `mpioperator/tensorflow-benchmarks:latest` image that already exists on Docker Hub:

   ```
   ks param set ${JOB_NAME} image "mpioperator/tensorflow-benchmarks:latest"
   ```

1. Define the number of workers (containers) and the number of GPUs available per container. Your cluster must have at least `replicas × gpusPerReplica` schedulable GPUs in total (a quick way to check this is shown at the end of this document):

   ```
   ks param set ${JOB_NAME} replicas 2
   ks param set ${JOB_NAME} gpusPerReplica 4
   ```

1. Formulate the MPI command based on the official [Horovod](https://github.com/uber/horovod) documentation. The command is passed as a single comma-separated string; an expanded, line-wrapped form is shown at the end of this document for readability:

   ```
   EXEC="mpirun,-mca,btl_tcp_if_exclude,lo,-mca,pml,ob1,-mca,btl,^openib,--bind-to,none,-map-by,slot,-x,LD_LIBRARY_PATH,-x,PATH,-x,NCCL_DEBUG=INFO,python,scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py,--data_format=NCHW,--batch_size=256,--model=resnet50,--optimizer=sgd,--variable_update=horovod,--data_name=imagenet,--use_fp16"
   ks param set ${JOB_NAME} command ${EXEC}
   ```

   Note that the command above does not set `NCCL_SOCKET_IFNAME`. If you do set it (for example to `eth0`) and use more than one GPU, you may have to replace `NCCL_SOCKET_IFNAME=eth0` with `NCCL_SOCKET_IFNAME=^docker0` so that NCCL does not try to communicate over the Docker bridge interface.

1. Verify your job configuration; it should look like [mpi-job-template.yaml](../../samples/imagenet/distributed_training/mpi-job-template.yaml):

   ```
   ks show default -c ${JOB_NAME}
   ```

1. Deploy the config to your cluster:

   ```
   ks apply default -c ${JOB_NAME}
   ```

1. Check the pod status and dump the launcher logs (additional monitoring commands are shown at the end of this document):

   ```
   POD_NAME=$(kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME},mpi_role_type=launcher -o name)
   kubectl -n kubeflow logs -f ${POD_NAME}
   ```

   Here is a [sample output](logs/tensorflow-horovod-synthetic-log.txt).

1. To delete the job in Kubernetes and remove the job configuration from your application:

   ```
   ks delete default -c ${JOB_NAME}   # delete job in Kubernetes
   ks component rm ${JOB_NAME}        # delete manifest in your ksonnet application
   ```
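
## Additional Commands

For readability, here is the comma-separated `EXEC` string from the training step expanded into an ordinary `mpirun` invocation. This is only an illustration of the same command; the flags follow the Horovod documentation:

```
mpirun -mca btl_tcp_if_exclude lo \
    -mca pml ob1 -mca btl ^openib \
    --bind-to none -map-by slot \
    -x LD_LIBRARY_PATH -x PATH -x NCCL_DEBUG=INFO \
    python scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --data_format=NCHW --batch_size=256 --model=resnet50 \
    --optimizer=sgd --variable_update=horovod \
    --data_name=imagenet --use_fp16
```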
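Before launching, you can confirm that the cluster exposes enough GPUs for `replicas × gpusPerReplica` (8 in the example above). This is a minimal check, assuming the NVIDIA device plugin advertises GPUs under the standard `nvidia.com/gpu` resource name:

```
# Show allocatable GPUs per node; the total across nodes should cover replicas × gpusPerReplica.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```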
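While the job runs, a few more ways to inspect it, sketched from the labels used in the log step above. The `${JOB_NAME}-worker-0` pod name is an assumption based on the mpi-operator's worker naming; use the pod listing to confirm the actual names:

```
# Inspect the MPIJob custom resource created by ks apply.
kubectl -n kubeflow get mpijobs ${JOB_NAME} -o yaml

# List the launcher and worker pods that belong to this job.
kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME}

# Tail logs from an individual worker pod (hypothetical name; check the listing above).
kubectl -n kubeflow logs -f ${JOB_NAME}-worker-0
```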