# Distributed Training using TensorFlow and Horovod on Amazon EKS with ImageNet Data

This document explains how to perform distributed training on [Amazon EKS](https://aws.amazon.com/eks/) using TensorFlow and [Horovod](https://github.com/uber/horovod) with the [ImageNet dataset](http://www.image-net.org/). The same steps can be used for any dataset, though.

## Pre-requisites

1. Create an [EKS cluster using GPU with Kubeflow](../../eks-gpu.md).
1. Download and prepare the ImageNet dataset and upload it to your S3 bucket like this:

   ```
   ➜ aws s3 ls s3://eks-dl-benchmark/imagenet/train/
   2019-02-28 12:03:46   56755552 train-00001-of-01024
   2019-02-28 12:03:45   56365180 train-00002-of-01024
   ......
   2019-02-28 12:03:45   56365180 train-01024-of-01024
   ➜ aws s3 ls s3://eks-dl-benchmark/imagenet/validation/
   2019-02-28 12:14:10   19504012 validation-00001-of-00128
   2019-02-28 12:14:10   19624967 validation-00002-of-00128
   ....
   2019-02-28 12:14:10   20063161 validation-00128-of-00128
   ```

   The bucket name can be different, but all data needs to be in the `imagenet` folder. The training data needs to be in the `train` subfolder and the validation data in the `validation` subfolder.
1. Create an [FSx for Lustre](https://aws.amazon.com/fsx/lustre) filesystem and enable data integration with S3. Use the VPC info of the GPU-powered EKS cluster created in the first step to create the filesystem, and note down the file system id after the FSx for Lustre filesystem is created. One way to script this from the AWS CLI is sketched after the Steps section.

   > Note: An FSx for Lustre filesystem lives in a single Availability Zone, so make sure to create a single-AZ EKS cluster. This is specified in the `aws_config/cluster_config.yaml` file during cluster creation.

   ![VPC setup](./fsx-vpc-setup.jpg)

   ![FSx for Lustre S3 integration](./fsx-s3-integration.png)

## Steps

1. Follow these [steps](tensorflow-horovod-synthetic.md#install-mpi-operator) to install mpi-operator.
1. Deploy the [Amazon FSx CSI Plugin](https://www.kubeflow.org/docs/aws/storage/#deploy-the-amazon-fsx-csi-plugin):

   ```
   cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
   export COMPONENT=aws-fsx-csi-driver
   ks generate aws-fsx-csi-driver ${COMPONENT}
   ks apply default -c ${COMPONENT}
   ```
1. Prepare the Persistent Volume (PV), Persistent Volume Claim (PVC), and Storage Class. Go to the FSx console and replace `fsxId` and `dnsName` with your FSx info. A quick way to verify the resulting claim is sketched after the Steps section.

   ```
   cd ${KUBEFLOW_SRC}/${KFAPP}/ks_app
   export COMPONENT=fsx-static-storage
   ks generate aws-fsx-pv-static ${COMPONENT} --fsxId=fs-048xxxx7c25 --dnsName=fs-048xxxx7c25.fsx.us-west-2.amazonaws.com
   ks apply default -c ${COMPONENT}
   ```
1. Prepare the training job. Check [here](tensorflow-horovod-synthetic.md#launch-mpi-training-job) for more details:

   ```
   export JOB_NAME=tf-resnet50-horovod-job
   ks generate mpi-job-custom ${JOB_NAME}
   ks param set ${JOB_NAME} image "seedjeffwan/eks-dl-benchmark:cuda10-tf1.13.1-hvd0.16.0-py3.5"
   ks param set ${JOB_NAME} replicas 2
   ks param set ${JOB_NAME} gpusPerReplica 4
   EXEC="mpirun,-mca,btl_tcp_if_exclude,lo,-mca,pml,ob1,-mca,btl,^openib,--bind-to,none,-map-by,slot,-x,LD_LIBRARY_PATH,-x,PATH,-x,NCCL_DEBUG=INFO,python,models/resnet/tensorflow/train_imagenet_resnet_hvd.py,--batch_size=256,--model=resnet50,--num_batches=300,--fp16,--display_every=50,--lr_decay_mode=poly,--data_dir=/data/imagenet/train"
   ```

   > NOTE: Instead of using synthetic data, the job will read training data from the path given by `--data_dir`.
1. Deploy the training job:

   ```
   ks apply default -c ${JOB_NAME}
   ```
1. Check the pod status and logs (additional status checks are sketched below):

   ```
   POD_NAME=$(kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME},mpi_role_type=launcher -o name)
   kubectl -n kubeflow logs -f ${POD_NAME}
   ```

   Here is a [sample output](logs/tensorflow-horovod-imagenet-log.txt).
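If you prefer to script the FSx for Lustre creation from the prerequisites instead of using the console, something like the following AWS CLI calls should work. This is only a sketch: the subnet ID, security group ID, and storage capacity are placeholder values you need to replace with the ones from your own EKS cluster, and the bucket name matches the example bucket used above.

```
# Create an FSx for Lustre filesystem that imports objects from the ImageNet bucket.
# Import the whole bucket so the imagenet/ prefix shows up under the mount path.
# Replace the subnet and security group with the ones used by your EKS GPU nodes.
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 3600 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --lustre-configuration ImportPath=s3://eks-dl-benchmark \
  --region us-west-2

# Wait until Lifecycle reports AVAILABLE, then note the FileSystemId and DNSName
# for the fsx-static-storage step above.
aws fsx describe-file-systems --region us-west-2 \
  --query 'FileSystems[*].[FileSystemId,Lifecycle,DNSName]'
```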
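After applying the `fsx-static-storage` component, you can sanity-check that the FSx-backed volume and claim were actually created and bound before launching the job. The namespace below assumes the component was applied into the `kubeflow` namespace, as in the log commands above; adjust if your environment differs.

```
# The FSx-backed PersistentVolume should reference the fsx.csi.aws.com driver
kubectl get pv

# The claim created by the fsx-static-storage component should report STATUS "Bound"
kubectl -n kubeflow get pvc
kubectl -n kubeflow describe pvc
```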
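Beyond the launcher logs in the last step, it can help to look at the worker pods and the MPIJob resource itself, for example when pods stay in `Pending` because GPUs or the FSx volume are not available. The commands below assume the mpi-operator registered the `MPIJob` custom resource and that the job runs in the `kubeflow` namespace.

```
# List the launcher and worker pods that belong to this job
kubectl -n kubeflow get pods -l mpi_job_name=${JOB_NAME} -o wide

# Inspect the MPIJob custom resource for status and events
kubectl -n kubeflow describe mpijob ${JOB_NAME}
```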
## Appendix

### Download and Pre-process ImageNet Data

If you work for Amazon, reach out to the authors of this document to get access to the data. Otherwise, follow the instructions below.

1. Download the [ImageNet](http://image-net.org/download-images) dataset and upload it to your S3 bucket. Use the `Download Original Images (for non-commercial research/educational use only)` option.
2. TensorFlow consumes the ImageNet data in a specific format. You can preprocess it by downloading and modifying this script:

   ```
   curl -O https://raw.githubusercontent.com/aws-samples/deep-learning-models/master/utils/tensorflow/preprocess_imagenet.sh
   chmod +x preprocess_imagenet.sh
   ```

   The following values need to be changed:

   * `[your imagenet account]`
   * `[your imagenet access key]`
   * `[PATH TO TFRECORD TRAINING DATASET]`
   * `[PATH TO RESIZED TFRECORD TRAINING DATASET]`
   * `[PATH TO TFRECORD VALIDATION DATASET]`
   * `[PATH TO RESIZED TFRECORD VALIDATION DATASET]`

   Execute the script:

   ```
   ./preprocess_imagenet.sh
   ```

   Once the script finishes, make sure the generated TFRecord files end up in the bucket layout described in the prerequisites; see the sketch after this list for one way to copy them.
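The resized TFRecord output paths you configure in `preprocess_imagenet.sh` depend on your setup. If the script writes them to local disk rather than directly to S3, a copy roughly like the following puts them into the layout the training job expects. The local directories here are hypothetical placeholders, and the bucket name matches the example from the prerequisites.

```
# Hypothetical local output directories; substitute the resized TFRecord paths
# you configured in preprocess_imagenet.sh
aws s3 cp --recursive /data/imagenet/resized/train/ s3://eks-dl-benchmark/imagenet/train/
aws s3 cp --recursive /data/imagenet/resized/validation/ s3://eks-dl-benchmark/imagenet/validation/

# Spot-check that the layout matches what the training job expects
aws s3 ls s3://eks-dl-benchmark/imagenet/train/ | head
aws s3 ls s3://eks-dl-benchmark/imagenet/validation/ | head
```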