# TensorFlow on Amazon EKS

This document explains how to run TensorFlow jobs on [Amazon EKS](https://aws.amazon.com/eks/). It requires Kubeflow to be set up as explained in [Amazon EKS Setup with Kubeflow](eks-gpu.md). The Kubeflow installation creates a `TFJob` custom resource, which makes it easy to run TensorFlow training jobs on Kubernetes. The `tf-*` pods in the output of `kubectl get pods` confirm the installation.

Run the TensorFlow [TfCnn example](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks), which contains implementations of several convolutional models for image classification.

1. Create a Jsonnet representation of the job:

   ```
   export CNN_JOB_NAME=tf-cnn-training
   ks pkg install kubeflow/tf-training
   ks generate tf-job-simple-v1beta1 ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
   ```

   This generates `components/${CNN_JOB_NAME}.jsonnet`, a Jsonnet file that defines the manifest for the TFJob.

2. By default, this manifest is configured to use CPUs. Open `components/${CNN_JOB_NAME}.jsonnet` to customize it. We'll update the workers to use GPUs. The diff between the generated and the updated file is shown below:

   ```
   8c8,9
   < local image = "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3";
   ---
   > local image = "gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3";
   >
   32,34c33,35
   <   "--num_gpus=1",
   <   "--local_parameter_device=cpu",
   <   "--device=cpu",
   ---
   >   "--num_gpus=2",
   >   "--local_parameter_device=gpu",
   >   "--device=gpu",
   39a41,45
   >   resources: {
   >     limits: {
   >       "nvidia.com/gpu": 2
   >     },
   >   },
   58,60c64,66
   <   "--num_gpus=1",
   <   "--local_parameter_device=cpu",
   <   "--device=cpu",
   ---
   >   "--num_gpus=2",
   >   "--local_parameter_device=gpu",
   >   "--device=gpu",
   65a72,76
   >   resources: {
   >     limits: {
   >       "nvidia.com/gpu": 2
   >     },
   >   },
   ```

   This assigns two GPUs per replica to the parameter server and the workers.

3. `ks env list` lists the ksonnet environments available for your application. By default, it shows this output:

   ```
   ks env list
   NAME    OVERRIDE KUBERNETES-VERSION NAMESPACE SERVER
   ====    ======== ================== ========= ======
   default          v1.11.5            default   https://xxx.sk1.us-west-2.eks.amazonaws.com
   ```

   The output shows that the `default` environment is configured to deploy to an EKS cluster. Set an environment variable:

   ```
   KF_ENV=default
   ```

4. Use the updated manifest to create resources on the remote cluster. You can double-check the job spec first with `ks show ${KF_ENV} -c ${CNN_JOB_NAME} > /tmp/${CNN_JOB_NAME}`; an [example](training/tfjob/tf_cnn_training.yaml) of the rendered spec is attached.

   ```
   ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
   INFO Applying tfjobs kubeflow.tf-cnn-training
   INFO Creating non-existent tfjobs kubeflow.tf-cnn-training
   ```

   The output of `kubectl get pods` now includes the training pods:

   ```
   kubectl get pods
   NAME                       READY     STATUS    RESTARTS   AGE
   ...
   tf-cnn-training-ps-0       1/1       Running   0          1m
   tf-cnn-training-worker-0   1/1       Running   0          1m
   ...
   ```

5. Monitor the job:

   ```
   kubectl logs tf-cnn-training-worker-0 -f
   ```
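
   If the worker pod is not producing output yet, you can also inspect the job itself. `TFJob` is a regular custom resource, so the standard `kubectl` verbs work against it (an optional aside; the `tfjobs` resource name is the one reported by `ks apply` above):

   ```
   # List all TFJobs, then check the detailed status of this one
   kubectl get tfjobs
   kubectl describe tfjob ${CNN_JOB_NAME}
   ```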

   The command tails the worker [output](logs/tf-cnn-training-worker-0.log). The launcher starts the benchmark, and TensorFlow discovers both GPUs:

   ```
   INFO|2019-02-13T22:59:36|/opt/launcher.py|48| Launcher started.
   INFO|2019-02-13T22:59:36|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --local_parameter_device=gpu --device=gpu --data_format=NHWC --job_name=worker --ps_hosts=tf-cnn-training-ps-0.kubeflow.svc:2222 --worker_hosts=tf-cnn-training-worker-0.kubeflow.svc:2222 --task_index=0
   INFO|2019-02-13T22:59:36|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --local_parameter_device=gpu --device=gpu --data_format=NHWC --job_name=worker --ps_hosts=tf-cnn-training-ps-0.kubeflow.svc:2222 --worker_hosts=tf-cnn-training-worker-0.kubeflow.svc:2222 --task_index=0
   INFO|2019-02-13T22:59:37|/opt/launcher.py|27| 2019-02-13 22:59:37.489630: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.177179: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.177921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 0 with properties:
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| pciBusID: 0000:00:17.0
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| totalMemory: 15.78GiB freeMemory: 15.35GiB
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.485360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 1 with properties:
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| pciBusID: 0000:00:19.0
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| totalMemory: 15.78GiB freeMemory: 15.37GiB
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1079] Device peer to peer matrix
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1085] DMA: 0 1
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 0: Y Y
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 1: Y Y
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:17.0, compute capability: 7.0)
   INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:19.0, compute capability: 7.0)
   ```
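
   As an optional cross-check (this assumes the `nvidia-smi` binary is present in the benchmark image, as is typical for CUDA-based images), you can confirm from inside the pod that both devices are visible:

   ```
   # Run nvidia-smi inside the running worker pod; it should list two Tesla V100 GPUs
   kubectl exec tf-cnn-training-worker-0 -- nvidia-smi
   ```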

   The worker then registers with the parameter server and runs the benchmark:

   ```
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 2019-02-13 23:02:11.314241: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tf-cnn-training-ps-0.kubeflow.svc:2222}
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 2019-02-13 23:02:11.314280: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222}
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 2019-02-13 23:02:11.317209: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| TensorFlow: 1.5
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Model: resnet50
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Mode: training
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| SingleSess: False
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Batch size: 64 global
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 32 per device
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1']
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Data format: NHWC
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Optimizer: sgd
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Variables: parameter_server
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Sync: True
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| ==========
   INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Generating model
   INFO|2019-02-13T23:02:12|/opt/launcher.py|27| WARNING:tensorflow:From /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
   INFO|2019-02-13T23:02:12|/opt/launcher.py|27| Instructions for updating:
   INFO|2019-02-13T23:02:12|/opt/launcher.py|27| keep_dims is deprecated, use keepdims instead
   INFO|2019-02-13T23:02:18|/opt/launcher.py|27| 2019-02-13 23:02:18.452487: I tensorflow/core/distributed_runtime/master_session.cc:1011] Start master session 5a288b32a61167f7 with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true
   INFO|2019-02-13T23:02:20|/opt/launcher.py|27| Running warm up
   INFO|2019-02-13T23:06:30|/opt/launcher.py|27| Done warm up
   INFO|2019-02-13T23:06:30|/opt/launcher.py|27| Step Img/sec loss
   INFO|2019-02-13T23:06:31|/opt/launcher.py|27| 1 images/sec: 217.5 +/- 0.0 (jitter = 0.0) 10.611
   INFO|2019-02-13T23:06:33|/opt/launcher.py|27| 10 images/sec: 215.8 +/- 3.1 (jitter = 2.6) 8.493
   INFO|2019-02-13T23:06:36|/opt/launcher.py|27| 20 images/sec: 217.1 +/- 1.6 (jitter = 3.0) 8.229
   INFO|2019-02-13T23:06:39|/opt/launcher.py|27| 30 images/sec: 216.8 +/- 1.1 (jitter = 3.1) 8.105
   INFO|2019-02-13T23:06:42|/opt/launcher.py|27| 40 images/sec: 216.8 +/- 0.9 (jitter = 3.1) 7.947
   INFO|2019-02-13T23:06:45|/opt/launcher.py|27| 50 images/sec: 216.9 +/- 0.7 (jitter = 3.3) 7.916
   INFO|2019-02-13T23:06:48|/opt/launcher.py|27| 60 images/sec: 217.8 +/- 0.7 (jitter = 3.5) 8.252
   INFO|2019-02-13T23:06:51|/opt/launcher.py|27| 70 images/sec: 218.0 +/- 0.6 (jitter = 3.3) 7.880
   INFO|2019-02-13T23:06:54|/opt/launcher.py|27| 80 images/sec: 217.8 +/- 0.6 (jitter = 3.2) 7.885
   INFO|2019-02-13T23:06:57|/opt/launcher.py|27| 90 images/sec: 217.6 +/- 0.5 (jitter = 3.5) 7.823
   INFO|2019-02-13T23:07:00|/opt/launcher.py|27| 100 images/sec: 217.7 +/- 0.5 (jitter = 3.3) 7.926
   INFO|2019-02-13T23:07:00|/opt/launcher.py|27| ----------------------------------------------------------------
   INFO|2019-02-13T23:07:00|/opt/launcher.py|27| total images/sec: 217.95
   INFO|2019-02-13T23:07:00|/opt/launcher.py|27| ----------------------------------------------------------------
   INFO|2019-02-13T23:07:01|/opt/launcher.py|80| Finished: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --local_parameter_device=gpu --device=gpu --data_format=NHWC --job_name=worker --ps_hosts=tf-cnn-training-ps-0.kubeflow.svc:2222 --worker_hosts=tf-cnn-training-worker-0.kubeflow.svc:2222 --task_index=0
   INFO|2019-02-13T23:07:01|/opt/launcher.py|84| Command ran successfully sleep for ever.
   ```

More details at [TensorFlow Training](https://www.kubeflow.org/docs/guides/components/tftraining/).
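
When you are done with the job, you can delete it with the same ksonnet component used to create it (a minimal cleanup sketch, assuming `KF_ENV` and `CNN_JOB_NAME` are still set from the earlier steps):

```
# Delete the TFJob and its pods from the cluster
ks delete ${KF_ENV} -c ${CNN_JOB_NAME}
```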