# Training MNIST using MXNet and Keras on Amazon EKS

This document explains how to build an MNIST model using MXNet and Keras on Amazon EKS. It assumes that you have an EKS cluster available and running. Make sure to have a [GPU-enabled Amazon EKS cluster](eks-gpu.md) ready.

## MNIST Training using MXNet on EKS

In this sample, we'll use the MNIST database of handwritten digits and train the model to recognize any handwritten digit.

1. You can use a pre-built Docker image `rgaut/deeplearning-mxnet:with_mnist_cnn_gpu`. This image uses `763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.4.0-gpu-py27-cu90-ubuntu16.04` as the base image. It comes bundled with MXNet. It also contains the training code and downloads the training and test data sets.

   Alternatively, you can build a Docker image using the Dockerfile in `samples/mnist/training/mxnet/Dockerfile`, supplying an image name of your choice after `-t`:

   ```
   docker image build samples/mnist/training/mxnet/ -t
   ```

   This will create a Docker image that has all the utilities needed to run MNIST.

1. Create a pod that will use this Docker image and run the MNIST training:

   ```
   kubectl create -f samples/mnist/training/mxnet/mxnet.yaml
   ```

1. Check the status of the pod:

   ```
   kubectl get pods -l app=mxnet
   NAME          READY   STATUS      RESTARTS   AGE
   mxnet-mnist   0/1     Completed   0          6m
   ```

1. Check the progress of the training:

   ```
   kubectl logs mxnet-mnist
   Using MXNet backend
   Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
   16384/11490434 [..............................] - ETA: 0s
   24576/11490434 [..............................] - ETA: 35s
   57344/11490434 [..............................] - ETA: 30s
   122880/11490434 [..............................] - ETA: 21s
   303104/11490434 [..............................] - ETA: 11s
   581632/11490434 [>.............................] - ETA: 7s
   1187840/11490434 [==>...........................] - ETA: 3s
   2375680/11490434 [=====>........................] - ETA: 2s
   3948544/11490434 [=========>....................] - ETA: 1s
   5521408/11490434 [=============>................] - ETA: 0s
   7094272/11490434 [=================>............] - ETA: 0s
   8683520/11490434 [=====================>........] - ETA: 0s
   10256384/11490434 [=========================>....] - ETA: 0s
   11493376/11490434 [==============================] - 1s 0us/step
   11501568/11490434 [==============================] - 1s 0us/step
   /usr/local/lib/python2.7/dist-packages/keras/backend/mxnet_backend.py:96: UserWarning: MXNet Backend performs best with `channels_first` format. Using `channels_last` will significantly reduce performance due to the Transpose operations. For performance improvement, please use this API`keras.utils.to_channels_first(x_input)`to transform `channels_last` data to `channels_first` format and also please change the `image_data_format` in `keras.json` to `channels_first`.Note: `x_input` is a Numpy tensor or a list of Numpy tensorRefer to: https://github.com/awslabs/keras-apache-mxnet/tree/master/docs/mxnet_backend/performance_guide.md
     train_symbol = func(*args, **kwargs)

   . . .

   [23:25:30] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   x_train shape: (60000, 28, 28, 1)
   60000 train samples
   10000 test samples
   Train on 60000 samples, validate on 10000 samples
   Epoch 1/12
   128/60000 [..............................] - ETA: 15:12 - loss: 2.3015 - acc: 0.1094
   384/60000 [..............................] - ETA: 5:15 - loss: 2.2646 - acc: 0.1667
   640/60000 [..............................] - ETA: 3:14 - loss: 2.2128 - acc: 0.2437
   896/60000 [..............................] - ETA: 2:22 - loss: 2.1461 - acc: 0.2824
   1152/60000 [..............................] - ETA: 1:53 - loss: 2.0702 - acc: 0.3229
   1408/60000 [..............................] - ETA: 1:34 - loss: 1.9679 - acc: 0.3629
   1664/60000 [..............................] - ETA: 1:22 - loss: 1.8818 - acc: 0.3930
   1920/60000 [..............................] - ETA: 1:12 - loss: 1.8086 - acc: 0.4104
   2176/60000 [>.............................] - ETA: 1:05 - loss: 1.7239 - acc: 0.4370

   . . .

   59776/60000 [============================>.] - ETA: 0s - loss: 0.0398 - acc: 0.9882
   60000/60000 [==============================] - 14s 232us/step - loss: 0.0398 - acc: 0.9882 - val_loss: 0.0262 - val_acc: 0.9904
   Test loss: 0.026189500172245608
   Test accuracy: 0.9904
   MXNet Backend: Successfully exported the model as MXNet model!
   MXNet symbol file - mnist_cnn-symbol.json
   MXNet params file - mnist_cnn-0000.params

   . . .

   Model input data_names and data_shapes are:
   data_names : ['/conv2d_1_input1']
   data_shapes : [DataDesc[/conv2d_1_input1,(128L, 28L, 28L, 1L),float32,NCHW]]

   . . .

   Note: In the above data_shapes, the first dimension represent the batch_size used for model training.
   You can change the batch_size for binding the module based on your inference batch_size.
   ```

   Complete [detailed logs](mxnet_logs.txt).

   A copy of the model is also saved at `samples/mnist/training/mxnet/saved_model`.

## What happened?

- Runs the `python /tmp/mnist_cnn.py` command (specified in the Dockerfile and available at `samples/mnist/training/mxnet/mnist_cnn.py`)
  - Downloads the MNIST training and test data sets from S3
    - Each set has images and labels that identify the image
  - Performs supervised learning
    - Runs 12 epochs using the training data with the specified parameters
    - For each epoch
      - Reads the training data
      - Builds the training model using the specified algorithm
      - Feeds the test data and compares the predictions with the expected output
      - Reports the accuracy, which is expected to improve with each epoch
- Exports the trained model to the `/mnist_model` directory on a worker node. The model consists of the `mnist_cnn-0000.params` and `mnist_cnn-symbol.json` files. These are needed for inference.
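
The last progress line of each epoch in the training log carries both training and validation metrics, so the per-epoch accuracy can be extracted mechanically. The following is a minimal sketch, assuming the log has first been saved as text (for example by redirecting the output of `kubectl logs mxnet-mnist` to a file). The function name `epoch_summaries` is hypothetical and is not part of the sample.

```python
import re

# Keras prints metrics either as fixed-point (0.0398) or, for very small
# values, in scientific notation (1.2000e-04), so accept both forms.
NUMBER = r"([0-9][0-9.eE+-]*)"

# Only the final progress line of an epoch includes val_loss and val_acc, e.g.
# "... - loss: 0.0398 - acc: 0.9882 - val_loss: 0.0262 - val_acc: 0.9904"
EPOCH_END = re.compile(
    r"loss: " + NUMBER + r" - acc: " + NUMBER
    + r" - val_loss: " + NUMBER + r" - val_acc: " + NUMBER
)

METRIC_NAMES = ("loss", "acc", "val_loss", "val_acc")


def epoch_summaries(log_text):
    """Return one dict of float metrics per completed epoch found in log_text."""
    summaries = []
    # finditer scans the whole text, so it does not depend on how the
    # progress bar updates were separated in the captured log.
    for match in EPOCH_END.finditer(log_text):
        values = [float(group) for group in match.groups()]
        summaries.append(dict(zip(METRIC_NAMES, values)))
    return summaries
```

Applied to the complete log, this would return a list with one entry per epoch, which makes it easy to check that `val_acc` rises from one epoch to the next.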
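
The note at the end of the training log says that the first dimension of `data_shapes` is the training batch size and can be changed when binding the module for inference. The following is a minimal sketch of that step, assuming MXNet is installed where it runs and that `mnist_cnn-symbol.json` and `mnist_cnn-0000.params` are in the working directory. The helper names `inference_data_shapes` and `load_for_inference` are hypothetical, the input name is copied from the `data_names` line in the log, and running on CPU is an assumption.

```python
# Input name as printed in the training log under data_names.
INPUT_NAME = "/conv2d_1_input1"


def inference_data_shapes(batch_size):
    """Build the data_shapes list for binding with the inference batch size.

    The training log shows (128, 28, 28, 1); only the first dimension changes.
    """
    return [(INPUT_NAME, (batch_size, 28, 28, 1))]


def load_for_inference(prefix="mnist_cnn", epoch=0, batch_size=1):
    """Load the exported symbol and params files and bind them for prediction."""
    # Imported here so the shape helper above can be used without MXNet installed.
    import mxnet as mx

    # Reads <prefix>-symbol.json and <prefix>-0000.params when epoch is 0.
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    module = mx.mod.Module(
        symbol=sym,
        data_names=[INPUT_NAME],
        label_names=None,  # no labels are needed for inference
        context=mx.cpu(),  # assumption: CPU inference; mx.gpu() on a GPU node
    )
    module.bind(for_training=False, data_shapes=inference_data_shapes(batch_size))
    module.set_params(arg_params, aux_params, allow_missing=True)
    return module
```

Calling `load_for_inference(batch_size=1)` would bind the model for single-image prediction instead of the batch of 128 used during training.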