# TensorFlow Keras MNIST Classifier - Local Example

_**Train and export a TF.Keras CNN classifier for (a subset of) the [MNIST digits](https://en.wikipedia.org/wiki/MNIST_database) dataset, performing all storage and computation locally on the notebook.**_

This notebook works well with the `Python 3 (TensorFlow 2.3 Python 3.7 CPU Optimized)` kernel on SageMaker Studio, or `conda_tensorflow2_p37` on classic SageMaker Notebook Instances.

---

The [dataset](https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz) is hosted in the [Registry of Open Data on AWS](https://registry.opendata.aws/fast-ai-imageclas/) and contains PNG images organized in folders by which digit they represent.

>❓*Can you figure out how to re-create this notebook's workflow using SageMaker more effectively?*

## Contents

1. **[Notebook Setup](#Notebook-Setup)**
1. **[Prepare the Data](#Prepare-the-Data)**
1. **[Load the Data From File](#Load-the-Data-From-File)**
1. **[Pre-Process the Data for our CNN](#Pre-Process-the-Data-for-our-CNN)**
1. **[Build a Model](#Build-a-Model)**
1. **[Fit the Model](#Fit-the-Model)**
1. **[Save the Trained Model](#Save-the-Trained-Model)**
1. **[Explore Results](#Explore-Results)**

See the accompanying **Instructions** notebook for more guidance!

## Notebook Setup

As usual, we'll start by installing any extra required libraries and importing dependencies.

> ℹ️ **Troubleshooting install issues:** If you run into `ModuleNotFoundError` below, or the interactive widgets don't render for you:
>
> - Check you ran the below `!pip install` commands, and restarted the notebook kernel before running other code cells.
> - If a module looks like it was installed successfully but is missing from the notebook kernel environment, you may need to use `%pip install` instead, so that it is installed in the right place.
> - You may need to restart your notebook kernel after installing pip libraries, especially if you already `import`ed affected modules or are running on a SageMaker Notebook Instance instead of Studio.
> - Check you have the **`@jupyter-widgets/jupyterlab-manager`** and **`ipycanvas`** JupyterLab widgets installed (Use the puzzle piece icon "Extension Manager" sidebar tab, or click *Settings > Enable Extension Manager* if you don't see it). "Rebuild" JupyterLab if prompted, then save your work and refresh the page once the build is complete. In Studio, you may also need to open a system terminal and run `restart-jupyter-server`.
>
> In practice (and in this workshop's CloudFormation templates), JupyterLab extensions are typically installed via **lifecycle configuration scripts** for [Studio](https://aws.amazon.com/blogs/machine-learning/customize-amazon-sagemaker-studio-using-lifecycle-configurations/) or [Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/notebook-lifecycle-config.html) - rather than manual installs by users.

In [None]:
# First install some libraries which might not be available across all kernels:
!pip install "ipycanvas<0.13" "ipywidgets<8" matplotlib

> ⚠️ ***RESTART your notebook kernel*** before continuing! (Circular arrow button in the toolbar)

If you don't restart the kernel after the install, the interactive drawing widget might not work later.

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import glob
import os

# External Dependencies:
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential

# Local Notebook Utils:
import util

%matplotlib inline

print(f"Using TensorFlow version {tf.__version__}")
print(f"Keras version {tf.keras.__version__}")

## Prepare the Data

Now let's download the image data.

The original MNIST data has 70,000 small 28x28 pixel PNG files (60,000 in the training dataset, and 10,000 in the test dataset). This format is nice and familiar - but a large number of tiny files is inefficient for storage and transfer - so **to keep things performant** we will:

- Download the data to a local temporary folder under `/tmp` (meaning you won't see the files in the left sidebar in SageMaker)
- Sample just a subset of the data to work with.

In [None]:
local_dir = "/tmp/mnist"
training_dir = f"{local_dir}/training"
testing_dir = f"{local_dir}/testing"

# Download the MNIST data from the Registry of Open Data on AWS
!rm -rf {local_dir}
!mkdir -p {local_dir}
!aws s3 cp s3://fast-ai-imageclas/mnist_png.tgz {local_dir} --no-sign-request

# Un-tar the MNIST data, stripping the leading path element; this will leave us with directories
# {local_dir}/testing/ and {local_dir}/training/
!tar zxf {local_dir}/mnist_png.tgz -C {local_dir}/ --strip-components=1 --no-same-owner

# Get the sorted list of PNG files in the per-digit sub-folders of each directory
train_files = sorted(glob.glob(os.path.join(training_dir, "*/*.png")))
test_files = sorted(glob.glob(os.path.join(testing_dir, "*/*.png")))

print(f"Training files: {len(train_files)}")
print(f"Testing files: {len(test_files)}")

# Reduce the data by keeping every Nth file and dropping the rest of the files.
reduction_factor = 2
train_files_to_keep = train_files[::reduction_factor]
test_files_to_keep = test_files[::reduction_factor]

print(f"Training files kept: {len(train_files_to_keep)}")
print(f"Testing files kept: {len(test_files_to_keep)}")

# Delete all the files not to be kept (set difference: all files minus the kept ones)
for fname in set(train_files) - set(train_files_to_keep):
    os.remove(fname)

for fname in set(test_files) - set(test_files_to_keep):
    os.remove(fname)

print("Done!")
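The every-Nth sampling above can be tried on a toy list first. This is a minimal sketch with made-up file names (`fake_files` and `split_keep_drop` are illustrative names, not part of the notebook); it only shows how the slice and the set difference relate, and it touches no files.

```python
def split_keep_drop(files, reduction_factor):
    """Return (kept, dropped) when keeping every Nth entry of a sorted list."""
    files = sorted(files)
    kept = files[::reduction_factor]
    # Everything not kept is what the notebook deletes from disk:
    dropped = sorted(set(files) - set(kept))
    return kept, dropped


# Hypothetical file names, for illustration only:
fake_files = [f"training/0/{i}.png" for i in range(10)]
kept, dropped = split_keep_drop(fake_files, reduction_factor=2)
print(f"kept {len(kept)}, dropped {len(dropped)}")
```

A larger `reduction_factor` keeps fewer files, which should shorten training at the likely cost of some accuracy.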

## Load the Data From File

Now that our images are stored in the `{local_dir}` folder, let's read our training and testing sets in from these files.

```
    {local_dir}
    |----------------.
    `-- testing      `-- training
        |-- 0            |-- 0
        |   `-- 1.png
        |-- 1            |-- 1
        |-- 2            |-- 2
        |-- 3            |-- 3
        |-- 4            |-- 4
        |-- 5            |-- 5
        |-- 6            |-- 6
        |-- 7            |-- 7
        |-- 8            |-- 8
        `-- 9            `-- 9
```

For both the training and testing sets, we'll loop through each folder, taking the target label (`0`-`9`) from the folder name and loading each PNG into an image matrix.

In [None]:
from PIL import Image

labels = sorted(os.listdir(training_dir))
n_labels = len(labels)

x_train = []
y_train = []
x_test = []
y_test = []
print("Loading label ", end="")
for ix_label in range(n_labels):
    label_str = labels[ix_label]
    print(f"{label_str}...", end="")
    trainfiles = filter(
        lambda s: s.endswith(".png"),
        os.listdir(os.path.join(training_dir, label_str)),
    )

    for filename in trainfiles:
        # Can't just use tf.keras.preprocessing.image.load_img(), because it doesn't close its file
        # handles! So get "Too many open files" error... Grr
        with open(os.path.join(training_dir, label_str, filename), "rb") as imgfile:
            x_train.append(
                # Squeeze (drop) that extra channel dimension, to be consistent with prev format:
                np.squeeze(tf.keras.preprocessing.image.img_to_array(Image.open(imgfile)))
            )
        y_train.append(ix_label)

    # Repeat for test data:
    testfiles = filter(
        lambda s: s.endswith(".png"),
        os.listdir(os.path.join(testing_dir, label_str)),
    )

    for filename in testfiles:
        with open(os.path.join(testing_dir, label_str, filename), "rb") as imgfile:
            x_test.append(
                np.squeeze(tf.keras.preprocessing.image.img_to_array(Image.open(imgfile)))
            )
        y_test.append(ix_label)
print()


print("Shuffling trainset...")
train_shuffled = [(x_train[ix], y_train[ix]) for ix in range(len(y_train))]
np.random.shuffle(train_shuffled)

x_train = np.array([datum[0] for datum in train_shuffled])
y_train = np.array([datum[1] for datum in train_shuffled])
train_shuffled = None

print("Shuffling testset...")
test_shuffled = [(x_test[ix], y_test[ix]) for ix in range(len(y_test))]
np.random.shuffle(test_shuffled)

x_test = np.array([datum[0] for datum in test_shuffled])
y_test = np.array([datum[1] for datum in test_shuffled])
test_shuffled = None

print("Done!")
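The shuffling step pairs each image with its label so that both move together. An alternative sketch, assuming the data is already held in NumPy arrays, applies one random permutation of indices to both arrays. `shuffle_in_unison` is an illustrative helper, not part of the notebook.

```python
import numpy as np


def shuffle_in_unison(x, y, seed=None):
    """Shuffle two arrays along their first axis with the same permutation."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(x))
    return x[order], y[order]


# Toy data: 6 fake 2x2 "images" whose pixels all equal their label
y_demo = np.arange(6)
x_demo = np.stack([np.full((2, 2), label) for label in y_demo])
x_shuf, y_shuf = shuffle_in_unison(x_demo, y_demo, seed=0)
print(y_shuf)
```

Indexing with a permutation avoids building an intermediate list of `(image, label)` tuples, which may matter for larger datasets.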

**Before we go ahead**, let's just quickly visualize the data distribution.

In [None]:
print(f"x_train.shape {x_train.shape}; dtype {x_train.dtype}")
print(f"y_train.shape {y_train.shape}; dtype {y_train.dtype}")
print(f"x_test.shape {x_test.shape}; dtype {x_test.dtype}")
print(f"y_test.shape {y_test.shape}; dtype {y_test.dtype}")

fig = plt.figure(figsize=(14, 3))
ax = plt.subplot(1, 2, 1)
plt.hist(x_train.flatten())
ax.set_title("Histogram of Training Image Data")
ax.set_ylabel("Frequency in Training Set")
ax.set_xlabel("Pixel Value")

ax = plt.subplot(1, 2, 2)
plt.hist(y_train)
ax.set_title("Histogram of Training Set Labels")
ax.set_ylabel("Frequency in Training Set")
ax.set_xlabel("Y Label Value")

plt.show()
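The label histogram can also be read as plain counts. Here is a small sketch on made-up labels; in the notebook you would pass the integer `y_train` array as it is at this point, before one-hot encoding.

```python
import numpy as np

# Made-up integer labels standing in for y_train:
y_demo = np.array([0, 1, 1, 2, 2, 2, 9])

# minlength=10 keeps a slot for every digit, even ones absent from the sample
counts = np.bincount(y_demo, minlength=10)
for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} samples")
```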

It looks like the data is pretty evenly distributed between labels 0-9, and our images are encoded as fixed-size 28x28 matrices with pixel values from 0 to 255. Here we will just plot a few examples to get a feel for them:

In [None]:
print("Some example images:")
fig = plt.figure(figsize=(14, 2))
for i in range(5):
    ax = plt.subplot(1, 5, i + 1)
    ax.imshow(x_train[i], cmap="gray")
    ax.set_title(f"Number {y_train[i]}")
plt.show()

## Pre-Process the Data for our CNN

Next, we'll tweak this format for our neural network:

- Normalizing pixel values to improve the numerical conditioning
- One-hot encoding our labels to suit a softmax classifier output of probabilities for each digit
- Adding a channel dimension (as if this were a 3-channel RGB image, except single-channel for grayscale) alongside the existing batch dimension (for processing multiple samples in parallel) and the X and Y axes.

In [None]:
# Since we're actually feeding the images into a neural network this time, we need to pay
# attention to which way around Keras wants the channel dimension:
if K.image_data_format() == "channels_first":
    x_train = np.expand_dims(x_train, 1)
    x_test = np.expand_dims(x_test, 1)
else:
    x_train = np.expand_dims(x_train, len(x_train.shape))
    x_test = np.expand_dims(x_test, len(x_test.shape))

x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255

input_shape = x_train.shape[1:]

print("x_train shape:", x_train.shape)
print("input_shape:", input_shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# convert class vectors to binary class matrices
y_train = tf.keras.utils.to_categorical(y_train, n_labels)
y_test = tf.keras.utils.to_categorical(y_test, n_labels)

print("n_labels:", n_labels)
print("y_train shape:", y_train.shape)
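To make the three steps concrete, here is a NumPy-only sketch on a tiny made-up batch. The one-hot step is written with `np.eye` purely for illustration (the notebook itself uses `tf.keras.utils.to_categorical`), and the channels-last layout is an assumption.

```python
import numpy as np

n_classes = 10
# Two fake 28x28 grayscale images with 0-255 pixel values, plus integer labels
x_raw = np.random.RandomState(0).randint(0, 256, size=(2, 28, 28))
y_raw = np.array([3, 7])

# 1. Normalize pixel values to the 0-1 range
x = x_raw.astype("float32") / 255

# 2. Add a trailing channel dimension (assumes "channels_last")
x = np.expand_dims(x, -1)

# 3. One-hot encode: row i of the identity matrix is the encoding of class i
y = np.eye(n_classes, dtype="float32")[y_raw]

print(x.shape, y.shape)
```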

## Build a Model

At its core, the model is a 2D convolutional network with a softmax output layer that'll yield a confidence score for every possible label (e.g. 10 options for digit = 0 to 9).


In [None]:
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(n_labels, activation="softmax"))

model.compile(
 loss=tf.keras.losses.categorical_crossentropy,
 optimizer=tf.keras.optimizers.Adadelta(),
 metrics=["accuracy"],
)
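It can help to trace how the tensor shape changes through these layers. The sketch below is plain arithmetic, assuming 28x28x1 inputs, the Keras `Conv2D` defaults of `padding="valid"` and stride 1, and 10 output labels. The helper name `conv_valid` is illustrative.

```python
def conv_valid(size, kernel):
    """Output width/height of a stride-1 convolution without padding."""
    return size - kernel + 1


side = 28                              # input images are 28x28x1
side = conv_valid(side, 3)             # Conv2D(32, 3x3)   -> 26x26x32
conv1_params = 3 * 3 * 1 * 32 + 32     # weights + biases
side = conv_valid(side, 3)             # Conv2D(64, 3x3)   -> 24x24x64
conv2_params = 3 * 3 * 32 * 64 + 64
side = side // 2                       # MaxPooling2D(2x2) -> 12x12x64
flat = side * side * 64                # Flatten
dense1_params = flat * 128 + 128       # Dense(128)
dense2_params = 128 * 10 + 10          # Dense(10), assuming 10 labels
total_params = conv1_params + conv2_params + dense1_params + dense2_params

print(f"flattened features: {flat}")
print(f"trainable parameters: {total_params}")
```

If these assumptions hold, the total should agree with the figure reported by `model.summary()`. Note that almost all of the parameters sit in the first `Dense` layer.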

## Fit the Model

Keras makes fitting and evaluating the model straightforward enough: We don't have any fancy hooks, and are happy with the default logging:


In [None]:
%%time
batch_size = 128
epochs = 12

model.fit(
 x_train,
 y_train,
 batch_size=batch_size,
 epochs=epochs,
 shuffle=True,
 verbose=1, # Hint: You might prefer =2 for running in SageMaker!
 validation_data=(x_test, y_test),
)

score = model.evaluate(x_test, y_test, verbose=0)
print(f"Test loss={score[0]}")
print(f"Test accuracy={score[1]}")
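For a feel of how much work each epoch involves, the number of gradient steps follows from the dataset and batch sizes. This is a quick sketch, assuming the 30,000 training images left after halving the original 60,000.

```python
import math

n_train = 60000 // 2   # assumed: original training set with reduction_factor = 2
batch_size = 128
epochs = 12

# The final batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```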

## Save the Trained Model

Keras has a built-in `model.save()` command, which in TensorFlow v2 can directly produce TensorFlow Serving-compatible outputs!

This notebook runs TensorFlow v2, so the cell below simply calls `model.save()`. With TensorFlow v1 the export takes more work (there's a nice blog post on the subject [here](https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/)). Either way, to save you the frustration of figuring it out, we'll give you a hint by saving the model here in TensorFlow Serving-ready format.


In [None]:
# The export folder needs to be empty, or non-existent
!rm -rf data/model/model/1

# You can ignore any Layer.updates(...) warning when running the notebook on TensorFlow < 2.4
model.save(os.path.join("data/model", "model/1"))
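TensorFlow Serving expects each model version in its own numbered sub-folder, which is why the path ends in `model/1`. The sketch below is a hypothetical helper (not part of TensorFlow or this notebook) that checks a folder for the standard SavedModel layout: a `saved_model.pb` file and a `variables` directory. It is demonstrated on an empty stand-in folder rather than a real export.

```python
import os
import tempfile


def looks_like_saved_model(export_dir):
    """Rough check for the files a TensorFlow SavedModel export contains."""
    has_graph = os.path.isfile(os.path.join(export_dir, "saved_model.pb"))
    has_variables = os.path.isdir(os.path.join(export_dir, "variables"))
    return has_graph and has_variables


# Demonstrate on a stand-in folder that mimics the expected layout:
with tempfile.TemporaryDirectory() as tmp:
    version_dir = os.path.join(tmp, "model", "1")
    os.makedirs(os.path.join(version_dir, "variables"))
    before = looks_like_saved_model(version_dir)  # saved_model.pb is missing
    open(os.path.join(version_dir, "saved_model.pb"), "wb").close()
    after = looks_like_saved_model(version_dir)

print(before, after)
```

In the notebook, you could call `looks_like_saved_model("data/model/model/1")` after the save as a quick sanity check.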

## Explore Results

To try out the model we can take a sample image from the test set, predict the label, and plot it:

In [None]:
# Choose an image:
label = "2"
filename = os.listdir(f"{testing_dir}/{label}")[0]

# Load the image (and normalize to 0-1):
img = tf.keras.preprocessing.image.img_to_array(
 Image.open(f"{testing_dir}/{label}/{filename}")
) / 255

# Expand out the "batch" dimension, and send to the model:
result = model.predict(np.expand_dims(img, 0))
print(f"Result confidences: {result}")

# Plot the result:
plt.figure(figsize=(3, 3))
fig = plt.subplot(1, 1, 1)
ax = plt.imshow(np.squeeze(img), cmap="gray")
fig.set_title(f"Predicted Number {np.argmax(result[0])}")
plt.show()
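The `result` array holds one row of confidences per input image. Here is a NumPy-only sketch, with made-up confidence values, of turning such a row into a predicted digit and its confidence.

```python
import numpy as np

# Made-up softmax output for a batch of one image (values sum to 1)
result_demo = np.array([[0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]])

predicted_digit = int(np.argmax(result_demo[0]))
confidence = float(result_demo[0, predicted_digit])

# The three most likely digits, most confident first
top3 = np.argsort(result_demo[0])[::-1][:3].tolist()
print(f"Predicted {predicted_digit} with confidence {confidence:.2f}")
```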

...Or even better, draw your own digit using this interactive widget - then run the cell below to classify it!

In [None]:
# Create/reload the widget:
widget = util.draw.PixelDrawCanvas(pen_size=4)
widget.display()

In [None]:
# Classify the image currently drawn in the widget:
img = widget.data.mean(-1, keepdims=True) / 255 # (Convert full-RGB 0-255 to grayscale 0-1)
result = model.predict(np.expand_dims(img, 0))
print(f"Result confidences: {result}")

# Plot the result:
plt.figure(figsize=(3, 3))
fig = plt.subplot(1, 1, 1)
ax = plt.imshow(np.squeeze(img), cmap="gray")
fig.set_title(f"Predicted Number {np.argmax(result[0])}")
plt.show()
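The grayscale conversion averages over the last (color) axis. Below is a sketch of the shapes involved, assuming the widget returns a 28x28 RGB array of 0-255 values. The actual shape of `widget.data` depends on the local `util` module, so treat that as an assumption.

```python
import numpy as np

# Stand-in for widget.data: an assumed 28x28 RGB drawing with 0-255 values
rgb_demo = np.zeros((28, 28, 3), dtype="uint8")
rgb_demo[10:18, 12:16, :] = 255  # a white rectangle as a fake pen stroke

# Average the color channels, keeping a single channel axis, and scale to 0-1
gray = rgb_demo.mean(-1, keepdims=True) / 255

# Add the leading batch axis that model.predict() expects
batch = np.expand_dims(gray, 0)
print(gray.shape, batch.shape)
```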

All done!
