# IoT Greengrass image classification model training and retraining

1. [Part 1: Initial training](#Part-1:-Initial-Training)
    1. [Prerequisites and preprocessing](#Prequisites-and-Preprocessing)
        1. [Permissions and environment variables](#Permissions-and-environment-variables)
        2. [Data preparation](#Data-preparation)
        3. [Create S3 folders for field data](#Create-S3-folders-for-field-data)
    2. [Training parameters](#Training-parameters)
    3. [Training](#Training)
2. [Part 2: Retraining the model](#Part-2:-Retraining-the-model)
    1. [Data preparation](#data-preparation)
    2. [Retraining](#Retraining)



Welcome to the "Machine learning at the edge - using and retraining image classification with AWS IoT Greengrass" notebook. It serves as a companion to the blog post and walks you through, step by step, how to:
1. Configure a model for image classification using the [Caltech-256 dataset](http://www.vision.caltech.edu/Image_Datasets/Caltech256/). 
2. Retrain a model with images you capture on your IoT Greengrass core device.

These two steps correspond to Parts 1 and 2 of the blog post, respectively.

*Note: This notebook is a modified version of Amazon SageMaker's image classification sample notebook. Please refer to the SageMaker example notebooks for more details about using the service.*

## Part 1: Initial training

### Prerequisites and preprocessing

#### Permissions and environment variables

Here we set up the linkage and authentication to AWS services. There are three parts to this:
* The role used to give training and hosting access to your data. It is obtained automatically from the role used to start the notebook.
* The S3 bucket that you want to use for training and model data.
* The Amazon SageMaker image classification Docker image, which does not need to be changed.

In [None]:
%%time
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()
print(role)

sess = sagemaker.Session()
bucket=sess.default_bucket()
print(bucket)

training_image = get_image_uri(boto3.Session().region_name, 'image-classification')

#### Data preparation
The Caltech-256 dataset consists of images from 257 categories (the last one being a clutter category). It has about 30,000 images, with a minimum of 80 and a maximum of about 800 images per category.

We will use a subset of the Caltech-256 dataset for our example (beer-mug, wine-bottle, coffee-mug, soda-can, and clutter). The following cells download the full dataset, extract that subset of categories, and prepare the data in the [lst format](https://mxnet.incubator.apache.org/how_to/recordio.html?highlight=im2rec) (content type: application/x-image).

A .lst file is a tab-separated file with three columns that contains a list of image files. The first column specifies the image index, the second column specifies the class label index for the image, and the third column specifies the relative path of the image file. The image index in the first column should be unique across all of the images. Here we make an image list file using the [im2rec](https://github.com/apache/incubator-mxnet/blob/master/tools/im2rec.py) tool from MXNet. To train with the lst format interface, you must pass a .lst file for both the training and the validation channel.
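To make the three-column layout concrete, here is a minimal sketch that writes and parses .lst lines in plain Python. The helper names `write_lst` and `read_lst` and the sample entries are hypothetical and only for illustration; in this notebook the real files are generated by im2rec.py.

```python
import csv
import io

def write_lst(rows, stream):
    """Write (index, label, relative_path) rows as tab-separated .lst lines."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    for index, label, path in rows:
        writer.writerow([index, label, path])

def read_lst(stream):
    """Parse .lst lines back into (index, label, relative_path) tuples."""
    rows = []
    for index, label, path in csv.reader(stream, delimiter="\t"):
        # Labels may be written as floats (for example 0.000000), so parse them as such.
        rows.append((int(index), float(label), path))
    return rows

# Hypothetical entries; real ones are generated by im2rec.py.
sample = [
    (0, 0, "000.beer-mug/000_0000.jpg"),
    (1, 1, "001.clutter/001_0000.jpg"),
]
buffer = io.StringIO()
write_lst(sample, buffer)
parsed = read_lst(io.StringIO(buffer.getvalue()))
print(parsed)
```

Each parsed tuple holds the unique image index, the class label index, and the path relative to the image root, in that order.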

In [None]:
import os
import urllib.request

def download(url):
    # Skip the download if the file is already present locally
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

# Caltech-256 image files
download('http://www.vision.caltech.edu/Image_Datasets/Caltech256/256_ObjectCategories.tar')
!tar -xf 256_ObjectCategories.tar

# Tool for creating lst file
download('https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py')

In [None]:
%%bash

# Extract the subset of categories used for this example. We
# will only need beer-mug, coffee-mug, soda-can, wine-bottle, and clutter

# Clean up any existing folders left behind by previous runs
rm -rf category_subset
rm -rf caltech_256_train_60

# Reindexes the given folders and sub image files. This
# will be useful when we add more data and/or more
# classes during model retraining
reindex_categories() {
    folder_index=0
    for category_folder in $1/*; do
        category_name=`basename $category_folder | cut -d'.' -f2`
        new_folder_index=`printf '%03d' $folder_index`
        new_folder_name='category_subset/'$new_folder_index'.'$category_name
        mv $category_folder $new_folder_name
        image_index=0
        for image_file in $new_folder_name/*; do
            new_image_name=`printf '%04d' $image_index`
            new_image_name=$new_folder_index'_'$new_image_name'.jpg'
            mv $image_file $new_folder_name/$new_image_name
            ((image_index++))
        done
        ((folder_index++))
    done
}

mkdir -p category_subset

# The caltech dataset is properly formatted for 257 categories. We will
# only be using 5 for our example. Copy the 5 categories to a new folder
# and rename them to have the proper indices in their names (assigned in
# alphabetical order) - i.e.
# 010.beer-mug -> 000.beer-mug (and sub files)
# 257.clutter -> 001.clutter (and sub files)
# 041.coffee-mug -> 002.coffee-mug (and sub files)
cp -r 256_ObjectCategories/010.beer-mug/. category_subset/beer-mug/
cp -r 256_ObjectCategories/041.coffee-mug/. category_subset/coffee-mug/
cp -r 256_ObjectCategories/195.soda-can/. category_subset/soda-can/
cp -r 256_ObjectCategories/246.wine-bottle/. category_subset/wine-bottle/
cp -r 256_ObjectCategories/257.clutter/. category_subset/clutter/
reindex_categories category_subset

# Take 60 images from each category and put them in a folder
# dedicated to training images. Use the remaining images in
# each folder for validation.
mkdir -p caltech_256_train_60
for i in category_subset/*; do
    c=`basename $i`
    mkdir -p caltech_256_train_60/$c
    for j in `ls $i/*.jpg | shuf | head -n 60`; do
        mv $j caltech_256_train_60/$c/
    done
done

python im2rec.py --list --recursive caltech-256-60-train caltech_256_train_60/
python im2rec.py --list --recursive caltech-256-60-val category_subset/
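The split performed by `shuf | head -n 60` above can also be sketched in Python. This is only an illustration under stated assumptions: the function name `split_per_class`, the fixed seed, and the sample file names are hypothetical, and the notebook itself relies on the bash loop.

```python
import random

def split_per_class(files_by_class, n_train=60, seed=0):
    """Return (train, val) dicts; n_train random files per class go to train."""
    rng = random.Random(seed)
    train, val = {}, {}
    for label, files in files_by_class.items():
        shuffled = sorted(files)      # sort first so the seed gives a repeatable order
        rng.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        val[label] = shuffled[n_train:]  # everything left over is used for validation
    return train, val

# Hypothetical file names standing in for one category folder.
files = {"000.beer-mug": ["000_{:04d}.jpg".format(i) for i in range(94)]}
train, val = split_per_class(files)
print(len(train["000.beer-mug"]), len(val["000.beer-mug"]))
```

Note that a category with 60 or fewer images would end up with an empty validation set, both here and in the bash version.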

A sample of the lst file we created can be viewed by running the cell below.

In [None]:
!head -n 15 ./caltech-256-60-val.lst > example.lst
# Use a context manager so the file handle is closed after reading
with open('example.lst', 'r') as f:
    lst_content = f.read()
print(lst_content)

Once the data is available in the correct format for training, the next step is to upload the images and .lst files to your S3 bucket.

In [None]:
# Four channels: train, validation, train_lst, and validation_lst
s3train = 's3://{}/image-classification/train/'.format(bucket)
s3validation = 's3://{}/image-classification/validation/'.format(bucket)
s3train_lst = 's3://{}/image-classification/train_lst/'.format(bucket)
s3validation_lst = 's3://{}/image-classification/validation_lst/'.format(bucket)

In [None]:
# upload the image files to train and validation channels
!aws s3 cp caltech_256_train_60 $s3train --recursive --quiet
!aws s3 cp category_subset $s3validation --recursive --quiet

# upload the lst files to train_lst and validation_lst channels
!aws s3 cp caltech-256-60-train.lst $s3train_lst --quiet
!aws s3 cp caltech-256-60-val.lst $s3validation_lst --quiet

#### Create S3 folders for field data
In Part 2 we will collect data in the field. These images start out unlabeled in the raw_field_data folder in the S3 bucket. You label an image by moving it into the matching category folder under the labeled_field_data folder. The following cell creates placeholders for these folders.

In [None]:
# Folders for S3 field data
s3fielddata = 's3://{}/image-classification/labeled_field_data/'.format(bucket)

# Set up for retraining. empty.tmp is added to each folder so that the
# folder is visible in S3.
!mkdir -p field_data/beer-mug && touch field_data/beer-mug/empty.tmp 
!mkdir -p field_data/coffee-mug && touch field_data/coffee-mug/empty.tmp
!mkdir -p field_data/soda-can && touch field_data/soda-can/empty.tmp
!mkdir -p field_data/wine-bottle && touch field_data/wine-bottle/empty.tmp
!mkdir -p field_data/clutter && touch field_data/clutter/empty.tmp

!aws s3 cp --recursive field_data $s3fielddata
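Labeling by moving an image can be done in the S3 console, or scripted. The following is a minimal sketch of the scripted route, assuming a boto3 S3 client is passed in. The function name `move_to_label` is hypothetical, and the exact key of the raw image (shown in the usage comment) is an assumption, since the raw_field_data prefix is populated outside this notebook.

```python
def move_to_label(s3_client, bucket, key, label,
                  labeled_prefix="image-classification/labeled_field_data/"):
    """Copy one object into the label's folder, then delete the original."""
    filename = key.split("/")[-1]
    new_key = "{}{}/{}".format(labeled_prefix, label, filename)
    # S3 has no native move, so copy to the new key and remove the old one.
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    s3_client.delete_object(Bucket=bucket, Key=key)
    return new_key

# Usage (the raw key below is a hypothetical example):
# import boto3
# move_to_label(boto3.client("s3"), bucket,
#               "image-classification/raw_field_data/capture_01.jpg", "beer-mug")
```

The `label` argument should be one of the folder names created above (beer-mug, coffee-mug, soda-can, wine-bottle, or clutter) so that the retraining cells pick the image up.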

### Training parameters
The following parameters are defined below to configure our training job. These values are consumed in the following section when the training_params object is constructed.

In [None]:
# The algorithm supports multiple network depths (number of layers): 18, 34, 50,
# 101, 152, and 200. For this training, we will use 18 layers.
num_layers = 18
# The input image dimensions,'num_channels, height, width', for the network. It should be 
# no larger than the actual image size. The number of channels should be same as the actual
# image.
image_shape = "3,224,224"
# This is the total number of training samples. It is set to 300 (60 samples * 5 categories)
num_training_samples = 300
# This is the number of output classes for the new dataset: beer-mug, clutter, coffee-mug, soda-can, wine-bottle.
num_classes = 5
# The number of training samples used for each mini batch. In distributed training, the 
# number of training samples used per batch will be N * mini_batch_size where N is the number 
# of hosts on which training is run.
mini_batch_size = 128
# Number of training epochs.
epochs = 6
# Learning rate for training.
learning_rate = 0.01
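# Report the top-k accuracy during training. Note that with only 5 classes, top-5 accuracy
# is always 100%; a smaller value such as 2 gives a more informative metric.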
top_k = 5
# Resize the image before using it for training. The images are resized so that the shortest 
# side is of this parameter. If the parameter is not set, then the training data is used as 
# such without resizing.
resize = 256
# Period to store model parameters (in number of epochs). In this case, we will save parameters
# from epochs 2, 4, and 6.
checkpoint_frequency = 2
# Since we are using transfer learning, we set use_pretrained_model to 1 so that weights can be 
# initialized with pre-trained weights. We aren't using a large number of input samples. Therefore, 
# we can benefit from using transfer learning to leverage pre-trained weights that have been 
# collected on a much larger dataset.
# See: https://docs.aws.amazon.com/sagemaker/latest/dg/IC-HowItWorks.html
use_pretrained_model = 1
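The values of `num_training_samples` and `num_classes` above are hard-coded for the initial 300-image dataset. As a hedged sketch, they could instead be derived from the training .lst file, which is handy once field data changes the counts. The helper name `summarize_lst` and the example lines are hypothetical.

```python
def summarize_lst(lines):
    """Return (num_samples, num_classes) for an iterable of .lst lines."""
    labels = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        _index, label, _path = line.split("\t")
        labels.append(int(float(label)))
    return len(labels), len(set(labels))

# Hypothetical lines in the tab-separated .lst layout.
example = [
    "0\t0.000000\t000.beer-mug/000_0000.jpg",
    "1\t0.000000\t000.beer-mug/000_0001.jpg",
    "2\t1.000000\t001.clutter/001_0000.jpg",
]
print(summarize_lst(example))

# Usage in this notebook (assumes the .lst file generated earlier exists):
# with open("caltech-256-60-train.lst") as f:
#     num_training_samples, num_classes = summarize_lst(f)
```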

### Training
The cell below defines three functions that support the configuration and execution of our training jobs throughout the rest of this notebook (initial training and retraining).

In [None]:
%%time
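import time
import boto3

s3 = boto3.client('s3')
# Note: this rebinds the name `sagemaker` from the SDK module imported earlier
# to the low-level boto3 client used by the functions below.
sagemaker = boto3.client(service_name='sagemaker')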

JOB_NAME_PREFIX = 'greengrass-imageclassification-training'
 
def create_unique_job_name():
 '''
 Creates a job name in the following format:
 greengrass-imageclassification-training-[year]-[month]-[day]-[hour]-[minute]-[second]
 '''
 timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
 job_name = JOB_NAME_PREFIX + timestamp
 return job_name

def create_training_params(unique_job_name):
 '''
 Constructs training parameters for the train function
 below.
 '''
 training_params = \
 {
 # specify the training docker image
 "AlgorithmSpecification": {
 "TrainingImage": training_image,
 "TrainingInputMode": "File"
 },
 "RoleArn": role,
 "OutputDataConfig": {
 "S3OutputPath": 's3://{}/{}/output'.format(bucket, JOB_NAME_PREFIX)
 },
 "ResourceConfig": {
 "InstanceCount": 1,
 "InstanceType": "ml.p2.xlarge",
 "VolumeSizeInGB": 50
 },
 "TrainingJobName": unique_job_name,
 "HyperParameters": {
 "image_shape": image_shape,
 "num_layers": str(num_layers),
 "num_training_samples": str(num_training_samples),
 "num_classes": str(num_classes),
 "mini_batch_size": str(mini_batch_size),
 "epochs": str(epochs),
 "learning_rate": str(learning_rate),
 "top_k": str(top_k),
 "resize": str(resize),
 "checkpoint_frequency": str(checkpoint_frequency),
 "use_pretrained_model": str(use_pretrained_model) 
 },
 "StoppingCondition": {
 "MaxRuntimeInSeconds": 360000
 },
 # Training data should be inside a subdirectory called "train".
 # Validation data should be inside a subdirectory called "validation".
 # The algorithm currently supports only the FullyReplicated mode (where data is copied onto each machine).
 "InputDataConfig": [
 {
 "ChannelName": "train",
 "DataSource": {
 "S3DataSource": {
 "S3DataType": "S3Prefix",
 "S3Uri": s3train,
 "S3DataDistributionType": "FullyReplicated"
 }
 },
 "ContentType": "application/x-image",
 "CompressionType": "None"
 },
 {
 "ChannelName": "validation",
 "DataSource": {
 "S3DataSource": {
 "S3DataType": "S3Prefix",
 "S3Uri": s3validation,
 "S3DataDistributionType": "FullyReplicated"
 }
 },
 "ContentType": "application/x-image",
 "CompressionType": "None"
 },
 {
 "ChannelName": "train_lst",
 "DataSource": {
 "S3DataSource": {
 "S3DataType": "S3Prefix",
 "S3Uri": s3train_lst,
 "S3DataDistributionType": "FullyReplicated"
 }
 },
 "ContentType": "application/x-image",
 "CompressionType": "None"
 },
 {
 "ChannelName": "validation_lst",
 "DataSource": {
 "S3DataSource": {
 "S3DataType": "S3Prefix",
 "S3Uri": s3validation_lst,
 "S3DataDistributionType": "FullyReplicated"
 }
 },
 "ContentType": "application/x-image",
 "CompressionType": "None"
 }
 ]
 }
 return training_params

def train(job_name, training_params):
    '''
    Creates a training job, job_name, configured with
    training_params, and waits for it to finish.
    '''
    # create the Amazon SageMaker training job
    sagemaker.create_training_job(**training_params)

    # confirm that the training job has started
    status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
    print('Training job current status: {}'.format(status))

    try:
        # wait for the job to finish and report the ending status
        sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
        training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
        status = training_info['TrainingJobStatus']
        print("Training job ended with status: " + status)
    except Exception:
        # the waiter raises if the job fails or the wait times out
        print('Training did not complete successfully')
        training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
        # FailureReason is only present when the job has actually failed
        message = training_info.get('FailureReason', 'no failure reason reported')
        print('Training failed with the following error: {}'.format(message))

In [None]:
# Create a training job and execute
initial_training_job_name = create_unique_job_name()
initial_training_params = create_training_params(initial_training_job_name)
print('Training job name: {}'.format(initial_training_job_name))

train(initial_training_job_name, initial_training_params)
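Because `create_unique_job_name` builds the job name from a prefix and a timestamp, it is worth checking that the result stays within the naming rules for training jobs. The sketch below is an illustration only: `make_job_name` is a hypothetical variant of the notebook's function, and the 63-character limit and character pattern reflect our understanding of the CreateTrainingJob API reference, so verify them against the current documentation.

```python
import re
import time

# Assumed constraint on TrainingJobName: 1-63 characters, alphanumerics and
# hyphens, starting and ending with an alphanumeric character.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")

def make_job_name(prefix, now=None):
    """Append a UTC timestamp to prefix and check the result is a valid name."""
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", now or time.gmtime())
    name = "{}-{}".format(prefix, stamp)
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError("invalid training job name: {}".format(name))
    return name

# A fixed time is used here so the output is reproducible.
print(make_job_name("greengrass-imageclassification-training", time.gmtime(0)))
```

With the prefix used in this notebook the name is 59 characters long, so there is little room left for a longer prefix.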

You can monitor the status of the training job by running the code below. You can also view a training job's details and status in the Amazon SageMaker console. Just click on the "Jobs" tab.

In [None]:
training_info = sagemaker.describe_training_job(TrainingJobName=initial_training_job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)
print(training_info)
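If you prefer to poll rather than rerun the cell by hand, a loop along the following lines is one option. It is a sketch, not part of the original notebook: `wait_for_job` and its parameters are hypothetical, and it takes the describe call as an argument so it can be used with the boto3 client defined earlier.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}

def wait_for_job(describe, job_name, poll_seconds=30, max_polls=240):
    """Poll describe(TrainingJobName=...) until the job reaches a terminal status."""
    for _ in range(max_polls):
        info = describe(TrainingJobName=job_name)
        status = info["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            # FailureReason is only present for failed jobs, hence .get()
            return status, info.get("FailureReason")
        time.sleep(poll_seconds)
    raise TimeoutError("job {} still running after {} polls".format(job_name, max_polls))

# Usage in this notebook:
# status, reason = wait_for_job(sagemaker.describe_training_job, initial_training_job_name)
```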

If you see the message,

> `Training job ended with status: Completed`

then training completed successfully and the output model was stored in the output path specified by `initial_training_params['OutputDataConfig']`.
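The exact location of the model archive is reported in the `describe_training_job` response under `ModelArtifacts`. A minimal sketch for reading it follows; the helper name `model_artifact_uri` is hypothetical, and the bucket and job name in the example response are placeholders.

```python
def model_artifact_uri(training_info):
    """Return the S3 URI of the model archive from a describe_training_job response."""
    if training_info.get("TrainingJobStatus") != "Completed":
        raise RuntimeError("training job has not completed")
    return training_info["ModelArtifacts"]["S3ModelArtifacts"]

# Hypothetical response fragment; bucket and job name are placeholders.
example = {
    "TrainingJobStatus": "Completed",
    "ModelArtifacts": {
        "S3ModelArtifacts": (
            "s3://example-bucket/greengrass-imageclassification-training"
            "/output/example-job/output/model.tar.gz"
        )
    },
}
print(model_artifact_uri(example))

# Usage in this notebook:
# print(model_artifact_uri(training_info))
```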

***This is the end of Part 1. Please return to the blog post and continue from there.***

## Part 2: Retraining the model

At this point we have an IoT Greengrass core device capable of capturing images, performing inference, and uploading results to S3. In part 2, we will retrain our model to use the new data captured in the field using our IoT Greengrass core device.

Note that in this example we create a new model from a combination of our original and new training data. Alternatively, incremental training can be used. See the [SageMaker Image Classification Algorithm Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html) for more details.
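For reference, the alternative approach would pass the previous model to the training job as an extra input channel. The sketch below shows roughly what that could look like with the parameter dictionary built by `create_training_params`. It is not used in this notebook; the helper name `with_model_channel` is hypothetical, and the channel name and content type reflect our reading of the linked documentation, so check them there before relying on this.

```python
import copy

def with_model_channel(training_params, model_s3_uri):
    """Return a copy of training_params with a 'model' input channel appended."""
    params = copy.deepcopy(training_params)  # leave the caller's dict untouched
    params["InputDataConfig"].append({
        "ChannelName": "model",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": model_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        # Assumed content type for a previously trained SageMaker model.
        "ContentType": "application/x-sagemaker-model",
        "CompressionType": "None",
    })
    return params

# Usage (model_s3_uri would be the model.tar.gz produced in Part 1):
# incremental_params = with_model_channel(re_training_params, model_s3_uri)
```

Architecture-related hyperparameters such as `num_layers`, `image_shape`, and `num_classes` would need to match the model being continued.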

### Data preparation
In this step we will access our S3 bucket and pull down the training data collected in the field. We will add this data to our original dataset and regenerate our training/validation image files.

In [None]:
# sync s3 labeled field data with local field_data folder
!aws s3 sync $s3fielddata ./field_data

# remove empty.tmp from the local field_data folder
!rm -f field_data/beer-mug/empty.tmp
!rm -f field_data/coffee-mug/empty.tmp
!rm -f field_data/soda-can/empty.tmp
!rm -f field_data/wine-bottle/empty.tmp
!rm -f field_data/clutter/empty.tmp
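The next cell merges the field data into the category folders with `cp -r`. The same merge can be sketched in Python, which makes it easier to skip placeholders and count what was added. The function name `merge_field_data` and the demo file names are hypothetical; the demo runs in a temporary directory so it does not touch the notebook's folders.

```python
import os
import shutil
import tempfile

def merge_field_data(field_dir, subset_dir, placeholder="empty.tmp"):
    """Copy labeled field images into the matching category folders."""
    copied = 0
    for category in sorted(os.listdir(field_dir)):
        source = os.path.join(field_dir, category)
        if not os.path.isdir(source):
            continue
        target = os.path.join(subset_dir, category)
        os.makedirs(target, exist_ok=True)
        for name in sorted(os.listdir(source)):
            if name == placeholder:
                continue  # skip the S3 folder placeholder
            shutil.copy2(os.path.join(source, name), os.path.join(target, name))
            copied += 1
    return copied

# Demo on throwaway directories so nothing in the notebook is touched.
with tempfile.TemporaryDirectory() as root:
    field = os.path.join(root, "field_data", "beer-mug")
    os.makedirs(field)
    for name in ("empty.tmp", "capture_01.jpg"):
        open(os.path.join(field, name), "w").close()
    subset = os.path.join(root, "category_subset")
    print(merge_field_data(os.path.join(root, "field_data"), subset))
```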

In [None]:
%%bash
# Reindexes the given folders and sub image files. This
# will be useful when we add more data and/or more
# classes during model retraining
reindex_categories() {
    folder_index=0
    for category_folder in $1/*; do
        category_name=`basename $category_folder | cut -d'.' -f2`
        new_folder_index=`printf '%03d' $folder_index`
        new_folder_name='category_subset/'$new_folder_index'.'$category_name
        mv $category_folder $new_folder_name
        image_index=0
        for image_file in $new_folder_name/*; do
            new_image_name=`printf '%04d' $image_index`
            new_image_name=$new_folder_index'_'$new_image_name'.jpg'
            mv $image_file $new_folder_name/$new_image_name
            ((image_index++))
        done
        ((folder_index++))
    done
}

# Clean up any existing folders left behind by previous runs
rm -rf category_subset
rm -rf caltech_256_train_60

# Copy over category subset again
mkdir -p category_subset
cp -r 256_ObjectCategories/010.beer-mug/. category_subset/beer-mug/
cp -r 256_ObjectCategories/041.coffee-mug/. category_subset/coffee-mug/
cp -r 256_ObjectCategories/195.soda-can/. category_subset/soda-can/
cp -r 256_ObjectCategories/246.wine-bottle/. category_subset/wine-bottle/
cp -r 256_ObjectCategories/257.clutter/. category_subset/clutter/

# Copy contents of field data into category subset
cp -r field_data/beer-mug/. category_subset/beer-mug/
cp -r field_data/coffee-mug/. category_subset/coffee-mug/
cp -r field_data/soda-can/. category_subset/soda-can/
cp -r field_data/wine-bottle/. category_subset/wine-bottle/
cp -r field_data/clutter/. category_subset/clutter/

reindex_categories category_subset

# Take 60 images from each category and put them in a folder
# dedicated to training images. Use the remaining images in
# each folder for validation.
mkdir -p caltech_256_train_60
for i in category_subset/*; do
    c=`basename $i`
    mkdir -p caltech_256_train_60/$c
    for j in `ls $i/*.jpg | shuf | head -n 60`; do
        mv $j caltech_256_train_60/$c/
    done
done

python im2rec.py --list --recursive caltech-256-60-train caltech_256_train_60/
python im2rec.py --list --recursive caltech-256-60-val category_subset/
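To see which class index each category receives from `reindex_categories`, the naming scheme can be reproduced in a few lines of Python. This is a sketch for illustration only: the function name `reindex` and the example file names are hypothetical, and it assumes Python's `sorted` orders the category names the same way as the shell glob, which holds for the five names used here.

```python
def reindex(files_by_category):
    """Map category names to indexed folder names and renamed image files."""
    result = {}
    for folder_index, category in enumerate(sorted(files_by_category)):
        # Drop an existing numeric prefix such as "010." if there is one.
        name = category.split(".", 1)[-1]
        folder = "{:03d}.{}".format(folder_index, name)
        result[folder] = [
            "{:03d}_{:04d}.jpg".format(folder_index, image_index)
            for image_index, _ in enumerate(sorted(files_by_category[category]))
        ]
    return result

# Hypothetical input: category folder name -> image files in that folder.
example = {
    "coffee-mug": ["a.jpg"],
    "beer-mug": ["b.jpg", "c.jpg"],
    "clutter": ["d.jpg"],
}
for folder, images in reindex(example).items():
    print(folder, images)
```

The folder index is the class label that im2rec.py writes into the .lst files, so adding a new category changes the labels of every category that sorts after it.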

In [None]:
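# clean up existing training data in S3 (--recursive removes every object under each prefix)
!aws s3 rm $s3train --recursive
!aws s3 rm $s3validation --recursive
!aws s3 rm $s3train_lst --recursive
!aws s3 rm $s3validation_lst --recursive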

# upload the image files to train and validation channels
!aws s3 cp caltech_256_train_60 $s3train --recursive
!aws s3 cp category_subset $s3validation --recursive

# upload the lst files to train_lst and validation_lst channels
!aws s3 cp caltech-256-60-train.lst $s3train_lst
!aws s3 cp caltech-256-60-val.lst $s3validation_lst

### Retraining

In [None]:
# Create a new training job and execute
re_training_job_name = create_unique_job_name()
re_training_params = create_training_params(re_training_job_name)
print('Training job name: {}'.format(re_training_job_name))
print('\nInput Data Location: {}'.format(re_training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))

In [None]:
train(re_training_job_name, re_training_params)

The code in this section can be rerun at any time to generate a new model using the field data uploaded to S3.

**Return to the blog post to continue!**