# Custom image classification example

This notebook is based on the [example created by Gabe Hollombe](https://github.com/gabehollombe-aws/jupyter-notebooks). It has been simplified and modified for a defect detection example project. For a more in-depth explanation of the image classification process, see the original notebook.
Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0 


## Step 1 - Specify training data location

Provide an Amazon S3 bucket containing the image dataset, with the images organized into one folder per class:

```
images_to_classify
├── defect_free
│ ├── 1.jpg
│ ├── 2.jpg
│ ├── 3.jpg
│ └── . . .
├── defective
│ ├── 1.jpg
│ ├── 2.jpg
│ ├── 3.jpg
│ └── . . .
└── . . .
```

In [None]:


# The S3 bucket name where the image dataset is located
data_bucket_name='SomeBucket'

# The name of the folder containing the dataset
dataset_name = 'images_to_classify'



## Step 2 - Set up environment

The following cell imports the necessary Python libraries, fetches the execution role for the notebook instance, and looks up the container image for the built-in image-classification algorithm.

In [None]:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()
sess = sagemaker.Session()

training_image = get_image_uri(sess.boto_region_name, 'image-classification', repo_version="latest")

# Find im2rec in our environment and set up the environment variables used by later shell cells

base_dir='/tmp'

%env BASE_DIR=$base_dir
%env S3_DATA_BUCKET_NAME = $data_bucket_name
%env DATASET_NAME = $dataset_name

import sys,os

suffix='/mxnet/tools/im2rec.py'
im2rec = list(filter( (lambda x: os.path.isfile(x + suffix )), sys.path))[0] + suffix
%env IM2REC=$im2rec
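
The one-line lookup above raises a bare `IndexError` when no `sys.path` entry contains `mxnet/tools/im2rec.py`. The following is a minimal sketch of a more defensive lookup. `find_on_sys_path` is a hypothetical helper introduced here for illustration; it is not part of the original notebook or of MXNet.

```python
import os
import sys


def find_on_sys_path(relative_path, search_paths=None):
    """Return the first existing file formed by joining relative_path
    to each search path, or None if no such file exists."""
    if search_paths is None:
        search_paths = sys.path
    for entry in search_paths:
        candidate = os.path.join(entry, relative_path)
        if os.path.isfile(candidate):
            return candidate
    return None


# Same relative location the notebook searches for.
im2rec_path = find_on_sys_path(os.path.join("mxnet", "tools", "im2rec.py"))
if im2rec_path is None:
    print("im2rec.py not found on sys.path; is MXNet installed in this kernel?")
else:
    print("Found im2rec at", im2rec_path)
```

Returning `None` instead of indexing into an empty list lets the notebook print a readable message before any later cell fails.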


## Step 3 - Download image dataset from Amazon S3

In [None]:
# Pull our images from S3
!aws s3 sync s3://$S3_DATA_BUCKET_NAME/$DATASET_NAME $BASE_DIR/$DATASET_NAME 
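
After the sync finishes, it can help to confirm that every class folder arrived with images in it. The sketch below counts image files per class folder. The helper name `count_images_per_class` and the list of extensions are assumptions made for illustration, and the path assumes the `base_dir` and `dataset_name` values set earlier.

```python
import os


def count_images_per_class(dataset_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Map each class folder name to the number of image files directly inside it."""
    counts = {}
    for entry in sorted(os.scandir(dataset_dir), key=lambda e: e.name):
        if entry.is_dir():
            counts[entry.name] = sum(
                1
                for f in os.scandir(entry.path)
                if f.is_file() and f.name.lower().endswith(extensions)
            )
    return counts


# Assumed location, matching base_dir and dataset_name from the earlier cells.
dataset_dir = os.path.join("/tmp", "images_to_classify")
if os.path.isdir(dataset_dir):
    for class_name, n in count_images_per_class(dataset_dir).items():
        print(f"{class_name}: {n} images")
```

A class with zero images, or a much smaller count than the others, is worth fixing before training.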

## Step 4 - Create RecordIO files from images

The images need to be converted into RecordIO files using im2rec.py. These files contain the binary image data indexed by class, which is inferred from the folder structure. Two files are created: one for training and one for validation.

For more information see https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-inputoutput
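
To make the idea of "class inferred from the folder structure" concrete, here is a small pure-Python sketch that assigns an integer label to each folder and splits the entries 70/30. It only mimics the tab-separated layout of an LST file (index, label, relative path). It is not im2rec itself, and `build_lst_lines`, the seed, and the sample file names are all assumptions for illustration.

```python
import random


def build_lst_lines(images_by_class, train_ratio=0.7, seed=0):
    """Label classes by sorted folder name, shuffle, and split into train/test lists."""
    entries = []
    for label, class_name in enumerate(sorted(images_by_class)):
        for file_name in sorted(images_by_class[class_name]):
            entries.append((label, f"{class_name}/{file_name}"))

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(entries)

    # Tab-separated columns: running index, label, relative image path.
    lines = [f"{i}\t{label}\t{path}" for i, (label, path) in enumerate(entries)]
    split = round(len(lines) * train_ratio)
    return lines[:split], lines[split:]


# Hypothetical dataset contents.
images = {
    "defect_free": ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg"],
    "defective": ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg"],
}
train_lines, test_lines = build_lst_lines(images)
print(len(train_lines), "training entries,", len(test_lines), "test entries")
```

In the notebook, the real lists are produced by im2rec in the next cell; this sketch is only meant to show what the two list files represent.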

In [None]:
%%bash
# Use the IM2REC script to convert our images into RecordIO files

# Clean up our working dir of existing LST and REC files
cd $BASE_DIR
# -f keeps rm quiet on a first run, when no REC or LST files exist yet
rm -f *.rec
rm -f *.lst

# First we need to create two LST files (training and test lists), noting the correct label class for each image
# We'll also save the output of the LST files command, since it includes a list of all of our label classes
echo "Creating LST files"
python $IM2REC --list --recursive --pass-through --test-ratio=0.3 --train-ratio=0.7 $DATASET_NAME $DATASET_NAME > ${DATASET_NAME}_classes

echo "Label classes:"
cat ${DATASET_NAME}_classes

# Then we create RecordIO files from the LST files
echo "Creating RecordIO files"
python $IM2REC --num-thread=4 ${DATASET_NAME}_train.lst $DATASET_NAME
python $IM2REC --num-thread=4 ${DATASET_NAME}_test.lst $DATASET_NAME
ls -lh *.rec

## Step 5 - Upload RecordIO files to S3

In [None]:
# Upload our train and test RecordIO files to S3 in the bucket that our sagemaker session is using
bucket = sess.default_bucket()

s3train_path = 's3://{}/{}/train/'.format(bucket, dataset_name)
s3validation_path = 's3://{}/{}/validation/'.format(bucket, dataset_name)

# Clean up any existing data
!aws s3 rm s3://{bucket}/{dataset_name}/train --recursive
!aws s3 rm s3://{bucket}/{dataset_name}/validation --recursive

# Upload the rec files to the train and validation channels
!aws s3 cp {base_dir}/{dataset_name}_train.rec $s3train_path
!aws s3 cp {base_dir}/{dataset_name}_test.rec $s3validation_path

## Step 6 - Configure training model

This example runs without any changes. The batch size can be adjusted to suit the size of your dataset.

In [None]:
# The mini-batch size should be smaller than the number of samples in each class.
batch_size = 2

train_data = sagemaker.session.s3_input(
 s3train_path, 
 distribution='FullyReplicated', 
 content_type='application/x-recordio', 
 s3_data_type='S3Prefix'
)

validation_data = sagemaker.session.s3_input(
 s3validation_path, 
 distribution='FullyReplicated', 
 content_type='application/x-recordio', 
 s3_data_type='S3Prefix'
)

data_channels = {'train': train_data, 'validation': validation_data}

s3_output_location = 's3://{}/{}/output'.format(bucket, dataset_name)

image_classifier = sagemaker.estimator.Estimator(
 training_image,
 role, 
 train_instance_count=1, 
 train_instance_type='ml.p3.2xlarge',
 output_path=s3_output_location,
 sagemaker_session=sess
)

num_classes=! ls -l {base_dir}/{dataset_name} | wc -l
num_classes=int(num_classes[0]) - 1

num_training_samples=! cat {base_dir}/{dataset_name}_train.lst | wc -l
num_training_samples = int(num_training_samples[0])

# Learn more about the SageMaker built-in Image Classifier hyperparameters here: https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html

# These are hyperparameters we won't want to change, as they define things like
# the size of the images we'll be sending for input, the number of training classes we have, etc.
base_hyperparameters=dict(
 use_pretrained_model=1,
 image_shape='3,224,224',
 num_classes=num_classes,
 num_training_samples=num_training_samples,
)

# These are hyperparameters we may want to tune, as they can affect the model training success:
hyperparameters={
 **base_hyperparameters, 
 **dict(
 learning_rate=0.001,
 mini_batch_size= batch_size,
 )
}


image_classifier.set_hyperparameters(**hyperparameters)

hyperparameters
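
The `ls | wc -l` counts above depend on the exact shape of the shell output and would also count stray files as classes. As a hedged alternative, the sketch below computes the same two numbers in Python and checks the batch size against the training set size. The helper names are hypothetical, and the check assumes that a mini-batch should not be larger than the training set.

```python
import os


def count_class_folders(dataset_dir):
    """Count only subdirectories, so stray files do not inflate the class count."""
    return sum(1 for entry in os.scandir(dataset_dir) if entry.is_dir())


def count_lst_entries(lst_path):
    """Count non-empty lines in an LST file; each line describes one image."""
    with open(lst_path) as f:
        return sum(1 for line in f if line.strip())


def check_batch_size(batch_size, num_training_samples):
    """Raise if a mini-batch would need more images than the training set holds."""
    if batch_size > num_training_samples:
        raise ValueError(
            f"mini_batch_size={batch_size} exceeds "
            f"num_training_samples={num_training_samples}"
        )


# Assumed locations, matching base_dir and dataset_name from the earlier cells.
dataset_dir = os.path.join("/tmp", "images_to_classify")
lst_path = os.path.join("/tmp", "images_to_classify_train.lst")
if os.path.isdir(dataset_dir) and os.path.isfile(lst_path):
    n_classes = count_class_folders(dataset_dir)
    n_samples = count_lst_entries(lst_path)
    check_batch_size(2, n_samples)
    print(n_classes, "classes,", n_samples, "training samples")
```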



## Step 7 - Start training job

This cell submits a training job to Amazon SageMaker. When the job completes, the trained model is uploaded to Amazon S3, in the default bucket that SageMaker generated for this session.

In [None]:
%%time

import time
now = str(int(time.time()))
training_job_name = 'IC-' + dataset_name.replace('_', '-') + '-' + now

image_classifier.fit(inputs=data_channels, job_name=training_job_name, logs=True)

job = image_classifier.latest_training_job
model_path = f"{base_dir}/{job.name}"

print(f"\n\n Finished training! The model is available for download at: {image_classifier.output_path}/{job.name}/output/model.tar.gz")
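
The job name above replaces underscores with hyphens because training job names do not accept underscores. The sketch below generalizes that step. It assumes the naming constraint documented for SageMaker training jobs (letters, digits, and hyphens, at most 63 characters, starting and ending with a letter or digit), and `make_training_job_name` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re
import time

# Assumed constraint on training job names.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_JOB_NAME_LENGTH = 63


def make_training_job_name(dataset_name, timestamp=None, prefix="IC"):
    """Build '<prefix>-<cleaned dataset name>-<timestamp>' within the length limit."""
    if timestamp is None:
        timestamp = int(time.time())
    # Replace every character outside [a-zA-Z0-9-] with a hyphen.
    cleaned = re.sub(r"[^a-zA-Z0-9-]", "-", dataset_name).strip("-")
    suffix = f"-{timestamp}"
    # Truncate the dataset part so the full name fits in the limit.
    room = max(MAX_JOB_NAME_LENGTH - len(prefix) - 1 - len(suffix), 0)
    name = f"{prefix}-{cleaned[:room].rstrip('-')}{suffix}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name


print(make_training_job_name("images_to_classify", timestamp=1700000000))
```

Validating the name locally gives a clearer error than waiting for the service to reject the request.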

## Step 8 - Deploy Model Endpoint

This cell deploys a REST endpoint that can be invoked with an image to obtain its classification.

In [None]:


%%time
# Deploying a model to an endpoint takes a few minutes to complete

deployed_endpoint = image_classifier.deploy(
 initial_instance_count = 1,
 instance_type = 'ml.t2.medium'
)
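
Once the endpoint is running, each invocation is assumed to return a JSON list with one probability per class, ordered by label index (the order printed under "Label classes" in Step 4). The sketch below shows how such a response could be turned into a class name. `top_prediction` and the example response body are hypothetical; the endpoint call itself is left out because it needs live AWS resources.

```python
import json


def top_prediction(response_body, class_names):
    """Return (class_name, probability) for the highest-scoring class."""
    probabilities = json.loads(response_body)
    if len(probabilities) != len(class_names):
        raise ValueError("expected one probability per class")
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best_index], probabilities[best_index]


# Hypothetical response body for a two-class model.
example_body = b"[0.92, 0.08]"
label, probability = top_prediction(example_body, ["defect_free", "defective"])
print(f"{label} ({probability:.2f})")
```

In this notebook, the response body would come from sending the raw bytes of an image file to `deployed_endpoint`; the exact call depends on the SageMaker SDK version in use.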

