# Sign & Speak ML Instructions

This notebook shows how to use Amazon SageMaker to run the training and inference scripts for the Sign & Speak project.

Use the `conda_pytorch_p36` kernel to run the cells in this notebook.

## Training

The following cell defines the training job to be run by Amazon SageMaker. It points to the `grid_train.py` training script, defines the number and types of instances used for training, sets the hyperparameter values, and defines regular expressions which Amazon SageMaker uses to track the training metrics.

Before running this cell, you must provide a descriptive name for the training job and specify the Amazon S3 URI where the output should be stored. The URI should look like `s3://bucket-name/output-folder/`.

*Note: If you are using a new AWS account, you may not have access to p2 instance types yet. The code should run fine on a CPU instance type, but it will require more time to complete. Submit a limit increase request to use p2 instances.*

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch

# Replace the following variables with a descriptive name for the
# training job and the S3 URI where the output should be stored
JOB_NAME = 'INSERT_A_NAME_HERE'
OUTPUT_PATH = 'INSERT_AN_S3_URI_HERE'

role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='grid_train.py',
                    role=role,
                    base_job_name=JOB_NAME,
                    output_path=OUTPUT_PATH,
                    framework_version='1.1.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        "epochs": 10,
                        "batch-size": 4,
                        "gamma": 0.1,
                        "lr": 0.001,
                        "momentum": 0.9,
                        "step-size": 7
                    },
                    metric_definitions=[
                        {'Name': 'train:loss', 'Regex': 'train Loss: (.*?) '},
                        {'Name': 'train:acc', 'Regex': 'train Loss: .*? Acc: (.*?)$'},
                        {'Name': 'val:loss', 'Regex': 'val Loss: (.*?) '},
                        {'Name': 'val:acc', 'Regex': 'val Loss: .*? Acc: (.*?)$'}
                    ]
                    )
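
The metric regular expressions can be checked locally before a training job is started. The sketch below applies the same patterns with Python's `re` module to two sample log lines. The sample lines are an assumption inferred from the patterns themselves; the actual output of `grid_train.py` is not shown in this notebook and may differ. The helper name `extract_metrics` is hypothetical.

```python
import re

# Metric definitions copied from the estimator above.
METRIC_DEFINITIONS = [
    {'Name': 'train:loss', 'Regex': 'train Loss: (.*?) '},
    {'Name': 'train:acc', 'Regex': 'train Loss: .*? Acc: (.*?)$'},
    {'Name': 'val:loss', 'Regex': 'val Loss: (.*?) '},
    {'Name': 'val:acc', 'Regex': 'val Loss: .*? Acc: (.*?)$'},
]

# ASSUMPTION: hypothetical log lines, not taken from a real training run.
SAMPLE_LOG_LINES = [
    'train Loss: 0.6931 Acc: 0.5000',
    'val Loss: 0.4210 Acc: 0.8125',
]


def extract_metrics(lines, definitions=METRIC_DEFINITIONS):
    """Return {metric name: captured text} using the last matching line."""
    found = {}
    for line in lines:
        for definition in definitions:
            match = re.search(definition['Regex'], line)
            if match:
                found[definition['Name']] = match.group(1)
    return found


metrics = extract_metrics(SAMPLE_LOG_LINES)
print(metrics)
```

If a metric name is missing from the printed dictionary, the corresponding pattern did not match any of the sample lines.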

Once the training job has been defined, pass in the Amazon S3 URI for the training data to start the training job. The URI should look like `s3://bucket-name/training-data-folder/`, where `training-data-folder` contains one subfolder per label, each holding the training images for that label.

This cell will output the logs of the training job, but you can also view the logs and visualize the metrics in the Amazon SageMaker console.

In [None]:
estimator.fit({'training': 'INSERT_AN_S3_URI_HERE'})
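
Before uploading training data to Amazon S3, it can help to confirm that a local copy follows the one-folder-per-label layout described above. The following is a minimal sketch using only the standard library. The function name `count_images_per_label` and the list of image extensions are assumptions made for illustration, not part of the Sign & Speak scripts.

```python
from pathlib import Path

# ASSUMPTION: illustrative extensions; adjust to the image formats in your dataset.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}


def count_images_per_label(root):
    """Map each label folder directly under `root` to its number of image files."""
    counts = {}
    for label_dir in sorted(Path(root).iterdir()):
        if label_dir.is_dir():
            counts[label_dir.name] = sum(
                1 for f in label_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts
```

Calling `count_images_per_label('training-data-folder')` on a local copy of the data returns one entry per label, which makes empty or misplaced folders easy to spot.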

## Hyperparameter Tuning

This section shows how to run a hyperparameter tuning job using Amazon SageMaker. First, define the range of values for the hyperparameters that you want to tune.

In [None]:
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter

hyperparameter_ranges = {
    'batch-size': IntegerParameter(3, 30, scaling_type='Auto'),
    'momentum': ContinuousParameter(0.1, 0.9, scaling_type='Auto'),
    'step-size': IntegerParameter(3, 12, scaling_type='Auto'),
    'gamma': ContinuousParameter(0.01, 0.9, scaling_type='Auto')
}
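
The keys in these ranges must match the argument names that the training script accepts, because SageMaker framework containers pass hyperparameters to the entry-point script as command-line arguments such as `--batch-size 16`. The sketch below shows how a script might parse them with `argparse`. It is an assumption about how `grid_train.py` could be structured, since that script is not shown here; the function name `parse_hyperparameters` is hypothetical, and the defaults are the values from the Training section.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse hyperparameters passed as command-line arguments.

    ASSUMPTION: a sketch only; the real grid_train.py may parse its
    arguments differently.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=4)
    parser.add_argument('--gamma', type=float, default=0.1)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--momentum', type=float, default=0.9)
    parser.add_argument('--step-size', type=int, default=7)
    return parser.parse_args(argv)


# Example: values a tuning job might sample from the ranges above.
args = parse_hyperparameters(
    ['--batch-size', '16', '--momentum', '0.5', '--step-size', '5', '--gamma', '0.3']
)
print(args.batch_size, args.momentum, args.step_size, args.gamma)
```

Note that `argparse` exposes `--batch-size` as `args.batch_size`, so the hyphenated names used in the notebook map to underscored attributes in the script.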

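Next, define the estimator for the training jobs that will run during hyperparameter tuning. It matches the estimator in the Training section above, except that `hyperparameters` sets only the values that are not tuned (`epochs` and `lr`).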

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch

# Replace the following variables with a descriptive name for the
# training job and the S3 URI where the output should be stored
JOB_NAME = 'INSERT_A_NAME_HERE'
OUTPUT_PATH = 'INSERT_AN_S3_URI_HERE'

role = sagemaker.get_execution_role()

estimator = PyTorch(entry_point='grid_train.py',
                    role=role,
                    base_job_name=JOB_NAME,
                    output_path=OUTPUT_PATH,
                    framework_version='1.1.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        "epochs": 20,
                        "lr": 0.001
                    },
                    metric_definitions=[
                        {'Name': 'train:loss', 'Regex': 'train Loss: (.*?) '},
                        {'Name': 'train:acc', 'Regex': 'train Loss: .*? Acc: (.*?)$'},
                        {'Name': 'val:loss', 'Regex': 'val Loss: (.*?) '},
                        {'Name': 'val:acc', 'Regex': 'val Loss: .*? Acc: (.*?)$'}
                    ]
                    )

Next, define the hyperparameter tuning job based on the defined hyperparameter ranges. Set the objective metric, the maximum number of training jobs, and the maximum number of parallel training jobs. 

*Note: make sure your AWS account limits allow for the number of parallel training jobs for the instance type defined in the training job.*

In [None]:
from sagemaker.tuner import HyperparameterTuner

TUNING_JOB_NAME = 'INSERT_A_NAME_HERE'

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='val:acc',
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'train Loss: (.*?) '},
        {'Name': 'train:acc', 'Regex': 'train Loss: .*? Acc: (.*?)$'},
        {'Name': 'val:loss', 'Regex': 'val Loss: (.*?) '},
        {'Name': 'val:acc', 'Regex': 'val Loss: .*? Acc: (.*?)$'}
    ],
    strategy='Bayesian',
    objective_type='Maximize',
    max_jobs=30,
    max_parallel_jobs=3,
    base_tuning_job_name=TUNING_JOB_NAME
)

Once the tuning job has been defined, pass in the Amazon S3 URI for the training data to start the tuning job. The URI should look like `s3://bucket-name/training-data-folder/`, where `training-data-folder` contains one subfolder per label, each holding the training images for that label.

View the logs and visualize the metrics for the training jobs linked to this tuning job in the Amazon SageMaker console.

In [None]:
tuner.fit(inputs='INSERT_AN_S3_URI_HERE')

## Deploying

After running some training jobs and/or hyperparameter tuning jobs, choose the training job on which to base your deployment. Find the Amazon S3 URI of its model package, which should look like `s3://bucket-name/training-job-name/output/model.tar.gz`. Insert the URI in the code below.

In [None]:
from sagemaker.predictor import RealTimePredictor, json_serializer, json_deserializer

# Predictor that sends requests as JSON and parses JSON responses
class JSONPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(JSONPredictor, self).__init__(endpoint_name, sagemaker_session, json_serializer, json_deserializer)

In [None]:
from sagemaker.pytorch import PyTorchModel
import sagemaker

role = sagemaker.get_execution_role()
model = PyTorchModel(model_data='INSERT_S3_URI_OF_MODEL_PACKAGE',
                     role=role,
                     framework_version='1.1.0',
                     entry_point='grid_serve.py',
                     predictor_cls=JSONPredictor)
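
The model package URI follows the pattern given above, so it can be assembled from the output path and the full training job name. The helper below is a minimal sketch; `model_artifact_uri` is a hypothetical name, and the full job name (the base name plus the suffix that SageMaker appends) should be copied from the Amazon SageMaker console rather than guessed.

```python
def model_artifact_uri(output_path, training_job_name):
    """Build the S3 URI of model.tar.gz for a finished training job.

    ASSUMPTION: the job wrote its output under `output_path` using the
    layout <output_path>/<training-job-name>/output/model.tar.gz.
    """
    return '{}/{}/output/model.tar.gz'.format(
        output_path.rstrip('/'), training_job_name
    )


print(model_artifact_uri('s3://bucket-name/', 'training-job-name'))
```

Verify that the object exists in Amazon S3 before passing the URI to `PyTorchModel`.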

After defining the model and predictor type, specify the number and type of instances for running the endpoint.

*Note: An endpoint takes several minutes to start up.*

In [None]:
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

Once the endpoint is in service, note its name so that you can link it to the Sign & Speak user interface.