# Train a Keras Sequential Model
This notebook shows how to train and host a Keras Sequential model on SageMaker. The model used for this notebook is a simple deep CNN that was extracted from [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py).

## The dataset
The [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) is one of the most popular machine learning datasets. It consists of 60,000 32x32 images belonging to 10 different classes (6,000 images per class). Here are the classes in the dataset, as well as 10 random images from each:

![cifar10](https://maet3608.github.io/nuts-ml/_images/cifar10.png)

In this tutorial, we will train a deep CNN to recognize these images.

We'll compare training with file mode, training with pipe mode datasets, and distributed training with Horovod.

In [None]:
!pip install sagemaker-experiments sagemaker boto3 awscli --upgrade

## Getting the data
To use the CIFAR-10 dataset, we first need to download it and convert it to TFRecords. This step takes around 5 minutes.

You can use the following command:

In [None]:
!python generate_cifar10_tfrecords.py --data-dir ./data

## Run the training locally

The script is configured through command-line arguments. It requires the following:
1. `model_dir` - the location where it saves checkpoints and logs
2. `train`, `validation`, `eval` - the locations of the corresponding TFRecords
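
To make the configuration concrete, here is a minimal sketch of an argument parser that accepts these flags. It is not the actual parser in `cifar10_keras.py`; the `parse_args` helper and the default value for `--epochs` are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags described above (hypothetical helper, not the real script)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir', type=str, required=True,
                        help='Where checkpoints and logs are written.')
    # One flag per data location: train, validation, eval.
    for channel in ('train', 'validation', 'eval'):
        parser.add_argument('--' + channel, type=str, required=True,
                            help='Directory holding the %s TFRecords.' % channel)
    # The default of 10 is an assumption for this sketch.
    parser.add_argument('--epochs', type=int, default=10)
    return parser.parse_args(argv)


args = parse_args(['--model_dir', './logs',
                   '--train', 'data/train',
                   '--validation', 'data/validation',
                   '--eval', 'data/eval',
                   '--epochs', '1'])
print(args.model_dir, args.train, args.validation, args.eval, args.epochs)
```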

Run the script locally:

In [None]:
!mkdir -p logs
!python training_script/cifar10_keras.py --model_dir ./logs \
                                         --train data/train \
                                         --validation data/validation \
                                         --eval data/eval \
                                         --epochs 1
!rm -rf logs

**Although the script ran on a SageMaker notebook, you can run the same script on your own computer with the same command.**

## Use TensorFlow Script Mode
For TensorFlow versions 1.11 and later, the Amazon SageMaker Python SDK supports script mode training scripts. Script mode has the following advantages over legacy mode training scripts:

* Script mode training scripts are more similar to training scripts you write for TensorFlow in general, so it is easier to modify your existing TensorFlow training scripts to work with Amazon SageMaker.

* Script mode supports both Python 2.7- and Python 3.6-compatible source files.

* Script mode supports Horovod for distributed training.

For information about writing TensorFlow script mode training scripts and using TensorFlow script mode estimators and models with Amazon SageMaker, see https://sagemaker.readthedocs.io/en/stable/using_tf.html.

### Preparing your script for training in SageMaker
The training script is very similar to a training script you might run outside of SageMaker.
SageMaker runs the script with one argument, `model_dir`, an S3 location that is used for logs and artifacts.

You can access useful properties about the training environment through various environment variables.
In this example, we send three data channels to the script: train, validation, and eval.

**Create a copy of the script (training_script/cifar10_keras.py) and save it as training_script/cifar10_keras_sm.py.**

In cifar10_keras_sm.py, scroll down to the `if __name__ == '__main__':` section.  
Update the `train`, `validation`, and `eval` arguments so that each one reads its default value from the corresponding environment variable: `SM_CHANNEL_TRAIN`, `SM_CHANNEL_VALIDATION`, and `SM_CHANNEL_EVAL`.  
The lines should look as follows:
```python
# Requires `import os` at the top of the script.
parser.add_argument(
        '--train',
        type=str,
        required=False,
        default=os.environ.get('SM_CHANNEL_TRAIN'),
        help='The directory where the CIFAR-10 training data is stored.')
parser.add_argument(
        '--validation',
        type=str,
        required=False,
        default=os.environ.get('SM_CHANNEL_VALIDATION'),
        help='The directory where the CIFAR-10 validation data is stored.')
parser.add_argument(
        '--eval',
        type=str,
        required=False,
        default=os.environ.get('SM_CHANNEL_EVAL'),
        help='The directory where the CIFAR-10 evaluation data is stored.')
```

For more information, see the SageMaker Python SDK [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script).

SageMaker does not pass the data locations as arguments; it exposes them through environment variables instead.
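
The following sketch shows how such a default behaves: when no `--train` flag is given, the value comes from the environment variable, and an explicit flag still wins, which is why the same script keeps working locally. The `build_parser` helper is hypothetical, and the path assigned to `SM_CHANNEL_TRAIN` is only an illustrative value set by hand here (inside a training job, SageMaker sets it for you).

```python
import argparse
import os


def build_parser():
    """Hypothetical helper: a parser whose default comes from the environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train', type=str, required=False,
                        default=os.environ.get('SM_CHANNEL_TRAIN'))
    return parser


# Simulate the variable that SageMaker would set inside the container.
os.environ['SM_CHANNEL_TRAIN'] = '/opt/ml/input/data/train'

from_env = build_parser().parse_args([])                          # no flag given
from_cli = build_parser().parse_args(['--train', 'data/train'])   # explicit flag

print(from_env.train)
print(from_cli.train)
```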

SageMaker provides several useful environment variables to your script, for example:
* `SM_MODEL_DIR`: A string that represents the local path where the training job can write the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting. This differs from the `model_dir` argument passed to your training script, which is an S3 location. `SM_MODEL_DIR` is always set to /opt/ml/model.
* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.
* `SM_OUTPUT_DATA_DIR`: A string that represents the path to the directory to write output artifacts to. Output artifacts might include checkpoints, graphs, and other files to save, but do not include model artifacts. These artifacts are compressed and uploaded to an S3 bucket with the same prefix as the model artifacts.
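
As a rough sketch, a script could collect these variables in one place and fall back to local values when it runs outside SageMaker. The `training_env` helper and its fallback values (`./model`, `0`, `./output`) are assumptions for illustration, not part of the SageMaker SDK.

```python
import os


def training_env(environ=os.environ):
    """Read the SageMaker variables listed above, with local fallbacks (assumed)."""
    return {
        'model_dir': environ.get('SM_MODEL_DIR', './model'),
        # Environment variables are strings, so convert the GPU count to int.
        'num_gpus': int(environ.get('SM_NUM_GPUS', '0')),
        'output_data_dir': environ.get('SM_OUTPUT_DATA_DIR', './output'),
    }


# Example with a hand-made environment instead of a real training job.
env = training_env({'SM_MODEL_DIR': '/opt/ml/model', 'SM_NUM_GPUS': '1'})
print(env)
```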

In this example, to reduce network latency, we save the model checkpoints locally; they are uploaded to S3 at the end of the job.

Add the following argument to your script:
```python
parser.add_argument(
        '--model_output_dir',
        type=str,
        default=os.environ.get('SM_MODEL_DIR'))
```
Change the `ModelCheckpoint` line to use the new location:
```python
checkpoint = ModelCheckpoint(args.model_output_dir + '/checkpoint-{epoch}.h5')
```
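
Keras fills the `{epoch}` placeholder in the file path with the epoch number, so each epoch produces its own file. The sketch below imitates that substitution with plain string formatting to show the resulting file names; `checkpoint_path` is a hypothetical helper, not a Keras function, and the exact epoch numbering Keras uses may differ by version.

```python
import os


def checkpoint_path(output_dir, epoch):
    """Hypothetical helper: expand the checkpoint template for one epoch."""
    template = os.path.join(output_dir, 'checkpoint-{epoch}.h5')
    return template.format(epoch=epoch)


paths = [checkpoint_path('/opt/ml/model', epoch) for epoch in (1, 2, 3)]
for path in paths:
    print(path)
```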

Change the save_model call to use that folder.  
From:  
```python
return save_model(model, args.model_dir)
```
To:  
```python
return save_model(model, args.model_output_dir)
```

### Test your script locally
To test, run the new script with the same command as above and make sure it runs as expected.  
Add the new `model_output_dir` as an argument to the script.

In [None]:
# Run the script locally
!mkdir -p logs
!python training_script/cifar10_keras_sm.py --model_dir ./logs \
                                         --model_output_dir ./logs \
                                         --train data/train \
                                         --validation data/validation \
                                         --eval data/eval \
                                         --epochs 1
!rm -rf logs

### Use SageMaker local for local testing
The local mode in the Amazon SageMaker Python SDK can emulate CPU (single and multi-instance) and GPU (single instance) SageMaker training jobs by changing a single argument in the TensorFlow or MXNet estimators. To do this, it uses Docker Compose and NVIDIA Docker. It also pulls the Amazon SageMaker TensorFlow container image from Amazon ECR.

Training in local mode also allows us to easily monitor metrics like GPU consumption to ensure that our code is written properly to take advantage of the hardware we’re using.
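
The single argument in question is the instance type: `'local'` runs the container on CPU and `'local_gpu'` runs it on GPU. One possible way to pick between them is to check whether `nvidia-smi` is available on the machine. The `local_instance_type` helper below is a hypothetical sketch of that idea, not part of the SDK.

```python
import shutil


def local_instance_type(which=shutil.which):
    """Return 'local_gpu' if nvidia-smi is on PATH, otherwise 'local'."""
    return 'local_gpu' if which('nvidia-smi') else 'local'


# The result depends on the machine this runs on.
print(local_instance_type())
```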

In [None]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

Using the `TensorFlow` class from `sagemaker.tensorflow`, we create a new SageMaker TensorFlow job.
We can use its arguments to pass different configuration values or hyperparameters to the script.

For more information, see the [documentation](https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow-estimator).
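
In script mode, the entries of the `hyperparameters` dictionary reach the training script as command-line arguments, which is why the script reads them with `argparse`. The sketch below only illustrates that idea: `hyperparameters_to_argv` is a hypothetical helper, not the SDK's implementation, and the `--epochs` default is an assumption.

```python
import argparse


def hyperparameters_to_argv(hyperparameters):
    """Hypothetical helper: turn {'epochs': 1} into ['--epochs', '1']."""
    argv = []
    for name, value in hyperparameters.items():
        argv += ['--' + name, str(value)]
    return argv


# The training script side: a parser that accepts the same flag.
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=10)  # default is an assumption

argv = hyperparameters_to_argv({'epochs': 1})
args = parser.parse_args(argv)
print(argv, args.epochs)
```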

In [None]:
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(base_job_name='cifar10',
                       entry_point='cifar10_keras_sm.py',
                       source_dir='training_script',
                       role=role,
                       framework_version='1.12.0',
                       py_version='py3',
                       hyperparameters={'epochs' : 1},
                       train_instance_count=1, train_instance_type='local')

In [None]:
estimator.fit({'train' :  'file://data/train',
               'validation' :  'file://data/validation',
               'eval' :  'file://data/eval'})

The first time the estimator runs, it needs to download the container image from its Amazon ECR repository, but then training can begin immediately.  There’s no need to wait for a separate training cluster to be provisioned.  In addition, on subsequent runs, which may be necessary when iterating and testing, changes to your MXNet or TensorFlow script will start to run instantaneously.

## Using SageMaker
In the next part, we'll use a GPU instance for faster training.
First, we'll upload the data to S3.
SageMaker creates a default bucket per region.
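
In the notebook, the bucket name comes from `sagemaker_session.default_bucket()`, and uploading a directory with a key prefix yields an S3 URI of the form `s3://<bucket>/<key_prefix>`. The sketch below only mimics the naming convention with plain strings so the resulting location is easy to picture; both helpers are hypothetical, and the region and account ID are placeholder values.

```python
def default_bucket_name(region, account_id):
    """Mimic the sagemaker-<region>-<account id> naming convention."""
    return 'sagemaker-{}-{}'.format(region, account_id)


def uploaded_prefix(bucket, key_prefix):
    """S3 URI under which a directory uploaded with key_prefix ends up."""
    return 's3://{}/{}'.format(bucket, key_prefix.strip('/'))


# Placeholder region and account ID, for illustration only.
bucket = default_bucket_name('us-east-1', '123456789012')
location = uploaded_prefix(bucket, 'data/DEMO-cifar10')
print(location)
```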

### Start a new SageMaker experiment
Amazon SageMaker Experiments is a capability of Amazon SageMaker that lets you organize, track, compare, and evaluate your machine learning experiments.

Machine learning is an iterative process. You need to experiment with multiple combinations of data, algorithm and parameters, all the while observing the impact of incremental changes on model accuracy. Over time this iterative experimentation can result in thousands of model training runs and model versions. This makes it hard to track the best performing models and their input configurations. It’s also difficult to compare active experiments with past experiments to identify opportunities for further incremental improvements.

Amazon SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. Experiments is integrated with Amazon SageMaker Studio providing a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best performing models.

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
import time

cifar10_experiment = Experiment.create(
    experiment_name="TensorFlow-cifar10-experiment",
    description="Classification of cifar10 images")

In [None]:
dataset_location = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-cifar10')
display(dataset_location)

First, we'll create a new trial. In this trial, we'll run a simple 20-epoch training job on a GPU instance.

In [None]:
# create trial
trial_name = f"cifar10-training-job-{int(time.time())}"
trial = Trial.create(
    trial_name=trial_name, 
    experiment_name=cifar10_experiment.experiment_name
)

Create a new estimator (you can copy the command from above). This time, use **ml.p3.2xlarge** as the instance type and configure **epochs:20**.

In [None]:
from sagemaker.tensorflow import TensorFlow
estimator = ...

This time, use the S3 data location for each of the channels:
```python
dataset_location + '/train'
dataset_location + '/validation' 
dataset_location + '/eval'
```
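
One way to keep the three channel locations consistent is to build them from the dataset location in a single step. The `channel_inputs` helper and the example bucket name below are hypothetical; in the notebook, `dataset_location` is the value returned by `upload_data`.

```python
def channel_inputs(dataset_location, channels=('train', 'validation', 'eval')):
    """Map each channel name to its S3 prefix under dataset_location."""
    base = dataset_location.rstrip('/')
    return {name: base + '/' + name for name in channels}


# Placeholder location, for illustration only.
inputs = channel_inputs('s3://example-bucket/data/DEMO-cifar10')
for name, uri in inputs.items():
    print(name, uri)
```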

Connect the trial configured above to the job by adding the experiment config to the fit function:
```python
experiment_config={
                  "ExperimentName": cifar10_experiment.experiment_name, 
                  "TrialName": trial.trial_name,
                  "TrialComponentDisplayName": "Training"}
```
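
If you plan to run several trials, it can help to build the trial name and this dictionary in small functions. The sketch below uses only plain Python; `unique_trial_name` and `make_experiment_config` are hypothetical helpers, and the fixed timestamp is there only to make the example reproducible.

```python
import time


def unique_trial_name(prefix='cifar10-training-job', now=None):
    """Append a Unix timestamp so that each trial name is unique."""
    timestamp = int(time.time() if now is None else now)
    return '{}-{}'.format(prefix, timestamp)


def make_experiment_config(experiment_name, trial_name, display_name='Training'):
    """Build the dictionary passed as experiment_config to fit()."""
    return {
        'ExperimentName': experiment_name,
        'TrialName': trial_name,
        'TrialComponentDisplayName': display_name,
    }


config = make_experiment_config('TensorFlow-cifar10-experiment',
                                unique_trial_name(now=1600000000))
print(config)
```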

In [None]:
estimator.fit({'train' :  'train_data_location',
               'validation' :  'validation_data_location',
               'eval' :  'eval_data_location'},
             experiment_config=)

**Good job!** 
You were able to run 20 epochs on a bigger instance in SageMaker.  
Before continuing to the next notebook, take a look at the training jobs section in the SageMaker console, find your job, and look at its configuration.