# BOA318: Build a fitness activity tracker using machine learning
## re:Invent 2022

![title](img/app-in-action.jpg)

In this notebook we will use Amazon SageMaker to: 
- Create a data transformer
- Train the data transformer on our dataset
- Create an ML model
- Train the ML model 
- Create a pipeline model to chain together the transformer and the ML model
- Host the pipeline model ready to make predictions from our app 

Before we get started, we will load some of the Python libraries we need:

In [None]:
import os
import boto3

import sagemaker

from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

from sagemaker.sklearn.estimator import SKLearn

import pandas as pd

Now we will set up our Amazon SageMaker working environment.  This includes the session object that will be used by the SageMaker SDK in this notebook, and an S3 location for SageMaker to store assets it's working on:

In [None]:
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = "activity-tracker-pipline-workshop"

## Send our data to S3 (SageMaker)

This handy snippet of code will take the training data we have and use the SageMaker SDK to upload it to S3:

In [None]:
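# Walk the training_data folder and pick up the CSV file, skipping notebook checkpoints.
# If more than one CSV is present, the last one found is used.
csv_data_file = None
for path, subdirs, files in os.walk("training_data"):
    for name in files:
        if ('.ipynb_checkpoints' not in path) and (name.endswith(".csv")): 
            csv_data_file = os.path.join(path, name)

if csv_data_file is None:
    raise FileNotFoundError("No CSV file found under training_data/")

# Upload the CSV to S3 under <prefix>/train in the default bucket.
train_input = sagemaker_session.upload_data(
    path=csv_data_file,
    bucket=bucket,
    key_prefix="{}/{}".format(prefix, "train"),
)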

## The Data Transformer

We will now see how we can use scikit-learn to apply some processing to our data.

We are creating a SageMaker data transformer that will:
- Fill in missing data with scikit-learn SimpleImputer (Not that we have any missing data at this time)
- Standardize features by removing the mean and scaling to unit variance with scikit-learn StandardScaler

The transformer is defined in the Python file `./scripts/pre_processor_script.py`.  Once we create the transformer, we train it on our training data.  That way it applies the same pre-processing both when we train our ML model and when new data is sent to our endpoint.

Take a look at the code; there's no need to change anything: `./scripts/pre_processor_script.py`
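
To make the two pre-processing steps concrete, here is a minimal pure-Python sketch of what mean imputation followed by standardization does to a single column.  It is an illustration only, not the contents of `pre_processor_script.py`: the helper names `fit_column` and `transform_column` are hypothetical, the values are made up, and it assumes mean imputation (the `SimpleImputer` default) and the population standard deviation (what `StandardScaler` uses).

```python
from statistics import mean, pstdev

def fit_column(values):
    """Learn the fill value and scaling statistics from training data."""
    present = [v for v in values if v is not None]
    fill_value = mean(present)  # imputer step: mean of the observed values
    filled = [fill_value if v is None else v for v in values]
    # Scaler step: statistics are computed on the already-imputed column.
    return {"fill": fill_value, "mean": mean(filled), "std": pstdev(filled)}

def transform_column(values, params):
    """Apply the learned statistics to any data, training or new."""
    filled = [params["fill"] if v is None else v for v in values]
    return [(v - params["mean"]) / params["std"] for v in filled]

train_column = [1.0, 2.0, None, 3.0]  # made-up values with one gap
params = fit_column(train_column)

scaled_train = transform_column(train_column, params)
scaled_new = transform_column([4.0, None], params)  # new data reuses the training statistics
```

The point to notice is that `scaled_new` is computed with the statistics learned from the training column, which is why the transformer is trained once and then reused at inference time.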

### Create the Transformer

To create the transformer, we specify the location of our script, and the infrastructure parameters we want to use.

In [None]:
FRAMEWORK_VERSION = "1.0-1"
script_path = "scripts/pre_processor_script.py"

sklearn_preprocessor = SKLearn(
    entry_point=script_path,
    role=role,
    framework_version=FRAMEWORK_VERSION,
    instance_type="ml.c4.xlarge",
    sagemaker_session=sagemaker_session,
    environment={"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT": "text/csv"}
)

Now we train the transformer:

*(Only run this once.  If you're trying to reconnect to the notebook, see the next two cells instead.)*

![title](img/wait_start.png)

In [None]:
# Only run this once.  If you're trying to reconnect to the notebook, use the next two cells instead.

sklearn_preprocessor.fit({"train": train_input})

![title](img/wait_end.png)

### Need Help? Oh no!  I closed the notebook or somehow lost connection and I don't want to train that again....

Use the following two cells to get back on track.  Ask for help if you need it. :) 

In [None]:
# echo "Look for the last run job prefixed sagemaker-scikit-learn-..."
# !aws sagemaker list-training-jobs

In [None]:
# # Paste the name of the last run job from above into this line...
# sklearn_preprocessor =  sklearn_preprocessor.attach('sagemaker-scikit-learn-xxxxxxxxxxxxxxxxxx', sagemaker_session)

## Preprocess Training Data with the Transformer

Now we have a transformer that's trained on our data.  Next we use it to process that same data, and create a processed dataset.

Here we ask Amazon SageMaker to spin up some compute infrastructure for us and handle all the heavy lifting. 

First we define the infrastructure we want:

In [None]:
transformer = sklearn_preprocessor.transformer(
    instance_count=1, 
    instance_type="ml.m5.xlarge", 
    assemble_with="Line", 
    accept="text/csv",
)

Now we ask Amazon SageMaker to start the processing:

*(Only run this once.  If you're trying to reconnect to the notebook, see the next two cells instead.)*

![title](img/wait_start.png)

In [None]:
# Only run this once.  If you're trying to reconnect to the notebook, use the next two cells instead.

transformer.transform(train_input, content_type="text/csv")
print("Waiting for transform job: " + transformer.latest_transform_job.job_name)
transformer.wait()
preprocessed_train = transformer.output_path

![title](img/wait_end.png)

### Need Help? Oh no!  I closed the notebook or somehow lost connection and I don't want to run that transform again....

Use the following two cells to get back on track.  Ask for help if you need it. :) 

In [None]:
# echo "Look for the last run job prefixed sagemaker-scikit-learn-..."
# !aws sagemaker list-transform-jobs

In [None]:
# # Paste the name of the last run transform job from above into this line...
# transformer = transformer.attach('sagemaker-scikit-learn-xxxxxxxxxxxxxxxxxxxxxxxx', sagemaker_session)
# preprocessed_train = transformer.output_path

## The Machine Learning Model 

As you have seen, most of the battle in ML is getting the data ready!  But now we are ready to start working with a machine learning model. 

From the analysis we performed in SageMaker Data Wrangler, we saw that the problem space is quite simple and a simple tree-based algorithm will perform very well.  For this notebook we will use Amazon SageMaker's built-in XGBoost algorithm.

The [XGBoost (eXtreme Gradient Boosting)](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) algorithm is a popular and efficient open-source implementation of the gradient boosted trees algorithm.


### Defining the Model

Here we set some hyperparameters for the algorithm.

You will see the hyperparameters are very specific.  Hmmm, how did we know these exact values?  Earlier I used Amazon SageMaker Autopilot (AutoML) to create some candidate models and automatically work out some good hyperparameter values.  If you're interested in having a go, take a look here:  [Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot)

In [None]:
hyperparameters = {
        "num_class"        : "6",
        "num_round"        : "30",
        "objective"        : "multi:softprob",
        "alpha"            : "0.08581546561800178",
        "colsample_bytree" : "0.833507617399075",
        "eta"              : "0.37501110693093653",
        "eval_metric"      : "accuracy,f1,balanced_accuracy,precision_macro,recall_macro",
        "gamma"            : "0.016348263861047225",
        "lambda"           : "0.059577845107449054",
        "max_depth"        : "3",
        "min_child_weight" : "0.000988838943049348",
        "subsample"        : "0.5303863656830915",
}

output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'activity-xgb-framework')
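
A note on `"objective": "multi:softprob"` together with `"num_class": "6"`: with this objective XGBoost returns one probability per class for every sample, so each prediction is six numbers that sum to 1.  The sketch below shows the softmax step that turns raw per-class scores into such probabilities.  The raw scores are made up for illustration and do not come from the workshop model.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [0.2, 4.1, -1.0, -1.3, 0.0, -2.5]  # hypothetical scores for 6 classes
probabilities = softmax(raw_scores)

# The predicted class is the index of the largest probability.
predicted_class = probabilities.index(max(probabilities))
```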

All the XGBoost code is written for us already.  So all we need to do is find the SageMaker prebuilt container:

In [None]:
xgboost_container = sagemaker.image_uris.retrieve(
    "xgboost", 
    sagemaker_session.boto_region_name, 
    "1.5-1"
)

And then set some specifications for the infrastructure:

In [None]:
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

### Training the Model

Training the model is now as easy as feeding in our processed training data:

In [None]:
content_type = "text/csv"
train_input = TrainingInput(preprocessed_train, content_type=content_type)

And now running the training job: 

*(Only run this once.  If you're trying to reconnect to the notebook, see the next two cells instead.)*

![title](img/wait_start.png)

In [None]:
# Only run this once.  If you're trying to reconnect to the notebook, use the next two cells instead.

estimator.fit({'train': train_input})

![title](img/wait_end.png)

### Need Help? Oh no!  I closed the notebook or somehow lost connection and I don't want to train that again....

Use the following two cells to get back on track.  Ask for help if you need it. :) 

In [None]:
# echo "Look for the last run job prefixed sagemaker-xgboost-..."
# !aws sagemaker list-training-jobs

In [None]:
# # Paste the name of the last run job from above into this line...
# estimator = estimator.attach('sagemaker-xgboost-xxxxxxxxxxxxxxxxxxxxx', sagemaker_session)

## The Pipeline Model

By now we have a trained data transformer (scikit-learn), and a trained ML model (XGBoost).  We have everything we need to process data from our phone and make predictions about that new data. 

For each new sample, we want to transform it and then have a prediction made.  To handle this process we will use a SageMaker Pipeline Model to perform the orchestration heavy lifting. 
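
Conceptually, a pipeline model sends each request through its containers in order, passing the output of one step as the input to the next.  The toy sketch below mimics that flow with plain Python functions.  The functions `fake_preprocess` and `fake_predict`, and the arithmetic inside them, are stand-ins invented for illustration; they are not the workshop's transformer or XGBoost model.

```python
def fake_preprocess(csv_row):
    """Stand-in for the scikit-learn step: parse a CSV row and rescale each feature."""
    values = [float(v) for v in csv_row.split(",")]
    return ",".join(str(v / 10.0) for v in values)

def fake_predict(csv_row):
    """Stand-in for the XGBoost step: return one score per pretend class."""
    values = [float(v) for v in csv_row.split(",")]
    total = sum(values)
    return ",".join(str(v / total) for v in values)

def run_pipeline(steps, payload):
    """Feed the payload through each step in order, like a serial inference pipeline."""
    for step in steps:
        payload = step(payload)
    return payload

result = run_pipeline([fake_preprocess, fake_predict], "1.0,2.0,5.0")
scores = [float(s) for s in result.split(",")]
```

The order of the list matters in the same way as the `models=[...]` argument in the next cell: the pre-processing step must come first so the model only ever sees transformed data.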

In [None]:
# Build the two models that make up the pipeline

scikit_learn_inference_model = sklearn_preprocessor.create_model(
    env = {"SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT":"text/csv"}
)

xgb_model = estimator.create_model()

# Names

model_name = "activity-pipeline"
endpoint_name = "activity-pipeline"

# Create Pipeline Model

sm_model = PipelineModel(
    name=model_name, 
    role=role, 
    models=[scikit_learn_inference_model, xgb_model] # Here is where we define the steps of the pipeline!
)

With our pipeline model now defined, we ask SageMaker to deploy the model to an endpoint.  Once complete it will be ready to accept API calls, and make predictions on new data.

*(Only run this once.  If you're trying to reconnect to the notebook, see below.)*

![title](img/wait_start.png)

In [1]:
# Only run this once.  If you're trying to reconnect to the notebook, see the note below.

sm_model.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge", endpoint_name=endpoint_name)

![title](img/wait_end.png)

### Need Help? Oh no!  I closed the notebook or somehow lost connection and I don't want to deploy that again....

That's okay.  We don't need to re-run that cell, but we will need to re-run the code up until that point using the commented-out helper code, and we will be back here again in no time.

You can monitor the progress of the deploying endpoint here: [SageMaker endpoints console](https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/endpoints)
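
If you prefer to check the endpoint status from code rather than the console, a small polling loop is one option.  The sketch below is not part of the workshop: `wait_for_endpoint` is a hypothetical helper name, and it takes the describe function as an argument so it can be tried without AWS access.  It assumes the describe function accepts an `EndpointName` keyword and returns a dictionary with an `EndpointStatus` field, as the boto3 SageMaker client's `describe_endpoint` does.

```python
import time

def wait_for_endpoint(describe, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll until the endpoint reports InService or Failed, then return that status."""
    for _ in range(max_polls):
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status in ("InService", "Failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Endpoint {} did not settle in time".format(endpoint_name))

# Dry run with a fake describe function that reports "Creating" twice, then "InService".
_fake_statuses = iter(["Creating", "Creating", "InService"])

def fake_describe(EndpointName):
    return {"EndpointName": EndpointName, "EndpointStatus": next(_fake_statuses)}

final_status = wait_for_endpoint(fake_describe, "activity-pipeline", poll_seconds=0)
```

In the notebook, the real call would look something like `wait_for_endpoint(boto3.client("sagemaker").describe_endpoint, endpoint_name)`.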

## Let's perform a quick test...

We want to get data streaming from a phone, but first, to check that everything is working as it should, this sample code performs a single test.

The output of this cell should be an array of values.  These values are prediction scores for each of the labels we have (Run, Walk, etc.).  The predicted action is the label with the highest score in this list.  But what are the labels?

In [None]:
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

features = "ax,ay,az,aigx,aigy,aigz,rralpha,rrbeta,rrgamma"
payload = "0.44266298739955007,0.7146334002037742,1.3313026220397495,0.467693860580073,0.6073590115775493,1.3592041447373837,16.549713027020594,11.653742763956659,17.214856502392134"

predictor = Predictor(
    endpoint_name=endpoint_name, sagemaker_session=sagemaker_session, serializer=CSVSerializer()
)

print(predictor.predict([features, payload]))

# Should be : walk

### Let's figure out the labels

Here is another handy script snippet.  It will look for the label CSV and process it to produce an ordered list of labels.  

You will need this list: copy and paste it into the LABELS environment variable of the Lambda function created by Amplify. (There are more details on this back in the main workshop instructions.)
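
The label cell derives each label name from a file name and orders the names by their numeric label.  Here is a minimal sketch of that logic on plain Python data, without pandas.  The rows, the bucket name, and the helper `label_from_filename` are hypothetical; only the column names `label` and `_data_source_filename` come from the notebook.

```python
# Hypothetical rows standing in for the two relevant columns of the label CSV.
rows = [
    {"label": 1, "_data_source_filename": "s3://example-bucket/raw/walk.csv"},
    {"label": 0, "_data_source_filename": "s3://example-bucket/raw/recline.csv"},
    {"label": 2, "_data_source_filename": "s3://example-bucket/raw/desk.csv"},
]

def label_from_filename(filename):
    """Keep the file name without its folder path or extension, e.g. '.../walk.csv' -> 'walk'."""
    base_name = filename.split("/")[-1]
    return base_name.split(".")[-2]

# Sort by the numeric label so position i in the list corresponds to class i.
ordered_rows = sorted(rows, key=lambda row: row["label"])
ordered_labels = [label_from_filename(row["_data_source_filename"]) for row in ordered_rows]

labels_env_value = ",".join(ordered_labels)
print(labels_env_value)  # recline,walk,desk
```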

In [None]:
for path, subdirs, files in os.walk("label_data"):
    for name in files:
        if ('.ipynb_checkpoints' not in path) and (name.endswith(".csv")): 
            csv_label_file = os.path.join(path, name)

df = pd.read_csv(csv_label_file, header=0)

labels = []
for filename in df.sort_values('label')['_data_source_filename']:
    labels.append((filename.split('/')[-1]).split('.')[-2])
    
print(','.join(labels))

*IF* the output is: `recline,walk,desk,run,sit`

*AND IF* the output from the prediction was: `b'0.00043103244388476014,0.9989290833473206,0.00010582990216789767,0.0001037378387991339,0.0003413469239603728,8.899380190996453e-05\n'`

*THEN* the prediction would be: `walk` (In the output the 2nd value is highest, and the 2nd label is 'walk').
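
The lookup described above can be sketched in a few lines of Python.  This is an illustrative helper only (`predicted_label` is a hypothetical name, and the Lambda function's actual code may differ); it assumes the label list is in the same order as the scores returned by the endpoint.

```python
def predicted_label(response_bytes, labels_csv):
    """Return the label whose position matches the highest score in the response."""
    scores = [float(s) for s in response_bytes.decode("utf-8").strip().split(",")]
    labels = labels_csv.split(",")
    best_index = scores.index(max(scores))
    return labels[best_index]

# Example values taken from the text above.
response = b'0.00043103244388476014,0.9989290833473206,0.00010582990216789767,0.0001037378387991339,0.0003413469239603728,8.899380190996453e-05\n'
label = predicted_label(response, "recline,walk,desk,run,sit")
print(label)  # walk
```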

## We're done!!

That's it! If you deployed the Amplify app it should connect.  

Troubleshooting: Check the environment variables in the Lambda function deployed by Amplify, and ensure the SageMaker endpoint and the labels match the values from this notebook.

### EXTRA BONUS

Once the Amplify app is deployed, use this code to generate a QR code of the site.  It's easier than typing it out! (This code will only work if there is just one Amplify app deployed in the account.)

First we quickly install a QRCode library for Python:

In [None]:
!pip3 install qrcode

Import the Python libraries: 

In [None]:
import qrcode
import matplotlib.pyplot as plt

And generate the QR code: 

In [None]:
amplify = boto3.client('amplify')

apps = amplify.list_apps()
appId = apps['apps'][0]['appId']
app = amplify.get_app( appId=appId )

url = app['app']['defaultDomain']
branchName = app['app']['productionBranch']['branchName']

image = qrcode.make("https://{}.{}".format(branchName,url))
plt.imshow(image , cmap = 'gray')
plt.show()

## End of Notebook

```
MIT No Attribution

Copyright 2022 Amazon Web Services

Permission is hereby granted, free of charge, to any person obtaining a copy of this
software and associated documentation files (the "Software"), to deal in the Software
without restriction, including without limitation the rights to use, copy, modify,
merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
```