# SageMaker Batch Transform Example Notebook
This notebook shows how to use [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) to predict whether an ISO 20022 pacs.008 XML message will be processed successfully or result in failure. It uses a model trained by SageMaker Autopilot to make the predictions.

To test or use a model with SageMaker Batch Transform, you need to know the algorithm-specific format of the model artifacts generated by model training. For more information about the output formats supported by SageMaker algorithms, see the section for the algorithm you are using in [Common Data Formats for Training](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html).

Supervised learning algorithms generally expect inference input data in CSV or JSON format. See the [Common Data Formats for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) documentation for more details on inference request payload formats.

The request payload **must** list feature values in the **same order** as they appeared during model training. The input payload also **must not** contain the target variable, because that is the value the model predicts from the input data.
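
As a minimal sketch of these two rules, the snippet below builds a headerless CSV payload from a small DataFrame. The feature names (`amount`, `currency`, `debtor_country`) and sample values are hypothetical and only illustrate the idea; `y_target` is the target column name used later in this notebook.

```python
import pandas as pd

# Hypothetical feature order used at training time (illustrative names only).
TRAINING_FEATURE_ORDER = ["amount", "currency", "debtor_country"]
TARGET_COLUMN = "y_target"


def build_csv_payload(df: pd.DataFrame) -> str:
    """Return a headerless CSV payload: features in training order, no target."""
    features = df.drop(columns=[TARGET_COLUMN], errors="ignore")
    missing = [c for c in TRAINING_FEATURE_ORDER if c not in features.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Selecting by list enforces the training-time column order.
    return features[TRAINING_FEATURE_ORDER].to_csv(header=False, index=False)


# Columns are deliberately shuffled to show the reordering.
sample = pd.DataFrame({
    "currency": ["EUR", "USD"],
    "y_target": ["Success", "Failure"],
    "amount": [100.5, 20.0],
    "debtor_country": ["DE", "US"],
})
payload = build_csv_payload(sample)
print(payload)
```

Selecting columns by an explicit list makes the order independent of how the DataFrame happened to be assembled.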

To learn the order of features, examine the features used during the data preparation and pre-processing stage to create the training dataset. For the prototype example here, the payload values must follow the order of the full feature set in the labeled raw dataset, which was created from `pacs.008 XML message` data.

You can examine the training dataset to confirm the order of its features. You can also examine the [00_gen_synthetic_dataset.ipynb](../synthetic-data/00_gen_synthetic_dataset.ipynb) notebook to see the features in the raw labeled dataset that was used in training.

The payload data format is identical for batch inference and real-time inference. The difference is the number of records, i.e. the batch size, used as input.
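
To illustrate the point, here is a small sketch that serializes one record and several records with the same function. The record values are invented for illustration, and `to_csv_payload` is a hypothetical helper, not part of any SageMaker API.

```python
import csv
import io


def to_csv_payload(records):
    """Serialize rows of feature values as headerless CSV, one record per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(records)
    return buf.getvalue()


# Invented feature rows (no target column, training-time order assumed).
records = [
    [100.5, "EUR", "DE"],
    [20.0, "USD", "US"],
    [7.25, "GBP", "GB"],
]

realtime_payload = to_csv_payload(records[:1])  # a single record
batch_payload = to_csv_payload(records)         # many records, same format

print(realtime_payload)
print(batch_payload)
```

The first line of the batch payload is exactly the real-time payload; only the record count differs.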

The diagram below shows how Amazon SageMaker Batch Transform (batch inference) works.

![SageMaker Batch Transform](../images/batch-transform.png)

## Environment Setup

In [None]:
import os
import boto3
import pandas as pd
import numpy as np
import sagemaker
from sagemaker import get_execution_role

sm = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
region = boto3.session.Session().region_name

role = get_execution_role()
print("Notebook is running with assumed role {}".format(role))
print("Working with AWS services in the {} region".format(region))

### Provide S3 Bucket Name

In [None]:
# Working directory for the notebook
WORKDIR = os.getcwd()
BASENAME = os.path.dirname(WORKDIR)

# Store all prototype assets in this bucket
s3_bucket_name = 'iso20022-prototype-t3'
s3_bucket_uri = 's3://' + s3_bucket_name

# Prefix for all files in this prototype
prefix = 'iso20022'

pacs008_prefix = prefix + '/pacs008'

# Use Trained Model - Use SageMaker Batch Transform Job To Make Predictions

Use the SageMaker Batch Transform service to test the model by supplying a batch of test data in a CSV file. The batch transform job writes its inferences to a CSV output file that contains, for each record in the input file, the model's prediction as a tuple `[(1=Success, 0=Failure), probability]`.
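
As a rough sketch of what such an output file looks like when loaded, the snippet below parses an invented three-line sample. The values are made up, and the 1/0 label encoding is assumed from the tuple description above; the real file may encode labels differently.

```python
import io

import pandas as pd

# Invented sample of an output file: one "label,probability" line per input record.
sample_output = "1,0.97\n0,0.88\n1,0.64\n"

# The file has no header row, so column names are supplied explicitly.
results = pd.read_csv(io.StringIO(sample_output), names=["Prediction", "Probability"])

# Assumed encoding from the description above: 1 = Success, 0 = Failure.
results["Outcome"] = results["Prediction"].map({1: "Success", 0: "Failure"})
print(results)
```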

## Get Model Name

Restore the model name using the notebook `%store` magic:

In [None]:
%store -r
print(model_name)

## Get Test Data

Get the test data from the S3 bucket where it was stored during the data pre-processing stage:

In [None]:
# S3 client used to download and upload data files
s3 = boto3.client('s3')

# Bucket for all files and artifacts for this prototype
s3_bucket_name = 'iso20022-prototype-t3'
s3_bucket_uri = 's3://' + s3_bucket_name

# Prefix for all files in this prototype
prefix = 'iso20022'

# Prefix for all pacs008 files
pacs008_prefix = prefix + '/pacs008'

test_data_prefix = pacs008_prefix + '/automl/test-data/test_data.csv'

# Download test data set
s3.download_file(s3_bucket_name, test_data_prefix, 'test_data.csv')

orig_test_data_df = pd.read_csv('test_data.csv')

orig_test_data_df.shape

In [None]:
orig_test_data_df.head()

In [None]:
# Drop the target column by name; the remaining feature columns keep their original order
test_data_batch_df = orig_test_data_df.drop(columns=['y_target'])
test_data_batch_df.shape

In [None]:
test_data_batch_df.head()

In [None]:
# Write the features without a header row or index column, as required for the CSV inference payload
test_data_batch_df.to_csv('test_data_batch.csv', header=False, index=False)

# Upload test dataset for batch inference
inference_test_data_location = pacs008_prefix + '/automl/inference-test-data/test_data_batch.csv'
s3.upload_file('test_data_batch.csv', s3_bucket_name, inference_test_data_location)

## Create Batch Transform Job

In [None]:
from time import gmtime, strftime, sleep

s3 = boto3.client('s3')

timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())
transform_job_name = 'pacs008-automl-batch-transform-' + timestamp_suffix

batch_inference_results_location = pacs008_prefix + '/automl/inference-test-data/batch-inference-results'

transform_input = {
    'DataSource': {
        'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': s3_bucket_uri + '/' + inference_test_data_location
        }
    },
    'ContentType': 'text/csv',
    'CompressionType': 'None',
    'SplitType': 'Line'
}

transform_output = {
    'S3OutputPath': s3_bucket_uri + '/' + batch_inference_results_location,
}

transform_resources = {
    'InstanceType': 'ml.m5.4xlarge',
    'InstanceCount': 1
}

environment = {
    'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'
}

sm.create_transform_job(
    TransformJobName=transform_job_name,
    ModelName=model_name,
    TransformInput=transform_input,
    TransformOutput=transform_output,
    TransformResources=transform_resources,
    Environment=environment
)

In [None]:
print('JobStatus')
print('----------')

# Poll every 30 seconds until the job reaches a terminal state
while True:
    describe_response = sm.describe_transform_job(TransformJobName=transform_job_name)
    job_run_status = describe_response['TransformJobStatus']
    print(job_run_status)
    if job_run_status in ('Failed', 'Completed', 'Stopped'):
        break
    sleep(30)

# Surface the failure reason, if any, so the cause is visible in the notebook
if job_run_status == 'Failed':
    print('FailureReason:', describe_response.get('FailureReason'))
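
The polling loop above runs until the job reaches a terminal state, however long that takes. A minimal sketch of the same idea with an upper time limit is shown below. `wait_for_terminal_status` is a hypothetical helper, and the demo uses a fake status sequence instead of a real AWS call.

```python
import time

TERMINAL_STATES = ("Completed", "Failed", "Stopped")


def wait_for_terminal_status(get_status, poll_seconds=30, timeout_seconds=3600,
                             sleep_fn=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still '{status}' after {waited} seconds")
        sleep_fn(poll_seconds)
        waited += poll_seconds


# Demo with a fake status sequence standing in for the describe call.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In this notebook, `get_status` could be `lambda: sm.describe_transform_job(TransformJobName=transform_job_name)['TransformJobStatus']`.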

## Check Batch Predictions on Test Sample

In [None]:
results_prefix = batch_inference_results_location + '/test_data_batch.csv.out'
local_inference_results_path = 'batch_inference_results.csv'

s3.download_file(s3_bucket_name, results_prefix, local_inference_results_path)
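
The key used above follows the pattern of the input file name with `.out` appended, placed under the output prefix. A small sketch of deriving that key is shown below. `transform_output_key` is a hypothetical helper, and it assumes a single input object, as in this notebook.

```python
import posixpath


def transform_output_key(input_key: str, output_prefix: str) -> str:
    """Build the expected output key: <output_prefix>/<input file name>.out"""
    return posixpath.join(output_prefix, posixpath.basename(input_key) + ".out")


# Keys as used in this notebook.
input_key = "iso20022/pacs008/automl/inference-test-data/test_data_batch.csv"
output_prefix = "iso20022/pacs008/automl/inference-test-data/batch-inference-results"

output_key = transform_output_key(input_key, output_prefix)
print(output_key)
```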

In [None]:
# The output file has no header row, so name the two columns explicitly
inference_results_df = pd.read_csv(local_inference_results_path, names=['Prediction', 'Probability'])
pd.set_option('display.max_rows', 10) # Keep the output on one page
inference_results_df

## Create Confusion Matrix

In [None]:
eval_df = pd.concat([orig_test_data_df['y_target'], inference_results_df['Prediction']], axis=1)
eval_df

In [None]:
# Named confusion_matrix_df so it does not clash with sklearn's confusion_matrix function imported further down
confusion_matrix_df = pd.crosstab(eval_df['y_target'], eval_df['Prediction'], rownames=['Actual'], colnames=['Predicted'], margins=True)
print(confusion_matrix_df)

In [None]:
import seaborn as sn
import matplotlib.pyplot as plt

sn.heatmap(confusion_matrix_df, cmap='Blues', annot=True, fmt='g')
plt.show()
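
The cross-tabulation above counts how often each (actual, predicted) pair occurs. A minimal standard-library sketch of the same counting is shown below; the label lists are invented for illustration and are not results from this model.

```python
from collections import Counter

# Invented labels for illustration only.
actual = ["Success", "Success", "Success", "Failure", "Failure"]
predicted = ["Success", "Success", "Failure", "Failure", "Success"]

# Each key is an (actual, predicted) pair; each value is how often it occurred.
pair_counts = Counter(zip(actual, predicted))

for (actual_label, predicted_label), count in sorted(pair_counts.items()):
    print(f"actual={actual_label:8s} predicted={predicted_label:8s} count={count}")
```

The diagonal pairs, where actual and predicted labels match, are the correct predictions.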

### Additional Model Performance Metrics

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Convert to numpy arrays
y_actual = orig_test_data_df['y_target'].to_numpy()
y_predicted = eval_df['Prediction'].to_numpy()

print('Confusion Matrix:\n ', confusion_matrix(y_actual, y_predicted, labels=['Failure', 'Success']))

print('Classification Report: ')
print(classification_report(y_actual, y_predicted, labels=['Failure', 'Success']))


In [None]:
print('Accuracy: ', accuracy_score(y_actual, y_predicted))
print('Precision: ', precision_score(y_actual, y_predicted, average='macro'))
print('Recall: ', recall_score(y_actual, y_predicted, average='macro'))
print('F1-Score: ', f1_score(y_actual, y_predicted, average='macro'))
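
With `average='macro'`, each metric is computed per class and the per-class values are averaged without weighting by class frequency. The sketch below reproduces that calculation by hand for precision and recall. The labels are invented for illustration, and `per_class_precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_precision_recall(y_true, y_pred, label):
    """Compute precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels for illustration only.
y_true = ["Success", "Success", "Success", "Failure"]
y_pred = ["Success", "Success", "Failure", "Failure"]
labels = ["Failure", "Success"]

scores = [per_class_precision_recall(y_true, y_pred, label) for label in labels]

# Macro average: unweighted mean over classes.
macro_precision = sum(p for p, _ in scores) / len(labels)
macro_recall = sum(r for _, r in scores) / len(labels)

print("Macro precision:", macro_precision)
print("Macro recall:", macro_recall)
```

Because every class counts equally, a rare class such as `Failure` influences the macro average as much as the common class.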