{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SageMaker Batch Transform Example Notebook\n", "This notebook shows an example of using [SageMaker Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html) to predict whether an ISO 20022 pacs.008 XML message will be processed successfully or result in failure. It uses the model trained by SageMaker Autopilot to make predictions.\n", "\n", "To test or use a model with SageMaker Batch Transform, you need to know the algorithm-specific format of the model artifacts that were generated by model training. For more information about output formats supported by SageMaker algorithms, see the section corresponding to the algorithm you are using in [Common Data Formats for Training](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html).\n", "\n", "Supervised learning algorithms generally expect input data during inference to be in CSV or JSON format. See the [Common Data Formats for Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) documentation for more details on inference request payload formats.\n", "\n", "The request payload **must** list feature values in the **same order** as they appeared during model training. The input payload also **must not** contain the target variable, because that is what the model predicts from the input data.\n", "\n", "To learn the order of features, examine the features used during the data preparation and pre-processing stage to create the training dataset. For the prototype example here, the payload values must follow the order of the full feature set in the labeled raw dataset, which was created from `pacs.008` XML messages.\n", "\n", "You can examine the training dataset to confirm the order of features in it. 
You can also examine the [00_gen_synthetic_dataset.ipynb](../synthetic-data/00_gen_synthetic_dataset.ipynb) notebook to see the features in the raw labeled dataset that was used in training.\n", "\n", "The payload data format is identical for batch inference and real-time inference. The only difference is the number of records, i.e. the batch size, used as input.\n", "\n", "The diagram below shows how Amazon SageMaker Batch Transform (batch inference) works.\n", "\n", "![SageMaker Batch Transform](../images/batch-transform.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import pandas as pd\n", "import numpy as np\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sm = boto3.Session().client('sagemaker')\n", "sess = sagemaker.Session()\n", "region = boto3.session.Session().region_name\n", "\n", "role = get_execution_role()\n", "print(\"Notebook is running with assumed role {}\".format(role))\n", "print(\"Working with AWS services in the {} region\".format(region))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Provide S3 Bucket Name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Working directory for the notebook\n", "WORKDIR = os.getcwd()\n", "BASENAME = os.path.dirname(WORKDIR)\n", "\n", "# Store all prototype assets in this bucket\n", "s3_bucket_name = 'iso20022-prototype-t3'\n", "s3_bucket_uri = 's3://' + s3_bucket_name\n", "\n", "# Prefix for all files in this prototype\n", "prefix = 'iso20022'\n", "\n", "# Prefix for all pacs008 files\n", "pacs008_prefix = prefix + '/pacs008'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Use Trained Model - Use SageMaker Batch Transform Job To Make Predictions\n", "\n", "Use the SageMaker Batch Transform service to test the model by supplying a batch of test data in a CSV file. 
The batch transform job writes its inferences to a CSV output file that contains, for each record in the input file, the model's prediction as a tuple `[(1=Success, 0=Failure), probability]`.\n", "\n", "## Get Model Name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Restore the model name saved with the notebook `%store` magic:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store -r\n", "print(model_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Test Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the test data from the S3 bucket where it was stored during the data pre-processing stage:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client('s3')\n", "\n", "# Bucket for all files and artifacts for this prototype\n", "s3_bucket_name = 'iso20022-prototype-t3'\n", "s3_bucket_uri = 's3://' + s3_bucket_name\n", "\n", "# Prefix for all files in this prototype\n", "prefix = 'iso20022'\n", "\n", "# Prefix for all pacs008 files\n", "pacs008_prefix = prefix + '/pacs008'\n", "\n", "test_data_prefix = pacs008_prefix + '/automl/test-data/test_data.csv'\n", "\n", "# Download test data set\n", "s3.download_file(s3_bucket_name, test_data_prefix, 'test_data.csv')\n", "\n", "orig_test_data_df = pd.read_csv('test_data.csv')\n", "\n", "orig_test_data_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "orig_test_data_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop the target column (assumed here to be the first column)\n", "test_data_batch_df = orig_test_data_df.iloc[:, 1:]\n", "test_data_batch_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data_batch_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write the CSV without a header row or index column\n", 
"test_data_batch_df.to_csv('test_data_batch.csv', header=False, index=False)\n", "\n", "# Upload test dataset for batch inference\n", "inference_test_data_location = pacs008_prefix + '/automl/inference-test-data/test_data_batch.csv'\n", "s3.upload_file('test_data_batch.csv', s3_bucket_name, inference_test_data_location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Batch Transform Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime, sleep\n", "import pandas as pd\n", "import numpy as np\n", "\n", "s3 = boto3.client('s3')\n", "session = sagemaker.Session()\n", "\n", "timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())\n", "transform_job_name = 'pacs008-automl-batch-transform-' + timestamp_suffix\n", "\n", "batch_inference_results_location = pacs008_prefix + '/automl/inference-test-data/batch-inference-results'\n", "\n", "transform_input = {\n", " 'DataSource': {\n", " 'S3DataSource': {\n", " 'S3DataType': 'S3Prefix',\n", " 'S3Uri': s3_bucket_uri + '/' + inference_test_data_location\n", " }\n", " },\n", " 'ContentType': 'text/csv',\n", " 'CompressionType': 'None',\n", " 'SplitType': 'Line'\n", " }\n", "\n", "transform_output = {\n", " 'S3OutputPath': s3_bucket_uri + '/' + batch_inference_results_location,\n", " }\n", "\n", "transform_resources = {\n", " 'InstanceType': 'ml.m5.4xlarge',\n", " 'InstanceCount': 1\n", " }\n", "\n", "environment = {\n", " 'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label, probability'\n", "}\n", "\n", "sm.create_transform_job(TransformJobName = transform_job_name,\n", " ModelName = model_name,\n", " TransformInput = transform_input,\n", " TransformOutput = transform_output,\n", " TransformResources = transform_resources,\n", " Environment = environment\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print ('JobStatus')\n", "print('----------')\n", "\n", "describe_response = 
sm.describe_transform_job(TransformJobName = transform_job_name)\n", "job_run_status = describe_response['TransformJobStatus']\n", "print(job_run_status)\n", "\n", "# Poll every 30 seconds until the job reaches a terminal state\n", "while job_run_status not in ('Failed', 'Completed', 'Stopped'):\n", "    sleep(30)\n", "    describe_response = sm.describe_transform_job(TransformJobName = transform_job_name)\n", "    job_run_status = describe_response['TransformJobStatus']\n", "    print(job_run_status)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check Batch Predictions on Test Sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_prefix = batch_inference_results_location + '/test_data_batch.csv.out'\n", "local_inference_results_path = 'batch_inference_results.csv'\n", "\n", "s3.download_file(s3_bucket_name, results_prefix, local_inference_results_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Name the two output columns explicitly when reading the results file\n", "inference_results_df = pd.read_csv(local_inference_results_path, names=['Prediction', 'Probability'])\n", "pd.set_option('display.max_rows', 10) # Keep the output on one page\n", "inference_results_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Confusion Matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pair each actual label with its prediction; rows are aligned by index\n", "eval_df = pd.concat([orig_test_data_df['y_target'], inference_results_df['Prediction']], axis=1)\n", "eval_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "confusion_matrix = pd.crosstab(eval_df['y_target'], eval_df['Prediction'], rownames=['Actual'], colnames=['Predicted'], margins = True)\n", "print(confusion_matrix)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sn\n", "import matplotlib.pyplot as plt\n", "\n", 
"sn.heatmap(confusion_matrix, cmap='Blues', annot=True, fmt='g')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional Model Performance Metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Alias the scikit-learn function so it does not overwrite the confusion_matrix DataFrame created above\n", "from sklearn.metrics import confusion_matrix as sk_confusion_matrix, classification_report\n", "from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score\n", "\n", "# Convert to numpy arrays\n", "y_actual = orig_test_data_df['y_target'].to_numpy()\n", "y_predicted = eval_df['Prediction'].to_numpy()\n", "\n", "print('Confusion Matrix:\\n ', sk_confusion_matrix(y_actual, y_predicted, labels=['Failure', 'Success']))\n", "\n", "print('Classification Report: ')\n", "print(classification_report(y_actual, y_predicted, labels=['Failure', 'Success']))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Accuracy: ', accuracy_score(y_actual, y_predicted))\n", "print('Precision: ', precision_score(y_actual, y_predicted, average='macro'))\n", "print('Recall: ', recall_score(y_actual, y_predicted, average='macro'))\n", "print('F1-Score: ', f1_score(y_actual, y_predicted, average='macro'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.g4dn.xlarge", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }