{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SageMaker Autopilot Model Training Notebook for Automated Machine Learning\n", "This notebook uses SageMaker Autopilot on training dataset to perform model training. [Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/) service automates the machine learning lifecycle. ML Model develop is an iterative process with several tasks that data scientists go through to produce an effective model that can solve business problem. The process typically involves:\n", "* Data exploration and analysis\n", "* Feature engineering\n", "* Model development\n", "* Model training and tuning\n", "* Model deployment\n", "* Model monitoring and retraining \n", "\n", "Model development can be a time consuming process. To address challenges of model development, AWS introduced [Amazon SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html), an Automated Machine Learning or AutoML service at AWS re:Invent 2019. Amazon SageMaker Autopilot is a whitebox approach to AutoML, producing the Python Notebooks for data analysis, feature engineering and model training. These notebooks can be examined by data scientists, giving them full control and visbility into model development. The image below descibes how SageMaker Autopilit works.\n", "\n", "![SageMaker Autopilot](../images/iso20022-prototype-AutoML.png)\n", "\n", "Note that model monioring and retraining is need to detect data and model drift in real world usage, learning about model's effectiveness and retraining to correct model may be making in real-world usage. 
Amazon SageMaker [Model Monitor](https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html) service can help with monitoring.\n", "\n", "In this notebook we use Amazon SageMaker Autopilot to **train multiple models** and select the best perfoming model using the model evaluation metric `Accuracy`.\n", "\n", "The problem is defined to be a `binary classification` problem, that of predicting if a pacs.008 XML message with be processed sucessfully or lead to exception process. The predicts `Success` i.e. 1 or `Failure` i.e. 0. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import pandas as pd\n", "import numpy as np\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sm = boto3.Session().client('sagemaker')\n", "sess = sagemaker.Session()\n", "region = boto3.session.Session().region_name\n", "\n", "role = get_execution_role()\n", "print (\"Notebook is running with assumed role {}\".format (role))\n", "print(\"Working with AWS services in the {} region\".format(region))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Provide S3 Bucket Name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Working directory for the notebook\n", "WORKDIR = os.getcwd()\n", "BASENAME = os.path.dirname(WORKDIR)\n", "\n", "# Store all prototype assets in this bucket\n", "s3_bucket_name = 'iso20022-prototype-t3'\n", "s3_bucket_uri = 's3://' + s3_bucket_name\n", "\n", "# Prefix for all files in this prototype\n", "prefix = 'iso20022'\n", "\n", "pacs008_prefix = prefix + '/pacs008'\n", "raw_data_prefix = pacs008_prefix + '/raw-data'\n", "labeled_data_prefix = pacs008_prefix + '/labeled-data'\n", "training_data_prefix = pacs008_prefix + '/automl/training-data'\n", "training_headers_prefix = pacs008_prefix + 
'/automl/training-headers'\n", "test_data_prefix = pacs008_prefix + '/automl/test-data'\n", "training_job_output_prefix = pacs008_prefix + '/training-output'\n", "\n", "print(f\"Training data will be uploaded to {s3_bucket_uri + '/' + training_data_prefix}\")\n", "print(f\"Test data will be uploaded to {s3_bucket_uri + '/' + test_data_prefix}\")\n", "print(f\"Training job output will be stored in {s3_bucket_uri + '/' + training_job_output_prefix}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labeled_data_location = s3_bucket_uri + '/' + labeled_data_prefix\n", "training_data_location = s3_bucket_uri + '/' + training_data_prefix\n", "test_data_location = s3_bucket_uri + '/' + test_data_prefix\n", "print(f\"Raw labeled data location = {labeled_data_location}\")\n", "print(f\"Training data location = {training_data_location}\")\n", "print(f\"Test data location = {test_data_location}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Split Labeled Dataset into Training and Test Datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the labeled raw dataset from S3\n", "s3 = boto3.client('s3')\n", "s3.download_file(s3_bucket_name, labeled_data_prefix + '/labeled_data.csv', 'labeled_data.csv')\n", "df = pd.read_csv('labeled_data.csv')\n", "df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section is not used by Autopilot, since Autopilot performs its own feature engineering; it is included only for experimentation, to see whether providing a set of selected features improves the model. Uncomment the cell below to experiment."
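,
"\n",
"\n",
"Before experimenting with feature selection, it can also help to check the label balance in the dataset (a quick sanity check added here; `y_target` is the label column used throughout this notebook):\n",
"\n",
"```python\n",
"# Fraction of Success (1) vs. Failure (0) labels\n",
"df['y_target'].value_counts(normalize=True)\n",
"```"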
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Training features\n", "# fts=[\n", "# 'y_target', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd',\n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", "# ]\n", "\n", "# # New data frame with selected features\n", "# selected_df = df[fts]\n", " \n", "# print(f\"selected_df shape: {selected_df.shape}\") \n", "# selected_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into Training and Test Datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "# Split raw labeled data to training and test datasets\n", "print('Spliting processed dataset into training and test datasets...')\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(df, df['y_target'], test_size=0.2, random_state=20, shuffle=True)\n", "# Uncomment if experimenting by selecting features for training.\n", "#X_train, X_test, y_train, y_test = train_test_split(selected_df, selected_df['y_target'], test_size=0.2, random_state=20, shuffle=True)\n", "\n", "print(f\"X_train shape: {X_train.shape}\") \n", "print(f\"X_test shape: {X_test.shape}\") \n", "print(f\"y_train shape: {y_train.shape}\") \n", "print(f\"y_test shape: {y_test.shape}\") \n", "\n", "train_data_output_path = WORKDIR + '/train_data.csv'\n", "\n", "test_data_output_path = WORKDIR + '/test_data.csv'\n", "\n", "print(\"Saving training data with headers to {}\".format(train_data_output_path))\n", "X_train.to_csv(train_data_output_path, index=False)\n", "\n", "print('Saving test data 
with headers to {}'.format(test_data_output_path))\n", "X_test.to_csv(test_data_output_path, index=False)\n", "\n", "s3.upload_file(train_data_output_path, s3_bucket_name, training_data_prefix + '/train_data.csv')\n", "s3.upload_file(test_data_output_path, s3_bucket_name, test_data_prefix + '/test_data.csv')\n", "\n", "print(f'Uploaded train data with headers to {training_data_location}')\n", "print(f'Uploaded test data with headers to {test_data_location}')\n", "\n", "print(\"Pre-processing Complete.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create a Model Using SageMaker Autopilot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting Up the SageMaker Autopilot Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "input_data_config = [{\n", " 'DataSource': {\n", " 'S3DataSource': {\n", " 'S3DataType': 'S3Prefix',\n", " 'S3Uri': training_data_location ## Where the training data is stored\n", " }\n", " },\n", " 'TargetAttributeName': 'y_target' ## Name of the target attribute\n", " }\n", " ]\n", "\n", "output_data_config = {\n", " 'S3OutputPath': s3_bucket_uri + '/' + training_job_output_prefix ## Where to store the training job output\n", " }\n", "\n", "autoMLJobConfig={\n", " 'CompletionCriteria': {\n", " 'MaxCandidates': 5 ## Number of candidate models to try\n", " }\n", "}\n", "\n", "autoMLJobObjective = {\n", " \"MetricName\": \"Accuracy\" ## Metric used to evaluate the models\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(input_data_config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start SageMaker Autopilot Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime, sleep\n", "timestamp_suffix = strftime('%Y-%m-%d-%H-%M', gmtime())\n", "\n", "auto_ml_job_name = 'pacs008-automl-' + timestamp_suffix\n", 
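"\n",
"# Added guard (not in the original notebook): the CreateAutoMLJob API requires a\n",
"# job name of at most 32 characters, unique within the account and region.\n",
"assert len(auto_ml_job_name) <= 32, 'AutoMLJobName must be at most 32 characters'\n",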
"print('AutoMLJobName: ' + auto_ml_job_name)\n", "\n", "#auto_ml_job_name = 'automl-iso20022-2021-11-25-16-13' # selected_df\n", "#auto_ml_job_name = '' #full labeled df\n", "\n", "sm.create_auto_ml_job(AutoMLJobName=auto_ml_job_name,\n", " InputDataConfig=input_data_config,\n", " OutputDataConfig=output_data_config,\n", " AutoMLJobConfig=autoMLJobConfig,\n", " AutoMLJobObjective=autoMLJobObjective,\n", " ProblemType=\"BinaryClassification\", ## Here we specify what type of problem we have\n", " RoleArn=role)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monitor Training Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import sleep\n", "print ('JobStatus - Secondary Status')\n", "print('------------------------------')\n", "\n", "#auto_ml_job_name = 'automl-iso20022-2021-11-09-22-42'\n", "print('AutoMLJobName: ' + auto_ml_job_name)\n", "describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n", "print (describe_response['AutoMLJobStatus'] + \" - \" + describe_response['AutoMLJobSecondaryStatus'])\n", "job_run_status = describe_response['AutoMLJobStatus']\n", " \n", "while job_run_status not in ('Failed', 'Completed', 'Stopped'):\n", " describe_response = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)\n", " job_run_status = describe_response['AutoMLJobStatus']\n", " \n", " print (describe_response['AutoMLJobStatus'] + \" - \" + describe_response['AutoMLJobSecondaryStatus'])\n", " sleep(30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Autopilot Training Results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pprint\n", "\n", "best_candidate = sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['BestCandidate']\n", "best_candidate_name = best_candidate['CandidateName']\n", "pprint.pprint(best_candidate)\n", "print('\\n')\n", "print(\"CandidateName: \" + 
best_candidate_name)\n", "print(\"FinalAutoMLJobObjectiveMetricName: \" + best_candidate['FinalAutoMLJobObjectiveMetric']['MetricName'])\n", "print(\"FinalAutoMLJobObjectiveMetricValue: \" + str(best_candidate['FinalAutoMLJobObjectiveMetric']['Value']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_candidate['InferenceContainers'][1]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,probability'})\n", "pprint.pprint(best_candidate['InferenceContainers'][1])\n", "\n", "best_candidate['InferenceContainers'][2]['Environment'].update({'SAGEMAKER_INFERENCE_INPUT': 'predicted_label,probability'})\n", "best_candidate['InferenceContainers'][2]['Environment'].update({'SAGEMAKER_INFERENCE_OUTPUT': 'predicted_label,probability'})\n", "pprint.pprint(best_candidate['InferenceContainers'][2])\n", "\n", "pprint.pprint(best_candidate)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "candidates = sm.list_candidates_for_auto_ml_job(AutoMLJobName=auto_ml_job_name, SortBy='FinalObjectiveMetricValue')['Candidates']\n", "for index, candidate in enumerate(candidates):\n", " print(str(index) + \" \" + candidate['CandidateName'] + \" \" + str(candidate['FinalAutoMLJobObjectiveMetric']['Value']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Autopilot Generated Candidate Model Training Notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['CandidateDefinitionNotebookLocation']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Autopilot Generated Data Exploration Notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm.describe_auto_ml_job(AutoMLJobName=auto_ml_job_name)['AutoMLJobArtifacts']['DataExplorationNotebookLocation']" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "## Store Training Job Name for use during Model Deployment\n", "This name is used by model deployment notebook that deploys a SageMaker Inference Endpoint i.e. uses SageMaker hosting services to deploy the model and expose an inference endpoint for users to use the model to make predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_job_name = candidates[0]['CandidateName']\n", "%store training_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())\n", "#model_name = 'pacs008-automl-' + timestamp_suffix\n", "\n", "model_name = auto_ml_job_name\n", "model = sm.create_model(Containers=best_candidate['InferenceContainers'],\n", " ModelName=model_name,\n", " ExecutionRoleArn=role)\n", "\n", "print('Model ARN corresponding to the best candidate is : {}'.format(model['ModelArn']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store model_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }