{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Fraud Detection for Automobile Claims: Create an End to End Pipeline" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "In this notebook, we will build a SageMaker Pipeline that automates the entire end-to-end process of preparing, training, and deploying a model that detects automobile claim fraud. For a more detailed explanation of each step of the pipeline, you can look at the series of notebooks (listed below) that implement this same process using a manual approach. Please see the [README.md](README.md) for more information about the use case implemented by this series of notebooks. \n", "\n", "\n", "1. [Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)\n", "1. [Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)\n", "1. [Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n", "1. [Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n", "\n", "\n", "## Contents\n", "1. [Prerequisites](#Prerequisites)\n", "1. [Architecture: Create a SageMaker Pipeline to Automate All the Steps from Data Prep to Model Deployment](#Architecture:-Create-a-SageMaker-Pipeline-to-Automate-All-the-Steps-from-Data-Prep-to-Model-Deployment)\n", "1. [Creating an Automated Pipeline using SageMaker Pipeline](#Creating-an-Automated-Pipeline-using-SageMaker-Pipeline)\n", "1. 
[Clean-Up](#Clean-Up)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "----" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Install required and/or update third-party libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m pip install -Uq pip\n", "!python -m pip install -q awswrangler==2.14.0 imbalanced-learn==0.7.0 sagemaker==2.41.0" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import boto3\n", "import pathlib\n", "import sagemaker\n", "import numpy as np\n", "import pandas as pd\n", "import awswrangler as wr\n", "import string\n", "\n", "import demo_helpers\n", "\n", "from sagemaker.xgboost.estimator import XGBoost\n", "from sagemaker.workflow.pipeline import Pipeline\n", "from sagemaker.workflow.steps import CreateModelStep\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.workflow.step_collections import RegisterModel\n", "from sagemaker.workflow.steps import ProcessingStep, TrainingStep\n", "from sagemaker.workflow.parameters import ParameterInteger, ParameterFloat, ParameterString" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Set region and boto3 config" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can change this to a region of your choice\n", "import sagemaker\n", "\n", "region = sagemaker.Session().boto_region_name\n", "print(\"Using AWS Region: {}\".format(region))\n", "\n", "boto3.setup_default_session(region_name=region)\n", "boto_session = boto3.Session(region_name=region)\n", "\n", "s3_client = boto3.client(\"s3\", region_name=region)\n", "\n", "sagemaker_boto_client = 
boto_session.client(\"sagemaker\")\n", "sagemaker_session = sagemaker.session.Session(\n", " boto_session=boto_session, sagemaker_client=sagemaker_boto_client\n", ")\n", "sagemaker_role = sagemaker.get_execution_role()\n", "\n", "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", "\n", "bucket = sagemaker_session.default_bucket()\n", "prefix = \"fraud-detect-demo\"\n", "\n", "claims_fg_name = f\"{prefix}-claims\"\n", "customers_fg_name = f\"{prefix}-customers\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ======> Tons of output_paths\n", "\n", "training_job_output_path = f\"s3://{bucket}/{prefix}/training_jobs\"\n", "bias_report_output_path = f\"s3://{bucket}/{prefix}/clarify-bias\"\n", "explainability_output_path = f\"s3://{bucket}/{prefix}/clarify-explainability\"\n", "\n", "train_data_uri = f\"s3://{bucket}/{prefix}/data/train/train.csv\"\n", "test_data_uri = f\"s3://{bucket}/{prefix}/data/test/test.csv\"\n", "train_data_upsampled_s3_path = f\"s3://{bucket}/{prefix}/data/train/upsampled/train.csv\"\n", "processing_dir = \"/opt/ml/processing\"\n", "create_dataset_script_uri = f\"s3://{bucket}/{prefix}/code/create_dataset.py\"\n", "pipeline_bias_output_path = f\"s3://{bucket}/{prefix}/clarify-output/pipeline/bias\"\n", "deploy_model_script_uri = f\"s3://{bucket}/{prefix}/code/deploy_model.py\"\n", "\n", "# ======> variables used for parameterizing the notebook run\n", "flow_instance_count = 1\n", "flow_instance_type = \"ml.m5.4xlarge\"\n", "\n", "train_instance_count = 1\n", "train_instance_type = \"ml.m4.xlarge\"\n", "\n", "deploy_model_instance_type = \"ml.m4.xlarge\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Architecture: Create a SageMaker Pipeline to Automate All the Steps from Data Prep to Model Deployment\n", "----\n", "\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an Automated Pipeline 
using SageMaker Pipeline\n", "\n", "- [Step 1: Claims Data Wrangler Preprocessing Step](#Step-1:-Claims-Data-Wrangler-Preprocessing-Step)\n", "- [Step 2: Customers Data Wrangler Preprocessing Step](#Step-2:-Customers-Data-Wrangler-Preprocessing-Step)\n", "- [Step 3: Create Dataset and Train/Test Split](#Step-3:-Create-Dataset-and-Train/Test-Split)\n", "- [Step 4: Train XGBoost Model](#Step-4:-Train-XGBoost-Model)\n", "- [Step 5: Model Pre-Deployment Step](#Step-5:-Model-Pre-Deployment-Step)\n", "- [Step 6: Run Bias Metrics with Clarify](#Step-6:-Run-Bias-Metrics-with-Clarify)\n", "- [Step 7: Register Model](#Step-7:-Register-Model)\n", "- [Step 8: Deploy Model](#Step-8:-Deploy-Model)\n", "- [Step 9: Combine and Run the Pipeline Steps](#Step-9:-Combine-and-Run-the-Pipeline-Steps)\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "----\n", "Now that you've manually done each step in our machine learning workflow, you can automate certain steps to allow for faster model experimentation without sacrificing transparency and model tracking. In this section you will create a pipeline which trains a new model, persists the model in SageMaker, and then adds the model to the registry." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Pipeline parameters\n", "An important feature of SageMaker Pipelines is the ability to define the steps ahead of time while still being able to change the parameters of those steps at execution time, without having to redefine the pipeline. This can be achieved by using `ParameterInteger`, `ParameterFloat`, or `ParameterString` to define a default value upfront, which can be overridden when you call `pipeline.start(parameters=parameters)` later. Only certain parameters can be defined this way." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_instance_param = ParameterString(\n", "    name=\"TrainingInstance\",\n", "    default_value=\"ml.m4.xlarge\",\n", ")\n", "\n", "model_approval_status = ParameterString(\n", "    name=\"ModelApprovalStatus\", default_value=\"PendingManualApproval\"\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1: Claims Data Wrangler Preprocessing Step" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Upload raw data to S3\n", "Before you can preprocess the raw data with Data Wrangler, it must exist in S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(\n", "    Filename=\"data/claims.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/claims.csv\"\n", ")\n", "s3_client.upload_file(\n", "    Filename=\"data/customers.csv\", Bucket=bucket, Key=f\"{prefix}/data/raw/customers.csv\"\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Update attributes within the `.flow` file\n", "Data Wrangler generates a `.flow` file that contains a reference to the S3 bucket used during the wrangling. This bucket may differ from the default bucket used in this notebook. For example, if the wrangling was done by someone else, you will probably not have access to their bucket, so you need to point the flow at your own S3 bucket before you can load the `.flow` file into Data Wrangler or access the data.\n", "\n", "After running the cell below you can open the `claims.flow` and `customers.flow` files and export the data to S3, or you can continue the guide using the provided `data/claims_preprocessed.csv` and `data/customers_preprocessed.csv` files." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "claims_flow_template_file = \"claims_flow_template\"\n", "\n", "with open(claims_flow_template_file, \"r\") as f:\n", " variables = {\"bucket\": bucket, \"prefix\": prefix}\n", " template = string.Template(f.read())\n", " claims_flow = template.substitute(variables)\n", " claims_flow = json.loads(claims_flow)\n", "\n", "with open(\"claims.flow\", \"w\") as f:\n", " json.dump(claims_flow, f)\n", "\n", "customers_flow_template_file = \"customers_flow_template\"\n", "\n", "with open(customers_flow_template_file, \"r\") as f:\n", " variables = {\"bucket\": bucket, \"prefix\": prefix}\n", " template = string.Template(f.read())\n", " customers_flow = template.substitute(variables)\n", " customers_flow = json.loads(customers_flow)\n", "\n", "with open(\"customers.flow\", \"w\") as f:\n", " json.dump(customers_flow, f)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Upload flow to S3\n", "This will become an input to the first step and, as such, needs to be in S3." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(\n", "    Filename=\"claims.flow\", Bucket=bucket, Key=f\"{prefix}/dataprep-notebooks/claims.flow\"\n", ")\n", "claims_flow_uri = f\"s3://{bucket}/{prefix}/dataprep-notebooks/claims.flow\"\n", "print(\"Claims flow file uploaded to S3\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Define the first Data Wrangler step's inputs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"claims.flow\", \"r\") as f:\n", "    claims_flow = json.load(f)\n", "\n", "flow_step_inputs = []\n", "\n", "# flow file contains the code for each transformation\n", "flow_file_input = sagemaker.processing.ProcessingInput(\n", "    source=claims_flow_uri, destination=f\"{processing_dir}/flow\", input_name=\"flow\"\n", ")\n", "\n", "flow_step_inputs.append(flow_file_input)\n", "\n", "# parse the flow file for S3 inputs to the Data Wrangler job\n", "for node in claims_flow[\"nodes\"]:\n", "    if \"dataset_definition\" in node[\"parameters\"]:\n", "        data_def = node[\"parameters\"][\"dataset_definition\"]\n", "        name = data_def[\"name\"]\n", "        s3_input = sagemaker.processing.ProcessingInput(\n", "            source=data_def[\"s3ExecutionContext\"][\"s3Uri\"],\n", "            destination=f\"{processing_dir}/{name}\",\n", "            input_name=name,\n", "        )\n", "        flow_step_inputs.append(s3_input)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Define outputs for first Data Wrangler step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "claims_output_name = (\n", "    f\"{claims_flow['nodes'][-1]['node_id']}.{claims_flow['nodes'][-1]['outputs'][0]['name']}\"\n", ")\n", "\n", "flow_step_outputs = []\n", "\n", "flow_output = sagemaker.processing.ProcessingOutput(\n", "    output_name=claims_output_name,\n", " 
feature_store_output=sagemaker.processing.FeatureStoreOutput(feature_group_name=claims_fg_name),\n", "    app_managed=True,\n", ")\n", "\n", "flow_step_outputs.append(flow_output)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Define processor and processing step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can find the proper image uri by exporting your Data Wrangler flow to a pipeline notebook\n", "# =================================\n", "from sagemaker import image_uris\n", "\n", "# Pulls the latest data-wrangler container tag, i.e. \"1.x\"\n", "# The latest tested container version was \"1.11.0\"\n", "image_uri = image_uris.retrieve(framework=\"data-wrangler\", region=region)\n", "\n", "print(\"image_uri: {}\".format(image_uri))\n", "\n", "flow_processor = sagemaker.processing.Processor(\n", "    role=sagemaker_role,\n", "    image_uri=image_uri,\n", "    instance_count=flow_instance_count,\n", "    instance_type=flow_instance_type,\n", "    max_runtime_in_seconds=86400,\n", ")\n", "\n", "output_content_type = \"CSV\"\n", "\n", "# Output configuration used as processing job container arguments\n", "claims_output_config = {claims_output_name: {\"content_type\": output_content_type}}\n", "\n", "claims_flow_step = ProcessingStep(\n", "    name=\"ClaimsDataWranglerProcessingStep\",\n", "    processor=flow_processor,\n", "    inputs=flow_step_inputs,\n", "    outputs=flow_step_outputs,\n", "    job_arguments=[f\"--output-config '{json.dumps(claims_output_config)}'\"],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2: Customers Data Wrangler Preprocessing Step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(\n", "    Filename=\"customers.flow\", Bucket=bucket, Key=f\"{prefix}/dataprep-notebooks/customers.flow\"\n", ")\n", "customers_flow_uri = 
f\"s3://{bucket}/{prefix}/dataprep-notebooks/customers.flow\"\n", "print(\"Customers flow file uploaded to S3\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"customers.flow\", \"r\") as f:\n", "    customers_flow = json.load(f)\n", "\n", "flow_step_inputs = []\n", "\n", "# flow file contains the code for each transformation\n", "flow_file_input = sagemaker.processing.ProcessingInput(\n", "    source=customers_flow_uri, destination=f\"{processing_dir}/flow\", input_name=\"flow\"\n", ")\n", "\n", "flow_step_inputs.append(flow_file_input)\n", "\n", "# parse the flow file for S3 inputs to the Data Wrangler job\n", "for node in customers_flow[\"nodes\"]:\n", "    if \"dataset_definition\" in node[\"parameters\"]:\n", "        data_def = node[\"parameters\"][\"dataset_definition\"]\n", "        name = data_def[\"name\"]\n", "        s3_input = sagemaker.processing.ProcessingInput(\n", "            source=data_def[\"s3ExecutionContext\"][\"s3Uri\"],\n", "            destination=f\"{processing_dir}/{name}\",\n", "            input_name=name,\n", "        )\n", "        flow_step_inputs.append(s3_input)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_output_name = (\n", "    f\"{customers_flow['nodes'][-1]['node_id']}.{customers_flow['nodes'][-1]['outputs'][0]['name']}\"\n", ")\n", "\n", "flow_step_outputs = []\n", "\n", "flow_output = sagemaker.processing.ProcessingOutput(\n", "    output_name=customers_output_name,\n", "    feature_store_output=sagemaker.processing.FeatureStoreOutput(\n", "        feature_group_name=customers_fg_name\n", "    ),\n", "    app_managed=True,\n", ")\n", "\n", "flow_step_outputs.append(flow_output)\n", "\n", "output_content_type = \"CSV\"\n", "\n", "# Output configuration used as processing job container arguments\n", "customers_output_config = {customers_output_name: {\"content_type\": output_content_type}}\n", "\n", "customers_flow_step = ProcessingStep(\n", "    name=\"CustomersDataWranglerProcessingStep\",\n", " 
processor=flow_processor,\n", " inputs=flow_step_inputs,\n", " outputs=flow_step_outputs,\n", " job_arguments=[f\"--output-config '{json.dumps(customers_output_config)}'\"],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3: Create Dataset and Train/Test Split" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(\n", " Filename=\"create_dataset.py\", Bucket=bucket, Key=f\"{prefix}/code/create_dataset.py\"\n", ")\n", "\n", "create_dataset_processor = SKLearnProcessor(\n", " framework_version=\"0.23-1\",\n", " role=sagemaker_role,\n", " instance_type=\"ml.m5.xlarge\",\n", " instance_count=1,\n", " base_job_name=\"fraud-detection-demo-create-dataset\",\n", " sagemaker_session=sagemaker_session,\n", ")\n", "\n", "create_dataset_step = ProcessingStep(\n", " name=\"CreateDataset\",\n", " processor=create_dataset_processor,\n", " outputs=[\n", " sagemaker.processing.ProcessingOutput(\n", " output_name=\"train_data\", source=\"/opt/ml/processing/output/train\"\n", " ),\n", " sagemaker.processing.ProcessingOutput(\n", " output_name=\"test_data\", source=\"/opt/ml/processing/output/test\"\n", " ),\n", " ],\n", " job_arguments=[\n", " \"--claims-feature-group-name\",\n", " claims_fg_name,\n", " \"--customers-feature-group-name\",\n", " customers_fg_name,\n", " \"--bucket-name\",\n", " bucket,\n", " \"--bucket-prefix\",\n", " prefix,\n", " \"--region\",\n", " region,\n", " ],\n", " code=create_dataset_script_uri,\n", " depends_on=[claims_flow_step.name, customers_flow_step.name],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4: Train XGBoost Model\n", "In this step we use the ParameterString `train_instance_param` defined at the beginning of the pipeline.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", " \"max_depth\": \"3\",\n", " 
\"eta\": \"0.2\",\n", " \"objective\": \"binary:logistic\",\n", " \"num_round\": \"100\",\n", "}\n", "\n", "xgb_estimator = XGBoost(\n", " entry_point=\"xgboost_starter_script.py\",\n", " output_path=training_job_output_path,\n", " code_location=training_job_output_path,\n", " hyperparameters=hyperparameters,\n", " role=sagemaker_role,\n", " instance_count=train_instance_count,\n", " instance_type=train_instance_param,\n", " framework_version=\"1.0-1\",\n", ")\n", "\n", "train_step = TrainingStep(\n", " name=\"XgboostTrain\",\n", " estimator=xgb_estimator,\n", " inputs={\n", " \"train\": sagemaker.inputs.TrainingInput(\n", " s3_data=create_dataset_step.properties.ProcessingOutputConfig.Outputs[\n", " \"train_data\"\n", " ].S3Output.S3Uri\n", " )\n", " },\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 5: Model Pre-Deployment Step\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = sagemaker.model.Model(\n", " name=\"fraud-detection-demo-pipeline-xgboost\",\n", " image_uri=train_step.properties.AlgorithmSpecification.TrainingImage,\n", " model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,\n", " sagemaker_session=sagemaker_session,\n", " role=sagemaker_role,\n", ")\n", "\n", "inputs = sagemaker.inputs.CreateModelInput(instance_type=\"ml.m4.xlarge\")\n", "\n", "create_model_step = CreateModelStep(name=\"ModelPreDeployment\", model=model, inputs=inputs)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 6: Run Bias Metrics with Clarify\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Clarify configuration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bias_data_config = sagemaker.clarify.DataConfig(\n", " s3_data_input_path=create_dataset_step.properties.ProcessingOutputConfig.Outputs[\n", " \"train_data\"\n", " 
].S3Output.S3Uri,\n", " s3_output_path=pipeline_bias_output_path,\n", " label=\"fraud\",\n", " dataset_type=\"text/csv\",\n", ")\n", "\n", "bias_config = sagemaker.clarify.BiasConfig(\n", " label_values_or_threshold=[0],\n", " facet_name=\"customer_gender_female\",\n", " facet_values_or_threshold=[1],\n", ")\n", "\n", "analysis_config = bias_data_config.get_config()\n", "analysis_config.update(bias_config.get_config())\n", "analysis_config[\"methods\"] = {\"pre_training_bias\": {\"methods\": \"all\"}}\n", "\n", "clarify_config_dir = pathlib.Path(\"config\")\n", "clarify_config_dir.mkdir(exist_ok=True)\n", "with open(clarify_config_dir / \"analysis_config.json\", \"w\") as f:\n", " json.dump(analysis_config, f)\n", "\n", "s3_client.upload_file(\n", " Filename=\"config/analysis_config.json\",\n", " Bucket=bucket,\n", " Key=f\"{prefix}/clarify-config/analysis_config.json\",\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Clarify processing step" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clarify_processor = sagemaker.processing.Processor(\n", " base_job_name=\"fraud-detection-demo-clarify-processor\",\n", " image_uri=sagemaker.clarify.image_uris.retrieve(framework=\"clarify\", region=region),\n", " role=sagemaker.get_execution_role(),\n", " instance_count=1,\n", " instance_type=\"ml.c5.xlarge\",\n", ")\n", "\n", "clarify_step = ProcessingStep(\n", " name=\"ClarifyProcessor\",\n", " processor=clarify_processor,\n", " inputs=[\n", " sagemaker.processing.ProcessingInput(\n", " input_name=\"analysis_config\",\n", " source=f\"s3://{bucket}/{prefix}/clarify-config/analysis_config.json\",\n", " destination=\"/opt/ml/processing/input/config\",\n", " ),\n", " sagemaker.processing.ProcessingInput(\n", " input_name=\"dataset\",\n", " source=create_dataset_step.properties.ProcessingOutputConfig.Outputs[\n", " \"train_data\"\n", " ].S3Output.S3Uri,\n", " 
destination=\"/opt/ml/processing/input/data\",\n", " ),\n", " ],\n", " outputs=[\n", " sagemaker.processing.ProcessingOutput(\n", " source=\"/opt/ml/processing/output/analysis.json\",\n", " destination=pipeline_bias_output_path,\n", " output_name=\"analysis_result\",\n", " )\n", " ],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 7: Register Model\n", "In this step you will use the ParameterString `model_approval_status` defined at the outset of the pipeline code.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mpg_name = prefix\n", "\n", "model_metrics = demo_helpers.ModelMetrics(\n", " bias=sagemaker.model_metrics.MetricsSource(\n", " s3_uri=clarify_step.properties.ProcessingOutputConfig.Outputs[\n", " \"analysis_result\"\n", " ].S3Output.S3Uri,\n", " content_type=\"application/json\",\n", " )\n", ")\n", "\n", "register_step = RegisterModel(\n", " name=\"XgboostRegisterModel\",\n", " estimator=xgb_estimator,\n", " model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,\n", " content_types=[\"text/csv\"],\n", " response_types=[\"text/csv\"],\n", " inference_instances=[\"ml.t2.medium\", \"ml.m5.xlarge\"],\n", " transform_instances=[\"ml.m5.xlarge\"],\n", " model_package_group_name=mpg_name,\n", " approval_status=model_approval_status,\n", " model_metrics=model_metrics,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 8: Deploy Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(\n", " Filename=\"deploy_model.py\", Bucket=bucket, Key=f\"{prefix}/code/deploy_model.py\"\n", ")\n", "\n", "deploy_model_processor = SKLearnProcessor(\n", " framework_version=\"0.23-1\",\n", " role=sagemaker_role,\n", " instance_type=\"ml.t3.medium\",\n", " instance_count=1,\n", " base_job_name=\"fraud-detection-demo-deploy-model\",\n", " 
sagemaker_session=sagemaker_session,\n", ")\n", "\n", "deploy_step = ProcessingStep(\n", "    name=\"DeployModel\",\n", "    processor=deploy_model_processor,\n", "    job_arguments=[\n", "        \"--model-name\",\n", "        create_model_step.properties.ModelName,\n", "        \"--region\",\n", "        region,\n", "        \"--endpoint-instance-type\",\n", "        deploy_model_instance_type,\n", "        \"--endpoint-name\",\n", "        \"xgboost-model-pipeline-0120\",\n", "    ],\n", "    code=deploy_model_script_uri,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Step 9: Combine and Run the Pipeline Steps\n", "\n", "Although listing them in order makes the pipeline easier to reason about, the parameters and steps don't need to be in order. The pipeline works out the execution order from the dependencies between the steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_name = \"FraudDetectDemo\"\n", "%store pipeline_name\n", "\n", "pipeline = Pipeline(\n", "    name=pipeline_name,\n", "    parameters=[train_instance_param, model_approval_status],\n", "    steps=[\n", "        claims_flow_step,\n", "        customers_flow_step,\n", "        create_dataset_step,\n", "        train_step,\n", "        create_model_step,\n", "        clarify_step,\n", "        register_step,\n", "        deploy_step,\n", "    ],\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Submit the pipeline definition to the SageMaker Pipeline service\n", "Note: If an existing pipeline has the same name it will be overwritten." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline.upsert(role_arn=sagemaker_role)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### View the entire pipeline definition\n", "Viewing the pipeline definition with all the string variables interpolated may help you debug the pipeline. The output is lengthy, so the cell below displays it in a scrollable area." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "json.loads(pipeline.describe()[\"PipelineDefinition\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Run the pipeline\n", "Note that this will take about 23 minutes to complete. You can watch the progress of the pipeline execution in the SageMaker Studio Components panel." ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Special pipeline parameters can be defined or changed here\n", "parameters = {\"TrainingInstance\": \"ml.m5.xlarge\"}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_response = pipeline.start(parameters=parameters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_response.wait(delay=60, max_attempts=500)\n", "start_response.describe()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "" ] }, { "attachments": { "image.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "### After completion, it will look something like this\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Clean Up\n", "----\n", "After running the demo, you should remove the resources which were created. You can also delete all the objects in the project's S3 directory by passing the keyword argument `delete_s3_objects=True`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from demo_helpers import delete_project_resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delete_project_resources(\n", "    sagemaker_boto_client=sagemaker_boto_client,\n", "    pipeline_name=pipeline_name,\n", "    mpg_name=mpg_name,\n", "    prefix=prefix,\n", "    delete_s3_objects=False,\n", "    bucket_name=bucket,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. 
The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, 
"hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, 
"memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": 
"Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, 
"_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { 
"_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }