{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Fraud Detection with Amazon SageMaker FeatureStore\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Kernel `Python 3 (Data Science)` works well with this notebook.\n", "\n", "The following policies need to be attached to the execution role:\n", "- AmazonSageMakerFullAccess\n", "- AmazonS3FullAccess\n", "\n", "## Contents\n", "1. [Background](#Background)\n", "1. [Setup SageMaker FeatureStore](#Setup-SageMaker-FeatureStore)\n", "1. [Inspect Dataset](#Inspect-Dataset)\n", "1. [Ingest Data into FeatureStore](#Ingest-Data-into-FeatureStore)\n", "1. [Build Training Dataset](#Build-Training-Dataset)\n", "1. [Train and Deploy the Model](#Train-and-Deploy-the-Model)\n", "1. [SageMaker FeatureStore At Inference](#SageMaker-FeatureStore-During-Inference)\n", "1. [Cleanup Resources](#Cleanup-Resources)\n", "\n", "## Background\n", "\n", "Amazon SageMaker FeatureStore is a new SageMaker capability that makes it easy for customers to create and manage curated data for machine learning (ML) development. SageMaker FeatureStore enables data ingestion via a high TPS API and data consumption via the online and offline stores. \n", "\n", "This notebook provides an example for the APIs provided by SageMaker FeatureStore by walking through the process of training a fraud detection model. 
The notebook demonstrates how the dataset's tables can be ingested into the FeatureStore, queried to create a training dataset, and quickly accessed during inference. \n", "\n", "\n", "### Terminology\n", "\n", "A **FeatureGroup** is the main resource that contains the metadata for all the data stored in SageMaker FeatureStore. A FeatureGroup contains a list of FeatureDefinitions. A **FeatureDefinition** consists of a name and one of the following data types: Integral, String, or Fractional. The FeatureGroup also contains an **OnlineStoreConfig** and an **OfflineStoreConfig** controlling where the data is stored. Enabling the online store allows quick access to the latest value for a Record via the GetRecord API. The offline store, which this notebook uses to build the training dataset, stores historical data in your S3 bucket. \n", "\n", "Once a FeatureGroup is created, data can be added as Records. A **Record** can be thought of as a row in a table. Each Record has a unique **RecordIdentifier** along with values for all other FeatureDefinitions in the FeatureGroup. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Setup SageMaker FeatureStore\n", "\n", "Let's start by setting up the SageMaker Python SDK and boto client. 
Note that this notebook requires a `boto3` version above `1.17.21`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "\n", "original_boto3_version = boto3.__version__\n", "%pip install 'boto3>1.17.21'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.session import Session\n", "\n", "region = boto3.Session().region_name\n", "\n", "boto_session = boto3.Session(region_name=region)\n", "\n", "sagemaker_client = boto_session.client(service_name=\"sagemaker\", region_name=region)\n", "featurestore_runtime = boto_session.client(\n", "    service_name=\"sagemaker-featurestore-runtime\", region_name=region\n", ")\n", "\n", "feature_store_session = Session(\n", "    boto_session=boto_session,\n", "    sagemaker_client=sagemaker_client,\n", "    sagemaker_featurestore_runtime_client=featurestore_runtime,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### S3 Bucket Setup For The OfflineStore\n", "\n", "SageMaker FeatureStore writes the data in the OfflineStore of a FeatureGroup to an S3 bucket owned by you. To be able to write to your S3 bucket, SageMaker FeatureStore assumes an IAM role which has access to it. The role is also owned by you.\n", "Note that the same bucket can be re-used across FeatureGroups. Data in the bucket is partitioned by FeatureGroup." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Set the default S3 bucket name; it is referenced throughout the notebook."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can modify the following to use a bucket of your choosing\n", "default_s3_bucket_name = feature_store_session.default_bucket()\n", "prefix = \"sagemaker-featurestore-demo\"\n", "\n", "print(default_s3_bucket_name)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Set up the IAM role. This role gives SageMaker FeatureStore access to your S3 bucket. \n", "\n", "
\n", "Note: In this example we use the default SageMaker role, assuming it has both AmazonSageMakerFullAccess and AmazonSageMakerFeatureStoreAccess managed policies. If not, please make sure to attach them to the role before proceeding.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import get_execution_role\n", "\n", "# You can modify the following to use a role of your choosing. See the documentation for how to create this.\n", "role = get_execution_role()\n", "print(role)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Inspect Dataset\n", "\n", "The provided dataset is a synthetic dataset with two tables: identity and transactions. They can both be joined by the `TransactionId` column. The transaction table contains information about a particular transaction such as amount, credit or debit card while the identity table contains information about the user such as device type and browser. The transaction must exist in the transaction table, but might not always be available in the identity table.\n", "\n", "The objective of the model is to predict if a transaction is fraudulent or not, given the transaction record." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import io\n", "\n", "s3_client = boto3.client(\"s3\", region_name=region)\n", "\n", "fraud_detection_bucket_name = f\"sagemaker-example-files-prod-{region}\"\n", "identity_file_key = (\n", " \"datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/sampled_identity.csv\"\n", ")\n", "transaction_file_key = (\n", " \"datasets/tabular/fraud_detection/synthethic_fraud_detection_SA/sampled_transactions.csv\"\n", ")\n", "\n", "identity_data_object = s3_client.get_object(\n", " Bucket=fraud_detection_bucket_name, Key=identity_file_key\n", ")\n", "transaction_data_object = s3_client.get_object(\n", " Bucket=fraud_detection_bucket_name, Key=transaction_file_key\n", ")\n", "\n", "identity_data = pd.read_csv(io.BytesIO(identity_data_object[\"Body\"].read()))\n", "transaction_data = 
pd.read_csv(io.BytesIO(transaction_data_object[\"Body\"].read()))\n", "\n", "identity_data = identity_data.round(5)\n", "transaction_data = transaction_data.round(5)\n", "\n", "identity_data = identity_data.fillna(0)\n", "transaction_data = transaction_data.fillna(0)\n", "\n", "# Feature transformations for this dataset are applied before ingestion into FeatureStore.\n", "# One hot encode card4, card6\n", "encoded_card_bank = pd.get_dummies(transaction_data[\"card4\"], prefix=\"card_bank\")\n", "encoded_card_type = pd.get_dummies(transaction_data[\"card6\"], prefix=\"card_type\")\n", "\n", "transformed_transaction_data = pd.concat(\n", " [transaction_data, encoded_card_type, encoded_card_bank], axis=1\n", ")\n", "# blank space is not allowed in feature name\n", "transformed_transaction_data = transformed_transaction_data.rename(\n", " columns={\"card_bank_american express\": \"card_bank_american_express\"}\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "identity_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transformed_transaction_data.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Ingest Data into FeatureStore\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this step we will create the FeatureGroups representing the transaction and identity tables." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Define FeatureGroups" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime, sleep\n", "\n", "identity_feature_group_name = \"identity-feature-group-\" + strftime(\"%d-%H-%M-%S\", gmtime())\n", "transaction_feature_group_name = \"transaction-feature-group-\" + strftime(\"%d-%H-%M-%S\", gmtime())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.feature_store.feature_group import FeatureGroup\n", "\n", "identity_feature_group = FeatureGroup(\n", "    name=identity_feature_group_name, sagemaker_session=feature_store_session\n", ")\n", "transaction_feature_group = FeatureGroup(\n", "    name=transaction_feature_group_name, sagemaker_session=feature_store_session\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "current_time_sec = int(round(time.time()))\n", "\n", "\n", "def cast_object_to_string(data_frame):\n", "    for label in data_frame.columns:\n", "        if data_frame.dtypes[label] == \"object\":\n", "            data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n", "\n", "\n", "# cast object dtype to string. 
The SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.\n", "cast_object_to_string(identity_data)\n", "cast_object_to_string(transformed_transaction_data)\n", "\n", "# record identifier and event time feature names\n", "record_identifier_feature_name = \"TransactionID\"\n", "event_time_feature_name = \"EventTime\"\n", "\n", "# append EventTime feature\n", "identity_data[event_time_feature_name] = pd.Series(\n", "    [current_time_sec] * len(identity_data), dtype=\"float64\"\n", ")\n", "transformed_transaction_data[event_time_feature_name] = pd.Series(\n", "    [current_time_sec] * len(transaction_data), dtype=\"float64\"\n", ")\n", "\n", "# load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.\n", "identity_feature_group.load_feature_definitions(data_frame=identity_data)\n", "# output is suppressed\n", "transaction_feature_group.load_feature_definitions(data_frame=transformed_transaction_data)\n", "# output is suppressed" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Create FeatureGroups in SageMaker FeatureStore" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def wait_for_feature_group_creation_complete(feature_group):\n", "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    while status == \"Creating\":\n", "        print(\"Waiting for Feature Group Creation\")\n", "        time.sleep(5)\n", "        status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    if status != \"Created\":\n", "        raise RuntimeError(f\"Failed to create feature group {feature_group.name}\")\n", "    print(f\"FeatureGroup {feature_group.name} successfully created.\")\n", "\n", "\n", "identity_feature_group.create(\n", "    s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n", "    record_identifier_name=record_identifier_feature_name,\n", "    event_time_feature_name=event_time_feature_name,\n", "    
role_arn=role,\n", "    enable_online_store=True,\n", ")\n", "\n", "transaction_feature_group.create(\n", "    s3_uri=f\"s3://{default_s3_bucket_name}/{prefix}\",\n", "    record_identifier_name=record_identifier_feature_name,\n", "    event_time_feature_name=event_time_feature_name,\n", "    role_arn=role,\n", "    enable_online_store=True,\n", ")\n", "\n", "wait_for_feature_group_creation_complete(feature_group=identity_feature_group)\n", "wait_for_feature_group_creation_complete(feature_group=transaction_feature_group)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Confirm that the FeatureGroups have been created by using the DescribeFeatureGroup and ListFeatureGroups APIs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "identity_feature_group.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transaction_feature_group.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_client.list_feature_groups() # use boto client to list FeatureGroups" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### PutRecords into FeatureGroup\n", "\n", "After the FeatureGroups have been created, we can put data into them by using the PutRecord API. This API can handle high TPS and is designed to be called by different streams. The data from all of these Put requests is buffered and written to S3 in chunks. The files are written to the offline store within a few minutes of ingestion. To accelerate ingestion in this example, we specify multiple workers that ingest in parallel. Ingesting the data takes about one minute for each of the two FeatureGroups."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "identity_feature_group.ingest(data_frame=identity_data, max_workers=3, wait=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "transaction_feature_group.ingest(data_frame=transformed_transaction_data, max_workers=5, wait=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To confirm that data has been ingested, we can quickly retrieve a record from the online store:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "record_identifier_value = str(2990130)\n", "\n", "featurestore_runtime.get_record(\n", "    FeatureGroupName=transaction_feature_group_name,\n", "    RecordIdentifierValueAsString=record_identifier_value,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can also retrieve a record from each feature group in a single call to the online store:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "featurestore_runtime.batch_get_record(\n", "    Identifiers=[\n", "        {\n", "            \"FeatureGroupName\": identity_feature_group_name,\n", "            \"RecordIdentifiersValueAsString\": [\"2990130\"],\n", "        },\n", "        {\n", "            \"FeatureGroupName\": transaction_feature_group_name,\n", "            \"RecordIdentifiersValueAsString\": [\"2990130\"],\n", "        },\n", "    ]\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The SageMaker Python SDK’s FeatureGroup class can also generate Hive DDL commands. The table schema is generated from the feature definitions: columns are named after the feature names, and data types are inferred from the feature types."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(identity_feature_group.as_hive_ddl())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(transaction_feature_group.as_hive_ddl())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now let's wait for the data to appear in our offline store before moving forward to creating a dataset. This will take approximately 5 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "account_id = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n", "print(account_id)\n", "\n", "identity_feature_group_resolved_output_s3_uri = (\n", "    identity_feature_group.describe()\n", "    .get(\"OfflineStoreConfig\")\n", "    .get(\"S3StorageConfig\")\n", "    .get(\"ResolvedOutputS3Uri\")\n", ")\n", "transaction_feature_group_resolved_output_s3_uri = (\n", "    transaction_feature_group.describe()\n", "    .get(\"OfflineStoreConfig\")\n", "    .get(\"S3StorageConfig\")\n", "    .get(\"ResolvedOutputS3Uri\")\n", ")\n", "\n", "identity_feature_group_s3_prefix = identity_feature_group_resolved_output_s3_uri.replace(\n", "    f\"s3://{default_s3_bucket_name}/\", \"\"\n", ")\n", "transaction_feature_group_s3_prefix = transaction_feature_group_resolved_output_s3_uri.replace(\n", "    f\"s3://{default_s3_bucket_name}/\", \"\"\n", ")\n", "\n", "offline_store_contents = None\n", "while offline_store_contents is None:\n", "    objects_in_bucket = s3_client.list_objects(\n", "        Bucket=default_s3_bucket_name, Prefix=transaction_feature_group_s3_prefix\n", "    )\n", "    if \"Contents\" in objects_in_bucket and len(objects_in_bucket[\"Contents\"]) > 1:\n", "        offline_store_contents = objects_in_bucket[\"Contents\"]\n", "    else:\n", "        print(\"Waiting for data in offline store...\\n\")\n", "        sleep(60)\n", "\n", "print(\"Data available.\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": 
{}, "source": [ "SageMaker FeatureStore adds metadata for each record that's ingested into the offline store." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Build Training Dataset\n", "\n", "SageMaker FeatureStore automatically builds the Glue Data Catalog for FeatureGroups (you can optionally turn it on/off while creating the FeatureGroup). In this example, we want to create one training dataset with FeatureValues from both identity and transaction FeatureGroups. This is done by utilizing the auto-built Catalog. We run an Athena query that joins the data stored in the offline store in S3 from the 2 FeatureGroups. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "identity_query = identity_feature_group.athena_query()\n", "transaction_query = transaction_feature_group.athena_query()\n", "\n", "identity_table = identity_query.table_name\n", "transaction_table = transaction_query.table_name\n", "\n", "query_string = (\n", " 'SELECT * FROM \"'\n", " + transaction_table\n", " + '\" LEFT JOIN \"'\n", " + identity_table\n", " + '\" ON \"'\n", " + transaction_table\n", " + '\".transactionid = \"'\n", " + identity_table\n", " + '\".transactionid'\n", ")\n", "print(\"Running \" + query_string)\n", "\n", "# run Athena query. 
The output is loaded to a Pandas dataframe.\n", "# dataset = pd.DataFrame()\n", "identity_query.run(\n", "    query_string=query_string,\n", "    output_location=\"s3://\" + default_s3_bucket_name + \"/\" + prefix + \"/query_results/\",\n", ")\n", "identity_query.wait()\n", "dataset = identity_query.as_dataframe()\n", "\n", "dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Prepare query results for training.\n", "query_execution = identity_query.get_query_execution()\n", "query_result = (\n", "    \"s3://\"\n", "    + default_s3_bucket_name\n", "    + \"/\"\n", "    + prefix\n", "    + \"/query_results/\"\n", "    + query_execution[\"QueryExecution\"][\"QueryExecutionId\"]\n", "    + \".csv\"\n", ")\n", "print(query_result)\n", "\n", "# Select useful columns for training with target column as the first.\n", "dataset = dataset[\n", "    [\n", "        \"isfraud\",\n", "        \"transactiondt\",\n", "        \"transactionamt\",\n", "        \"card1\",\n", "        \"card2\",\n", "        \"card3\",\n", "        \"card5\",\n", "        \"card_type_credit\",\n", "        \"card_type_debit\",\n", "        \"card_bank_american_express\",\n", "        \"card_bank_discover\",\n", "        \"card_bank_mastercard\",\n", "        \"card_bank_visa\",\n", "        \"id_01\",\n", "        \"id_02\",\n", "        \"id_03\",\n", "        \"id_04\",\n", "        \"id_05\",\n", "    ]\n", "]\n", "\n", "# Write to csv in S3 without headers and index column.\n", "dataset.to_csv(\"dataset.csv\", header=False, index=False)\n", "s3_client.upload_file(\"dataset.csv\", default_s3_bucket_name, prefix + \"/training_input/dataset.csv\")\n", "dataset_uri_prefix = \"s3://\" + default_s3_bucket_name + \"/\" + prefix + \"/training_input/\"\n", "\n", "dataset" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Train and Deploy the Model\n", "\n", "Now it's time to launch a training job to fit our model. We use the gradient boosting algorithm provided by the XGBoost library to fit our data. 
Retrieve the SageMaker XGBoost container image and construct a generic SageMaker estimator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_image = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.0-1\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: If the previous cell fails to retrieve the SageMaker XGBoost training image, the cause might be limited region support. To find available regions, see [Docker Registry Paths for SageMaker Built-in Algorithms](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) in the Amazon SageMaker developer guide. If you are in a region that is not listed in the documentation, you need to run the following code to manually pull the pre-built SageMaker XGBoost base image and push it to your [Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/). **You must run this manual Docker registration from a SageMaker Notebook instance**, because SageMaker Studio apps run inside Docker containers and do not support Docker.\n", "\n", "**Step 1**: Pull, build, and push the SageMaker XGBoost container to your ECR account. 
The following bash script pulls the SageMaker XGBoost docker image from the `us-east-2` region and pushes to your ECR.\n", "\n", "```bash\n", "%%bash\n", "public_ecr=257758044811.dkr.ecr.us-east-2.amazonaws.com\n", "image=sagemaker-xgboost\n", "tag=1.0-1-cpu-py3\n", "\n", "# Add the public ECR for XGBoost image to authenticated registries\n", "aws ecr get-login-password --region us-east-2 | \\\n", " docker login --username AWS --password-stdin $public_ecr\n", "\n", "# Pull the XGBoost image\n", "docker pull $public_ecr/$image:$tag\n", "\n", "# Push the image to your ECR\n", "my_region=$(aws configure get region)\n", "my_account=$(aws sts get-caller-identity --query Account | tr -d '\"')\n", "my_ecr=$my_account.dkr.ecr.$my_region.amazonaws.com \n", "\n", "# Authenticate your ECR\n", "aws ecr get-login-password --region $my_region | \\\n", " docker login --username AWS --password-stdin $my_ecr\n", "\n", "# Create a repository in your ECR to host the XGBoost image\n", "repository_name=sagemaker-xgboost\n", "\n", "if aws ecr create-repository --repository-name $repository_name ; then\n", " echo \"Repository $repository_name created!\"\n", "else\n", " echo \"Repository $repository_name already exists!\"\n", "fi\n", "\n", "# Push the image to your ECR\n", "docker tag $public_ecr/$image:$tag $my_ecr/$image:$tag\n", "docker push $my_ecr/$image:$tag\n", "```\n", "\n", "**Step 2**: To use the ECR image URI, run the following code and set the `training_image` string object.\n", "```python\n", "import boto3\n", "region = boto3.Session().region_name\n", "account_id = boto3.client('sts').get_caller_identity()[\"Account\"]\n", "ecr = '{}.dkr.ecr.{}.amazonaws.com'.format(account_id, region)\n", "\n", "training_image=ecr + '/' + 'sagemaker-xgboost:1.0-1-cpu-py3'\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Construct a SageMaker generic estimator using the SageMaker XGBoost container" ] }, { "cell_type": "code", "execution_count": 
null, "metadata": {}, "outputs": [], "source": [ "training_output_path = \"s3://\" + default_s3_bucket_name + \"/\" + prefix + \"/training_output\"\n", "\n", "from sagemaker.estimator import Estimator\n", "\n", "training_model = Estimator(\n", " training_image,\n", " role,\n", " instance_count=1,\n", " instance_type=\"ml.m5.2xlarge\",\n", " volume_size=5,\n", " max_run=3600,\n", " input_mode=\"File\",\n", " output_path=training_output_path,\n", " sagemaker_session=feature_store_session,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Set hyperparameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_model.set_hyperparameters(objective=\"binary:logistic\", num_round=50)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Specify training dataset \n", "Specify the training dataset created in the [Build Training Dataset](#Build-Training-Dataset) section." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data = sagemaker.inputs.TrainingInput(\n", " dataset_uri_prefix,\n", " distribution=\"FullyReplicated\",\n", " content_type=\"text/csv\",\n", " s3_data_type=\"S3Prefix\",\n", ")\n", "data_channels = {\"train\": train_data}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Start training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_model.fit(inputs=data_channels, logs=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Set up Hosting for the Model\n", "\n", "Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same instance (or type of instance) that we used to train. 
The endpoint deployment can be accomplished as follows. This takes 8-10 minutes to complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor = training_model.deploy(initial_instance_count=1, instance_type=\"ml.m5.xlarge\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker FeatureStore During Inference\n", "\n", "SageMaker FeatureStore can be useful in supplementing data for inference requests because of the low-latency GetRecord functionality. For this demo, we will be given a TransactionId and query our online FeatureGroups for data on the transaction to build our inference request. \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Incoming inference request.\n", "transaction_id = str(3450774)\n", "\n", "\n", "# Helper to parse the feature value from the record.\n", "def get_feature_value(record, feature_name):\n", " return str(list(filter(lambda r: r[\"FeatureName\"] == feature_name, record))[0][\"ValueAsString\"])\n", "\n", "\n", "transaction_response = featurestore_runtime.get_record(\n", " FeatureGroupName=transaction_feature_group_name, RecordIdentifierValueAsString=transaction_id\n", ")\n", "transaction_record = transaction_response[\"Record\"]\n", "\n", "transaction_test_data = [\n", " get_feature_value(transaction_record, \"TransactionDT\"),\n", " get_feature_value(transaction_record, \"TransactionAmt\"),\n", " get_feature_value(transaction_record, \"card1\"),\n", " get_feature_value(transaction_record, \"card2\"),\n", " get_feature_value(transaction_record, \"card3\"),\n", " get_feature_value(transaction_record, \"card5\"),\n", " get_feature_value(transaction_record, \"card_type_credit\"),\n", " get_feature_value(transaction_record, \"card_type_debit\"),\n", " get_feature_value(transaction_record, \"card_bank_american_express\"),\n", " get_feature_value(transaction_record, \"card_bank_discover\"),\n", 
" get_feature_value(transaction_record, \"card_bank_mastercard\"),\n", " get_feature_value(transaction_record, \"card_bank_visa\"),\n", "]\n", "\n", "identity_response = featurestore_runtime.get_record(\n", " FeatureGroupName=identity_feature_group_name, RecordIdentifierValueAsString=transaction_id\n", ")\n", "identity_record = identity_response[\"Record\"]\n", "id_test_data = [\n", " get_feature_value(identity_record, \"id_01\"),\n", " get_feature_value(identity_record, \"id_02\"),\n", " get_feature_value(identity_record, \"id_03\"),\n", " get_feature_value(identity_record, \"id_04\"),\n", " get_feature_value(identity_record, \"id_05\"),\n", "]\n", "\n", "# Join all pieces for inference request.\n", "inference_request = []\n", "inference_request.extend(transaction_test_data[:])\n", "inference_request.extend(id_test_data[:])\n", "\n", "inference_request" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "results = predictor.predict(\",\".join(inference_request), initial_args={\"ContentType\": \"text/csv\"})\n", "prediction = json.loads(results)\n", "print(prediction)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup Resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "identity_feature_group.delete()\n", "transaction_feature_group.delete()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# restore original boto3 version (IPython expands {original_boto3_version} in the magic line)\n", "%pip install 'boto3=={original_boto3_version}'" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. 
The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-featurestore|sagemaker_featurestore_fraud_detection_python_sdk.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }