{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Amazon SageMaker Feature Store: Introduction to Feature Store" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates how to get started with Feature Store, create feature groups, and ingest data into them. These feature groups are stored in your Feature Store.\n", "\n", "Feature groups are resources that contain metadata for all data stored in your Feature Store. A feature group is a logical grouping of features, defined in the feature store to describe records. A feature group’s definition is composed of a list of feature definitions, a record identifier name, and configurations for its online and offline store. \n", "\n", "### Overview\n", "1. Set up\n", "2. Creating a feature group\n", "3. Ingest data into a feature group\n", "\n", "### Prerequisites\n", "This notebook uses both `boto3` and Python SDK libraries, and the `Python 3 (Data Science)` kernel. This notebook works with Studio, Jupyter, and JupyterLab. 
\n", "\n", "#### Library dependencies:\n", "* `sagemaker>=2.100.0`\n", "* `numpy`\n", "* `pandas`\n", "\n", "#### Role requirements:\n", "**IMPORTANT**: You must attach the following policies to your execution role:\n", "* `AmazonS3FullAccess`\n", "* `AmazonSageMakerFeatureStoreAccess`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![policy](images/feature-store-policy.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# SageMaker Python SDK version 2.100.0 is required\n", "# boto3 version 1.24.20 is required\n", "import sagemaker\n", "import boto3\n", "import sys\n", "\n", "!pip install 'sagemaker>=2.100.0'\n", "!pip install 'boto3>=1.24.20'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import io\n", "from sagemaker.session import Session\n", "from sagemaker import get_execution_role\n", "\n", "prefix = \"sagemaker-featurestore-introduction\"\n", "role = get_execution_role()\n", "\n", "sagemaker_session = sagemaker.Session()\n", "region = sagemaker_session.boto_region_name\n", "s3_bucket_name = sagemaker_session.default_bucket()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect your data\n", "In this notebook example we ingest synthetic data. We read from `./data/feature_store_introduction_customer.csv` and `./data/feature_store_introduction_orders.csv`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_data = pd.read_csv(\"data/feature_store_introduction_customer.csv\")\n", "orders_data = pd.read_csv(\"data/feature_store_introduction_orders.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "orders_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is an illustration on the steps the data goes through before it is ingested into a Feature Store. In this notebook, we illustrate the use-case where you have data from multiple sources and want to store them independently in a feature store. Our example considers data from a data warehouse (customer data), and data from a real-time streaming service (order data). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![data flow](images/feature_store_data_ingest.svg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a feature group\n", "\n", "We first start by creating feature group names for customer_data and orders_data. Following this, we create two Feature Groups, one for `customer_data` and another for `orders_data`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime, sleep\n", "\n", "customers_feature_group_name = \"customers-feature-group-\" + strftime(\"%d-%H-%M-%S\", gmtime())\n", "orders_feature_group_name = \"orders-feature-group-\" + strftime(\"%d-%H-%M-%S\", gmtime())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instantiate a FeatureGroup object for customers_data and orders_data. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.feature_store.feature_group import FeatureGroup\n", "\n", "customers_feature_group = FeatureGroup(\n", " name=customers_feature_group_name, sagemaker_session=sagemaker_session\n", ")\n", "orders_feature_group = FeatureGroup(\n", " name=orders_feature_group_name, sagemaker_session=sagemaker_session\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "current_time_sec = int(round(time.time()))\n", "\n", "record_identifier_feature_name = \"customer_id\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Append `EventTime` feature to your data frame. This parameter is required, and time stamps each data point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_data[\"EventTime\"] = pd.Series([current_time_sec] * len(customer_data), dtype=\"float64\")\n", "orders_data[\"EventTime\"] = pd.Series([current_time_sec] * len(orders_data), dtype=\"float64\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load feature definitions to your feature group. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_feature_group.load_feature_definitions(data_frame=customer_data)\n", "orders_feature_group.load_feature_definitions(data_frame=orders_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below we call create to create two feature groups, customers_feature_group and orders_feature_group respectively" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_feature_group.create(\n", " s3_uri=f\"s3://{s3_bucket_name}/{prefix}\",\n", " record_identifier_name=record_identifier_feature_name,\n", " event_time_feature_name=\"EventTime\",\n", " role_arn=role,\n", " enable_online_store=True,\n", ")\n", "\n", "orders_feature_group.create(\n", " s3_uri=f\"s3://{s3_bucket_name}/{prefix}\",\n", " record_identifier_name=record_identifier_feature_name,\n", " event_time_feature_name=\"EventTime\",\n", " role_arn=role,\n", " enable_online_store=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To confirm that your FeatureGroup has been created we use `DescribeFeatureGroup` and `ListFeatureGroups` APIs to display the created FeatureGroup." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_feature_group.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "orders_feature_group.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_session.boto_session.client(\n", "    \"sagemaker\", region_name=region\n", ").list_feature_groups()  # We use the boto client to list FeatureGroups" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def check_feature_group_status(feature_group):\n", "    # Poll until the feature group leaves the \"Creating\" state.\n", "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    while status == \"Creating\":\n", "        print(\"Waiting for Feature Group to be Created\")\n", "        time.sleep(5)\n", "        status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    # Any final status other than \"Created\" (for example \"CreateFailed\") is an error.\n", "    if status != \"Created\":\n", "        raise RuntimeError(f\"Failed to create feature group {feature_group.name}: {status}\")\n", "    print(f\"FeatureGroup {feature_group.name} successfully created.\")\n", "\n", "\n", "check_feature_group_status(customers_feature_group)\n", "check_feature_group_status(orders_feature_group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add metadata to a feature\n", "\n", "We can add searchable metadata fields to FeatureGroup features by using the `UpdateFeatureMetadata` API. The currently supported metadata fields are `description` and `parameters`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.feature_store.inputs import FeatureParameter\n", "\n", "customers_feature_group.update_feature_metadata(\n", "    feature_name=\"customer_id\",\n", "    description=\"The ID of a customer. It is also used in orders_feature_group.\",\n", "    parameter_additions=[FeatureParameter(\"idType\", \"primaryKey\")],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To view feature metadata, we can use `DescribeFeatureMetadata` to display that feature."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_feature_group.describe_feature_metadata(feature_name=\"customer_id\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature metadata fields are searchable. We use `search` API to find features with metadata that matches some search criteria." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_session.boto_session.client(\"sagemaker\", region_name=region).search(\n", " Resource=\"FeatureMetadata\",\n", " SearchExpression={\n", " \"Filters\": [\n", " {\n", " \"Name\": \"FeatureGroupName\",\n", " \"Operator\": \"Contains\",\n", " \"Value\": \"customers-feature-group-\",\n", " },\n", " {\"Name\": \"Parameters.idType\", \"Operator\": \"Equals\", \"Value\": \"primaryKey\"},\n", " ]\n", " },\n", ") # We use the boto client to search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ingest data into a feature group\n", "\n", "We can put data into the FeatureGroup by using the `PutRecord` API. It will take < 1 minute to ingest data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_feature_group.ingest(data_frame=customer_data, max_workers=3, wait=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "orders_feature_group.ingest(data_frame=orders_data, max_workers=3, wait=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using an arbitrary customer record ID, 573291 we use `get_record` to check that the data has been ingested into the feature group." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_id = 573291\n", "sample_record = sagemaker_session.boto_session.client(\n", " \"sagemaker-featurestore-runtime\", region_name=region\n", ").get_record(\n", " FeatureGroupName=customers_feature_group_name, RecordIdentifierValueAsString=str(customer_id)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample_record" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use `batch_get_record` to check that all data has been ingested into two feature groups by providing customer IDs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_records = sagemaker_session.boto_session.client(\n", " \"sagemaker-featurestore-runtime\", region_name=region\n", ").batch_get_record(\n", " Identifiers=[\n", " {\n", " \"FeatureGroupName\": customers_feature_group_name,\n", " \"RecordIdentifiersValueAsString\": [\"573291\", \"109382\", \"828400\", \"124013\"],\n", " },\n", " {\n", " \"FeatureGroupName\": orders_feature_group_name,\n", " \"RecordIdentifiersValueAsString\": [\"573291\", \"109382\", \"828400\", \"124013\"],\n", " },\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "all_records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add features to a feature group\n", "\n", "If we want to update a FeatureGroup that has done the data ingestion, we can use the `UpdateFeatureGroup` API and then re-ingest data by using the updated dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.feature_store.feature_definition import StringFeatureDefinition\n", "\n", "customers_feature_group.update(\n", "    feature_additions=[StringFeatureDefinition(\"email\"), StringFeatureDefinition(\"name\")]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify whether the FeatureGroup was updated successfully." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def check_last_update_status(feature_group):\n", "    # LastUpdateStatus is a dict; its \"Status\" field holds the state string.\n", "    last_update_status = feature_group.describe().get(\"LastUpdateStatus\")[\"Status\"]\n", "    while last_update_status == \"InProgress\":\n", "        print(\"Waiting for FeatureGroup to be updated\")\n", "        time.sleep(5)\n", "        last_update_status = feature_group.describe().get(\"LastUpdateStatus\")[\"Status\"]\n", "    if last_update_status == \"Successful\":\n", "        print(f\"FeatureGroup {feature_group.name} successfully updated.\")\n", "    else:\n", "        print(\n", "            f\"FeatureGroup {feature_group.name} update failed. \"\n", "            f\"The LastUpdateStatus is {last_update_status}.\"\n", "        )\n", "\n", "\n", "check_last_update_status(customers_feature_group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspect the new dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_data_updated = pd.read_csv(\"data/feature_store_introduction_customer_updated.csv\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customer_data_updated.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Append the `EventTime` feature to your data frame again."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Size the EventTime column from the updated data frame itself.\n", "customer_data_updated[\"EventTime\"] = pd.Series(\n", "    [current_time_sec] * len(customer_data_updated), dtype=\"float64\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ingest the new dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_feature_group.ingest(data_frame=customer_data_updated, max_workers=3, wait=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `batch_get_record` again to check that all updated data has been ingested into `customers_feature_group` by providing customer IDs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "updated_customers_records = sagemaker_session.boto_session.client(\n", "    \"sagemaker-featurestore-runtime\", region_name=region\n", ").batch_get_record(\n", "    Identifiers=[\n", "        {\n", "            \"FeatureGroupName\": customers_feature_group_name,\n", "            \"RecordIdentifiersValueAsString\": [\"573291\", \"109382\", \"828400\", \"124013\"],\n", "        }\n", "    ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "updated_customers_records" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Clean up\n", "Here we remove the Feature Groups we created. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "customers_feature_group.delete()\n", "orders_feature_group.delete()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Next steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook you learned how to quickly get started with Feature Store and now know how to create feature groups, and ingest data into them.\n", "\n", "For an advanced example on how to use Feature Store for a Fraud Detection use-case, see [Fraud Detection with Feature Store](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/sagemaker_featurestore_fraud_detection_python_sdk.html).\n", "\n", "For detailed information about Feature Store, see the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Programmers note" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we used a variety of different API calls. Most of them are accessible through the Python SDK, however some only exist within `boto3`. You can invoke the Python SDK API calls directly on your Feature Store objects, whereas to invoke API calls that exist within `boto3`, you must first access a boto client through your boto and sagemaker sessions: e.g. `sagemaker_session.boto_session.client()`.\n", "\n", "Below we list API calls used in this notebook that exist within the Python SDK and ones that exist in `boto3` for your reference. 
\n", "\n", "#### Python SDK API Calls\n", "* `describe()`\n", "* `ingest()`\n", "* `delete()`\n", "* `create()`\n", "* `load_feature_definitions()`\n", "* `update()`\n", "* `update_feature_metadata()`\n", "* `describe_feature_metadata()`\n", "\n", "#### Boto3 API Calls\n", "* `list_feature_groups()`\n", "* `get_record()`\n", "* `batch_get_record()`\n", "* `search()`\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-featurestore|feature_store_introduction.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-featurestore|feature_store_introduction.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 2.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" } }, "nbformat": 4, "nbformat_minor": 4 }