{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon SageMaker Feature Store: Ground Truth classification labeling job output to Feature Store" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates how to securely store the output of an image or text classification labeling job from [Amazon SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) directly into Feature Store using a KMS key.\n", "\n", "The notebook starts by reading in the `output.manifest` file, the output of your classification labeling job from Amazon SageMaker Ground Truth. A helper method is provided that downloads this file from your own Amazon S3 bucket and path into your current working directory. We then prepare the manifest file for ingestion into an online or offline feature store, using a [Key Management Service (KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html) key for server-side encryption to ensure that your data is securely stored in your feature store. For more information on server-side encryption, see [Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html). 
\n", "\n", "To encrypt your data on the client side prior to ingestion, see [Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_client_side_encryption.html) for a demonstration. \n", "\n", "## Overview\n", "1. Set up.\n", "2. Prepare `output.manifest` for Feature Store. \n", "3. Create a feature group and ingest your data into it.\n", "\n", "## Prerequisites\n", "This notebook uses the Python SDK library for Feature Store, and the `Python 3 (Data Science)` kernel. To encrypt your data with a KMS key for server-side encryption, you need an active KMS key. If you do not have one, you can create one by following the [KMS Policy Template](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html#KMS-Policy-Template) steps, or you can visit the [KMS section in the console](https://console.aws.amazon.com/kms/home) and follow the prompts for creating a KMS key. This notebook is compatible with SageMaker Studio, Jupyter, and JupyterLab. \n", "\n", "## Library dependencies:\n", "* `sagemaker>=2.0.0`\n", "* `numpy`\n", "* `pandas`\n", "* `boto3`\n", "\n", "## Data\n", "This notebook uses a synthetic manifest file called `output.manifest` located in the `data` subfolder."
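] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If any of these libraries are missing from your kernel, you can install them first (the command below simply mirrors the dependency list above; adjust as needed):\n", "\n", "```\n", "%pip install 'sagemaker>=2.0.0' numpy pandas boto3\n", "```"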
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "from sagemaker.feature_store.feature_group import FeatureGroup\n", "\n", "import boto3\n", "import json\n", "import pandas as pd\n", "import sagemaker\n", "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_session = sagemaker.Session()\n", "s3_bucket_name = sagemaker_session.default_bucket()  # This is the bucket for your offline store.\n", "prefix = \"sagemaker-featurestore-demo\"\n", "role = sagemaker.get_execution_role()\n", "region = sagemaker_session.boto_region_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional - Helper Method\n", "Below is a method that you can use to download your manifest file from your S3 bucket into your current working directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def download_file_from_s3(bucket, path, filename):\n", "    \"\"\"\n", "    Download filename to your current directory.\n", "    Parameters:\n", "        bucket: S3 bucket name\n", "        path: path to the file within the bucket\n", "        filename: the name of the file you are downloading\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    import os.path\n", "\n", "    if not os.path.exists(filename):\n", "        s3 = boto3.client(\"s3\")\n", "        s3.download_file(Bucket=bucket, Key=path, Filename=filename)\n", "\n", "\n", "# Supply the path to the output.manifest file from your Ground Truth labeling job.\n", "# download_file_from_s3(s3_bucket_name, path=\"PATH\", filename=\"output.manifest\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare your manifest file for Feature Store\n", "\n", "Below is a method that parses your `output.manifest` file into a pandas DataFrame for ingestion into your Feature Store. 
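A typical entry in `output.manifest` is one JSON object per line. An illustrative example is shown below as a Python dict (the values are hypothetical, and the label key and its `-metadata` companion are named after your labeling job); note that the metadata fields appear in the positional order the parsing method relies on:\n", "\n", "```python\n", "entry = {'source-ref': 's3://your-bucket/images/image-1.jpg',\n", "         'my-labeling-job': 0,\n", "         'my-labeling-job-metadata': {'class-name': 'cat',\n", "                                      'confidence': 0.95,\n", "                                      'type': 'groundtruth/image-classification',\n", "                                      'job-name': 'labeling-job/my-labeling-job',\n", "                                      'human-annotated': 'yes',\n", "                                      'creation-date': '2020-10-18T22:18:13.527256'}}\n", "```\n", "\n", "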
At this point, it is assumed that your manifest file is in your current working directory. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_dataframe_from_manifest(filename):\n", "    \"\"\"\n", "    Return a DataFrame containing all information from your output.manifest file.\n", "    Parameters:\n", "        filename: path to your output.manifest file. This should be the\n", "            output.manifest file from an Amazon SageMaker Ground Truth classification labeling job.\n", "    Returns:\n", "        A pandas DataFrame.\n", "\n", "    Implementation details:\n", "        The key names in each JSON entry vary with the labeling job name, so the\n", "        fields are read positionally: i1 indexes the top-level keys of entry d,\n", "        and i2 indexes the keys of the nested metadata dictionary.\n", "    \"\"\"\n", "    (\n", "        item_name,\n", "        classification,\n", "        class_name_meta_data,\n", "        confidence_meta_data,\n", "        type_meta_data,\n", "        job_name_meta_data,\n", "        human_annotated_meta_data,\n", "        creation_date,\n", "    ) = ([] for _ in range(8))\n", "\n", "    for entry in open(filename, \"r\"):\n", "        d = json.loads(entry)\n", "        for i1, k in enumerate(d.keys()):\n", "            if i1 == 0:\n", "                item_name.append(d[k])\n", "            elif i1 == 1:\n", "                classification.append(d[k])\n", "            elif i1 == 2:\n", "                for i2, j in enumerate(d[k].keys()):\n", "                    if i2 == 0:\n", "                        class_name_meta_data.append(d[k][j])\n", "                    elif i2 == 1:\n", "                        confidence_meta_data.append(d[k][j])\n", "                    elif i2 == 2:\n", "                        type_meta_data.append(d[k][j])\n", "                    elif i2 == 3:\n", "                        job_name_meta_data.append(d[k][j])\n", "                    elif i2 == 4:\n", "                        human_annotated_meta_data.append(d[k][j])\n", "                    elif i2 == 5:\n", "                        creation_date.append(d[k][j])\n", "    return pd.DataFrame(\n", "        {\n", "            \"item_name\": item_name,\n", "            \"classification\": classification,\n", "            \"class_name_meta_data\": class_name_meta_data,\n", "            \"confidence_meta_data\": confidence_meta_data,\n", "            \"type_meta_data\": type_meta_data,\n", "            \"job_name_meta_data\": job_name_meta_data,\n", "            
\"human_annotated_meta_data\": human_annotated_meta_data,\n", "            \"creation_date\": creation_date,\n", "        }\n", "    )\n", "\n", "\n", "# output.manifest is located in data/\n", "df = create_dataframe_from_manifest(\"data/output.manifest\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preview the parsed manifest file as a DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cast_object_to_string(data_frame):\n", "    \"\"\"\n", "    Cast all columns of data_frame of type object to type string and return it.\n", "    Parameters:\n", "        data_frame: a pandas DataFrame\n", "    Returns:\n", "        A pandas DataFrame.\n", "    \"\"\"\n", "    for label in data_frame.columns:\n", "        if data_frame.dtypes[label] == object:\n", "            # Cast to str first so that missing values become literal strings,\n", "            # then convert to the pandas string dtype.\n", "            data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n", "    return data_frame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cast columns of df of type object to string.\n", "df = cast_object_to_string(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a feature group and ingest data into it\n", "Below we start by appending the `EventTime` feature to timestamp your records, then we load the feature definitions and instantiate the `FeatureGroup` object. Lastly, we ingest the data into your feature store."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group_name = \"ground-truth-classification-feature-group-\" + strftime(\n", "    \"%d-%H-%M-%S\", gmtime()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instantiate a `FeatureGroup` object for your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sagemaker_session)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "record_identifier_feature_name = \"item_name\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Append the `EventTime` feature to your data frame. This feature is required; it timestamps each record." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_time_sec = int(round(time.time()))\n", "\n", "event_time_feature_name = \"EventTime\"\n", "# Append the EventTime feature.\n", "df[event_time_feature_name] = pd.Series([current_time_sec] * len(df), dtype=\"float64\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the feature definitions of your data into your feature group." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.load_feature_definitions(data_frame=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create your feature group.\n", "\n", "**Important**: You will need to substitute your KMS key ARN for `kms_key` for server-side encryption. 
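For example (the value below is only a placeholder; substitute your own key ARN before running the next cell):\n", "\n", "```python\n", "kms_key = 'arn:aws:kms:us-west-2:111122223333:key/your-key-id'\n", "```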
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.create(\n", "    s3_uri=f\"s3://{s3_bucket_name}/{prefix}\",\n", "    record_identifier_name=record_identifier_feature_name,\n", "    event_time_feature_name=\"EventTime\",\n", "    role_arn=role,\n", "    enable_online_store=False,\n", "    offline_store_kms_key_id=kms_key,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of your feature group repeatedly until it has been successfully created." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def check_feature_group_status(feature_group):\n", "    \"\"\"\n", "    Print when the feature group has been successfully created.\n", "    Parameters:\n", "        feature_group: FeatureGroup\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    while status == \"Creating\":\n", "        print(\"Waiting for Feature Group to be Created\")\n", "        time.sleep(5)\n", "        status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    print(f\"FeatureGroup {feature_group.name} successfully created.\")\n", "\n", "\n", "check_feature_group_status(feature_group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ingest your data into your feature group."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.ingest(data_frame=df, max_workers=5, wait=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "time.sleep(30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client = sagemaker_session.boto_session.client(\"s3\", region_name=region)\n", "\n", "feature_group_s3_uri = (\n", " feature_group.describe()\n", " .get(\"OfflineStoreConfig\")\n", " .get(\"S3StorageConfig\")\n", " .get(\"ResolvedOutputS3Uri\")\n", ")\n", "\n", "feature_group_s3_prefix = feature_group_s3_uri.replace(f\"s3://{s3_bucket_name}/\", \"\")\n", "offline_store_contents = None\n", "while offline_store_contents is None:\n", " objects_in_bucket = s3_client.list_objects(\n", " Bucket=s3_bucket_name, Prefix=feature_group_s3_prefix\n", " )\n", " if \"Contents\" in objects_in_bucket and len(objects_in_bucket[\"Contents\"]) > 1:\n", " offline_store_contents = objects_in_bucket[\"Contents\"]\n", " else:\n", " print(\"Waiting for data in offline store...\\n\")\n", " time.sleep(60)\n", "\n", "print(\"Data available.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean up resources\n", "Remove the Feature Group that was created. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.delete()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps\n", "\n", "In this notebook we covered how to securely store the output of an image or text classification labeling job from [Amazon SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) directly into Feature Store using a KMS key.\n", "\n", "To learn more about how server-side encryption is done with Feature Store, see [Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html).\n", "\n", "To learn how to encrypt your image dataset on the client side before storing it in your feature store, see [Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_client_side_encryption.html). For more information on the AWS Encryption library, see [AWS Encryption SDK library](https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/introduction.html).\n", "\n", "For detailed information about Feature Store, see the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html).\n", "\n", "For a complete list of Feature Store notebooks, see [Feature Store notebook examples](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-notebooks.html#feature-store-sample-notebooks)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }