{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon SageMaker Feature Store: Ground Truth classification labeling job output to Feature Store" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates how to securely store the output of an image or text classification labeling job from [Amazon SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) directly into Feature Store using a KMS key.\n", "\n", "The notebook starts by reading in the `output.manifest` file, the output of your classification labeling job from Amazon SageMaker Ground Truth. A helper method is provided that downloads this file from your own Amazon S3 bucket and path into your current working directory. We then prepare the manifest file for ingestion into an online or offline feature store, using a [Key Management Service (KMS)](https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html) key for server-side encryption to ensure that your data is securely stored in your feature store. For more information on server-side encryption, see [Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html). 
\n", "\n", "To encrypt your data on the client side prior to ingestion, see [Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_client_side_encryption.html) for a demonstration. \n", "\n", "## Overview\n", "1. Set up.\n", "2. Prepare `output.manifest` for Feature Store. \n", "3. Create a feature group and ingest your data into it.\n", "\n", "## Prerequisites\n", "This notebook uses the Python SDK library for Feature Store, and the `Python 3 (Data Science)` kernel. To encrypt your data with a KMS key for server-side encryption, you need an active KMS key. If you do not have one, you can create one by following the [KMS Policy Template](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html#KMS-Policy-Template) steps, or you can visit the [KMS section in the console](https://console.aws.amazon.com/kms/home) and follow the prompts for creating a KMS key. This notebook is compatible with SageMaker Studio, Jupyter, and JupyterLab. \n", "\n", "## Library dependencies:\n", "* `sagemaker>=2.0.0`\n", "* `numpy`\n", "* `pandas`\n", "* `boto3`\n", "\n", "## Data\n", "This notebook uses a synthetic manifest file called `output.manifest` located in the `data` subfolder."
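] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If any of these libraries are missing from your kernel, you can install them first (the command below simply mirrors the dependency list above; adjust as needed):\n", "\n", "```\n", "%pip install 'sagemaker>=2.0.0' numpy pandas boto3\n", "```"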
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime\n", "from sagemaker.feature_store.feature_group import FeatureGroup\n", "\n", "import boto3\n", "import json\n", "import pandas as pd\n", "import sagemaker\n", "import time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_session = sagemaker.Session()\n", "s3_bucket_name = sagemaker_session.default_bucket()  # This is the bucket for your offline store.\n", "prefix = \"sagemaker-featurestore-demo\"\n", "role = sagemaker.get_execution_role()\n", "region = sagemaker_session.boto_region_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional - Helper Method\n", "Below is a method that you can use to download your manifest file from your S3 bucket into your current working directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def download_file_from_s3(bucket, path, filename):\n", "    \"\"\"\n", "    Download filename to your current directory.\n", "    Parameters:\n", "        bucket: S3 bucket name\n", "        path: path to the file within the bucket\n", "        filename: the name of the file you are downloading\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    import os.path\n", "\n", "    if not os.path.exists(filename):\n", "        s3 = boto3.client(\"s3\")\n", "        s3.download_file(Bucket=bucket, Key=path, Filename=filename)\n", "\n", "\n", "# Supply the path to the output.manifest file from your Ground Truth labeling job.\n", "# download_file_from_s3(s3_bucket_name, path=\"PATH\", filename=\"output.manifest\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare your manifest file for Feature Store\n", "\n", "Below is a method that parses your `output.manifest` file into a pandas DataFrame for ingestion into your Feature Store. 
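A typical entry in `output.manifest` is one JSON object per line. An illustrative example is shown below as a Python dict (the values are hypothetical, and the label key and its `-metadata` companion are named after your labeling job); note that the metadata fields appear in the positional order the parsing method relies on:\n", "\n", "```python\n", "entry = {'source-ref': 's3://your-bucket/images/image-1.jpg',\n", "         'my-labeling-job': 0,\n", "         'my-labeling-job-metadata': {'class-name': 'cat',\n", "                                      'confidence': 0.95,\n", "                                      'type': 'groundtruth/image-classification',\n", "                                      'job-name': 'labeling-job/my-labeling-job',\n", "                                      'human-annotated': 'yes',\n", "                                      'creation-date': '2020-10-18T22:18:13.527256'}}\n", "```\n", "\n", "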
At this point, it is assumed that your manifest file is in your current working directory. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_dataframe_from_manifest(filename):\n", "    \"\"\"\n", "    Return a DataFrame containing all information from your output.manifest file.\n", "    Parameters:\n", "        filename: path to your output.manifest file. This should be the\n", "            output.manifest file from an Amazon SageMaker Ground Truth classification labeling job.\n", "    Returns:\n", "        A pandas DataFrame.\n", "\n", "    Implementation details:\n", "        The key names in each JSON entry vary with the labeling job name, so the\n", "        fields are read positionally: i1 indexes the top-level keys of entry d,\n", "        and i2 indexes the keys of the nested metadata dictionary.\n", "    \"\"\"\n", "    (\n", "        item_name,\n", "        classification,\n", "        class_name_meta_data,\n", "        confidence_meta_data,\n", "        type_meta_data,\n", "        job_name_meta_data,\n", "        human_annotated_meta_data,\n", "        creation_date,\n", "    ) = ([] for _ in range(8))\n", "\n", "    for entry in open(filename, \"r\"):\n", "        d = json.loads(entry)\n", "        for i1, k in enumerate(d.keys()):\n", "            if i1 == 0:\n", "                item_name.append(d[k])\n", "            elif i1 == 1:\n", "                classification.append(d[k])\n", "            elif i1 == 2:\n", "                for i2, j in enumerate(d[k].keys()):\n", "                    if i2 == 0:\n", "                        class_name_meta_data.append(d[k][j])\n", "                    elif i2 == 1:\n", "                        confidence_meta_data.append(d[k][j])\n", "                    elif i2 == 2:\n", "                        type_meta_data.append(d[k][j])\n", "                    elif i2 == 3:\n", "                        job_name_meta_data.append(d[k][j])\n", "                    elif i2 == 4:\n", "                        human_annotated_meta_data.append(d[k][j])\n", "                    elif i2 == 5:\n", "                        creation_date.append(d[k][j])\n", "    return pd.DataFrame(\n", "        {\n", "            \"item_name\": item_name,\n", "            \"classification\": classification,\n", "            \"class_name_meta_data\": class_name_meta_data,\n", "            \"confidence_meta_data\": confidence_meta_data,\n", "            \"type_meta_data\": type_meta_data,\n", "            \"job_name_meta_data\": job_name_meta_data,\n", "            
\"human_annotated_meta_data\": human_annotated_meta_data,\n", "            \"creation_date\": creation_date,\n", "        }\n", "    )\n", "\n", "\n", "# output.manifest is located in data/\n", "df = create_dataframe_from_manifest(\"data/output.manifest\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Preview the parsed manifest file as a DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def cast_object_to_string(data_frame):\n", "    \"\"\"\n", "    Cast all columns of data_frame of type object to type string and return it.\n", "    Parameters:\n", "        data_frame: a pandas DataFrame\n", "    Returns:\n", "        A pandas DataFrame.\n", "    \"\"\"\n", "    for label in data_frame.columns:\n", "        if data_frame.dtypes[label] == object:\n", "            # Cast to str first so that missing values become literal strings,\n", "            # then convert to the pandas string dtype.\n", "            data_frame[label] = data_frame[label].astype(\"str\").astype(\"string\")\n", "    return data_frame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cast columns of df of type object to string.\n", "df = cast_object_to_string(df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a feature group and ingest data into it\n", "Below we start by appending the `EventTime` feature to timestamp your records, then we load the feature definitions and instantiate the `FeatureGroup` object. Lastly, we ingest the data into your feature store."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group_name = \"ground-truth-classification-feature-group-\" + strftime(\n", "    \"%d-%H-%M-%S\", gmtime()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instantiate a `FeatureGroup` object for your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sagemaker_session)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "record_identifier_feature_name = \"item_name\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Append the `EventTime` feature to your data frame. This feature is required; it timestamps each record." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_time_sec = int(round(time.time()))\n", "\n", "event_time_feature_name = \"EventTime\"\n", "# Append the EventTime feature.\n", "df[event_time_feature_name] = pd.Series([current_time_sec] * len(df), dtype=\"float64\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the feature definitions of your data into your feature group." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.load_feature_definitions(data_frame=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create your feature group.\n", "\n", "**Important**: You will need to substitute your KMS key ARN for `kms_key` for server-side encryption. 
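For example (the value below is only a placeholder; substitute your own key ARN before running the next cell):\n", "\n", "```python\n", "kms_key = 'arn:aws:kms:us-west-2:111122223333:key/your-key-id'\n", "```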
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.create(\n", "    s3_uri=f\"s3://{s3_bucket_name}/{prefix}\",\n", "    record_identifier_name=record_identifier_feature_name,\n", "    event_time_feature_name=\"EventTime\",\n", "    role_arn=role,\n", "    enable_online_store=False,\n", "    offline_store_kms_key_id=kms_key,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the status of your feature group repeatedly until it has been successfully created." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def check_feature_group_status(feature_group):\n", "    \"\"\"\n", "    Print when the feature group has been successfully created.\n", "    Parameters:\n", "        feature_group: FeatureGroup\n", "    Returns:\n", "        None\n", "    \"\"\"\n", "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    while status == \"Creating\":\n", "        print(\"Waiting for Feature Group to be Created\")\n", "        time.sleep(5)\n", "        status = feature_group.describe().get(\"FeatureGroupStatus\")\n", "    print(f\"FeatureGroup {feature_group.name} successfully created.\")\n", "\n", "\n", "check_feature_group_status(feature_group)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ingest your data into your feature group."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.ingest(data_frame=df, max_workers=5, wait=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "time.sleep(30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_client = sagemaker_session.boto_session.client(\"s3\", region_name=region)\n", "\n", "feature_group_s3_uri = (\n", " feature_group.describe()\n", " .get(\"OfflineStoreConfig\")\n", " .get(\"S3StorageConfig\")\n", " .get(\"ResolvedOutputS3Uri\")\n", ")\n", "\n", "feature_group_s3_prefix = feature_group_s3_uri.replace(f\"s3://{s3_bucket_name}/\", \"\")\n", "offline_store_contents = None\n", "while offline_store_contents is None:\n", " objects_in_bucket = s3_client.list_objects(\n", " Bucket=s3_bucket_name, Prefix=feature_group_s3_prefix\n", " )\n", " if \"Contents\" in objects_in_bucket and len(objects_in_bucket[\"Contents\"]) > 1:\n", " offline_store_contents = objects_in_bucket[\"Contents\"]\n", " else:\n", " print(\"Waiting for data in offline store...\\n\")\n", " time.sleep(60)\n", "\n", "print(\"Data available.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean up resources\n", "Remove the Feature Group that was created. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature_group.delete()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps\n", "\n", "In this notebook we covered how to securely store the output of an image or text classification labeling job from [Amazon SageMaker Ground Truth](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) directly into Feature Store using a KMS key.\n", "\n", "To learn more about how server-side encryption is done with Feature Store, see [Feature Store: Encrypt Data in your Online or Offline Feature Store using KMS key](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_kms_key_encryption.html).\n", "\n", "To learn how to encrypt your image dataset on the client side before storing it in your feature store, see [Amazon SageMaker Feature Store: Client-side Encryption using AWS Encryption SDK](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-featurestore/feature_store_client_side_encryption.html). For more information on the AWS Encryption library, see [AWS Encryption SDK library](https://docs.aws.amazon.com/encryption-sdk/latest/developer-guide/introduction.html).\n", "\n", "For detailed information about Feature Store, see the [Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html).\n", "\n", "For a complete list of Feature Store notebooks, see [Feature Store notebook examples](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-notebooks.html#feature-store-sample-notebooks)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-featurestore|feature_store_classification_job_to_ground_truth.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }