{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup environment for data transformation and ingestion workflow\n",
"This notebook sets up needed resources and parameters for a custom SageMaker project which provision a data transformation and ingestion workflow:\n",
"\n",
"\n",
"\n",
"1. Data file or files uploaded to an Amazon S3 bucket\n",
"2. Data processing and transformation process is launched\n",
"3. Extracted, processed, and transformed features are ingested into a designated feature group in Feature Store\n",
"\n",
"The notebook takes you through following activites to create the pre-requisite resources:\n",
"- Get an Amazon S3 bucket for data upload\n",
"- download the dataset and explore the data\n",
"- create Amazon Data Wrangler flow for data transformation and feature ingestion\n",
"- create a new feature group in Feature Store where features are stored\n",
"\n",
"⭐ Depending on your specific use case and requirements, for your own custom project you can consider to create all these resources as part of the project provisioning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load packages:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import json\n",
"import boto3\n",
"import pandas as pd\n",
"import sagemaker\n",
"from sagemaker.session import Session\n",
"from sagemaker.feature_store.feature_definition import FeatureDefinition\n",
"from sagemaker.feature_store.feature_definition import FeatureTypeEnum\n",
"from sagemaker.feature_store.feature_group import FeatureGroup\n",
"\n",
"import time\n",
"from time import gmtime, strftime\n",
"import uuid\n",
"\n",
"\n",
"print(sagemaker.__version__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%store -r\n",
"%store"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get `domain_id` and `execution_role`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"NOTEBOOK_METADATA_FILE = \"/opt/ml/metadata/resource-metadata.json\"\n",
"domain_id = None\n",
"\n",
"if os.path.exists(NOTEBOOK_METADATA_FILE):\n",
" with open(NOTEBOOK_METADATA_FILE, \"rb\") as f:\n",
" domain_id = json.loads(f.read()).get('DomainId')\n",
" print(f\"SageMaker domain id: {domain_id}\")\n",
"\n",
"%store domain_id"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"r = boto3.client(\"sagemaker\").describe_domain(DomainId=domain_id)\n",
"execution_role = r[\"DefaultUserSettings\"][\"ExecutionRole\"]\n",
"\n",
"%store execution_role"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get S3 bucket for data\n",
"We use the SageMaker default bucket for storing all solution artifacts and data. You can choose to create or use your own bucket. Make sure you have corresponding permissions attached to the SageMaker execution role and to `AmazonSageMakerServiceCatalogProductsUseRole` role to be able to list, read, and put objects into the bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_bucket = None # you can use your own S3 bucket name\n",
"sagemaker_session = Session()\n",
"\n",
"if data_bucket is None:\n",
" data_bucket = sagemaker_session.default_bucket()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(data_bucket)"
]
},
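{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can run a quick check that the notebook's execution role can write to and list the data bucket. This is a minimal sketch, not part of the project provisioning; the test object key below is an illustrative placeholder and the object is removed at the end."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: verify basic S3 permissions on the data bucket (sketch, not required)\n",
"s3_client = boto3.client(\"s3\")\n",
"test_key = \"feature-store-ingestion-pipeline/permission-check.txt\"  # illustrative placeholder key\n",
"\n",
"# put, list, and then delete a small test object\n",
"s3_client.put_object(Bucket=data_bucket, Key=test_key, Body=b\"permission check\")\n",
"response = s3_client.list_objects_v2(Bucket=data_bucket, Prefix=test_key)\n",
"print(f\"Found {response.get('KeyCount', 0)} object(s) under {test_key}\")\n",
"s3_client.delete_object(Bucket=data_bucket, Key=test_key)"
]
},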
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⭐ You can keep the following literals set to their default values or change them if you would like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# set some literals\n",
"s3_data_prefix = f\"{data_bucket}/feature-store-ingestion-pipeline/dataset/\"\n",
"s3_flow_prefix = f\"{data_bucket}/feature-store-ingestion-pipeline/dw-flow/\"\n",
"s3_fs_query_output_prefix = f\"{data_bucket}/feature-store-ingestion-pipeline/fs_query_results/\"\n",
"\n",
"dw_flow_name = \"dw-flow\" # change to your custom file name if you use a different one\n",
"unique_suffix = f\"{strftime('%d-%H-%M-%S', gmtime())}-{str(uuid.uuid4())[:8]}\"\n",
"abalone_dataset_file_name = \"abalone.csv\"\n",
"abalone_dataset_local_path = \"../dataset/\"\n",
"abalone_dataset_local_url = f\"{abalone_dataset_local_path}{abalone_dataset_file_name}\"\n",
"\n",
"print(f\"Data Wrangler flow upload and a feature group will have this unique suffix: {unique_suffix}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download the dataset\n",
"We use a well-known [Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#abalone) in this solution. The dataset contains 4177 rows of data, and 8 features.\n",
"\n",
"Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ../dataset\n",
"!rm -fr ../dataset/*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the dataset from [UCI website](http://archive.ics.uci.edu/ml/datasets/Abalone):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!cd {abalone_dataset_local_path} && wget -t inf http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data\n",
"!cd {abalone_dataset_local_path} && wget -t inf http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the dataset and print first five rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# dictionary of dataset columns and data types\n",
"columns = {\n",
" \"sex\":\"string\", \n",
" \"length\":\"float\", \n",
" \"diameter\":\"float\", \n",
" \"height\":\"float\", \n",
" \"whole_weight\":\"float\", \n",
" \"shucked_weight\":\"float\", \n",
" \"viscera_weight\":\"float\", \n",
" \"shell_weight\":\"float\",\n",
" \"rings\":\"long\"\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_df = pd.read_csv(f\"{abalone_dataset_local_path}abalone.data\", names=columns.keys())\n",
"print(f\"Data shape: {data_df.shape}\")\n",
"data_df.head()"
]
},
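{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, explore the data a bit further by looking at the column types, non-null counts, and summary statistics:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: column types, non-null counts, and summary statistics\n",
"data_df.info()\n",
"data_df.describe(include=\"all\")"
]
},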
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save the dataframe as CSV with the header and index\n",
"data_df.to_csv(abalone_dataset_local_url, index_label=\"record_id\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Upload the data to the data S3 bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!aws s3 cp {abalone_dataset_local_path}. s3://{s3_data_prefix} --recursive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"Data uploaded to s3://{s3_data_prefix}\")"
]
},
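{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the cell below sketches how the Feature Store classes imported at the beginning of the notebook could be used to define a feature group from the `columns` dictionary above. It is illustrative only: the feature group name, the offline store S3 prefix, and the `event_time` feature are assumptions, and the `create()` call is left commented out. The feature group for your project may be configured differently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch: map the simple type names above to Feature Store feature types\n",
"feature_type_map = {\n",
"    \"string\": FeatureTypeEnum.STRING,\n",
"    \"float\": FeatureTypeEnum.FRACTIONAL,\n",
"    \"long\": FeatureTypeEnum.INTEGRAL,\n",
"}\n",
"\n",
"example_feature_definitions = [\n",
"    FeatureDefinition(feature_name=name, feature_type=feature_type_map[dtype])\n",
"    for name, dtype in columns.items()\n",
"]\n",
"\n",
"# the feature group name below is an illustrative assumption\n",
"example_feature_group = FeatureGroup(\n",
"    name=f\"abalone-example-{unique_suffix}\",\n",
"    feature_definitions=example_feature_definitions,\n",
"    sagemaker_session=sagemaker_session,\n",
")\n",
"\n",
"# Feature Store requires a record identifier and an event time feature, so `record_id`\n",
"# and an `event_time` column would also have to be part of the ingested records and\n",
"# the feature definitions. Uncomment and adjust to actually create the feature group:\n",
"# example_feature_group.create(\n",
"#     s3_uri=f\"s3://{data_bucket}/feature-store-ingestion-pipeline/feature-store/\",\n",
"#     record_identifier_name=\"record_id\",\n",
"#     event_time_feature_name=\"event_time\",\n",
"#     role_arn=execution_role,\n",
"#     enable_online_store=False,\n",
"# )"
]
},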
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Wrangler flow\n",
"You can use the provided [Data Wrangler flow file](dw-flow.flow) and skip the **Create Data Wrangler flow** section and move on directly to **Set output name** step. Alternatively you can follow the instructions how to create a new flow with data transformations."
]
},
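{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you use the provided flow file, you can peek inside it to see which nodes it contains. The sketch below assumes the usual JSON layout of Data Wrangler `.flow` files, where each node carries a `node_id` and a `type` field; adjust the file path if you use a different flow file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# optional: inspect the provided Data Wrangler flow file\n",
"# (assumes the usual .flow JSON layout with a top-level \"nodes\" list)\n",
"with open(f\"{dw_flow_name}.flow\") as f:\n",
"    dw_flow = json.load(f)\n",
"\n",
"for node in dw_flow.get(\"nodes\", []):\n",
"    print(node.get(\"node_id\"), node.get(\"type\"))"
]
},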
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create Data Wrangler flow (OPTIONAL)\n",
"\n",
"
Shutting down your kernel for this notebook to release resources.
\n", "\n", " \n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Proceed to the [`01-feature-store-ingest-pipeline` notebook](01-feature-store-ingest-pipeline.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.12 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "vscode": { "interpreter": { "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" } } }, "nbformat": 4, "nbformat_minor": 4 }