{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Upload sample data and setup SageMaker Data Wrangler data flow\n", "\n", "This notebook uploads the sample data files provided in the `./data` directory to the default Amazon SageMaker S3 bucket. You can also generate a new Data Wrangler `.flow` file using the provided template.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import required dependencies and initialize variables\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using AWS Region: us-east-2\n" ] }, { "data": { "text/plain": [ "'sagemaker-us-east-2-716469146435'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "import time\n", "import boto3\n", "import string\n", "import sagemaker\n", "\n", "region = sagemaker.Session().boto_region_name\n", "print(\"Using AWS Region: {}\".format(region))\n", "\n", "boto3.setup_default_session(region_name=region)\n", "\n", "s3_client = boto3.client('s3', region_name=region)\n", "# Sagemaker session\n", "sess = sagemaker.Session()\n", "\n", "# You can configure this with your own bucket name, e.g.\n", "# bucket = \"my-bucket\"\n", "bucket = sess.default_bucket()\n", "prefix = \"data-wrangler-pipeline\"\n", "bucket" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Upload sample data to S3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have provided two sample data files `claims.csv` and `customers.csv` in the `/data` directory. These contain synthetically generated insurance claim data which we will use to train an XGBoost model. The purpose of the model is to identify if an insurance claim is fraudulent or legitimate.\n", "\n", "To begin with, we will upload both the files to the default SageMaker bucket." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(Filename='data/claims.csv', Bucket=bucket, Key=f'{prefix}/claims.csv')\n", "s3_client.upload_file(Filename='data/customers.csv', Bucket=bucket, Key=f'{prefix}/customers.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Generate Data Wrangler `.flow` file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have provided a convenient Data Wrangler flow file template named `insurance_claims_flow_template` using which we can create the `.flow` file. This template has a number of transformations that are applied to the features available in both the `claims.csv` and `customers.csv` files, and finally it also joins the two file to generate a single training CSV dataset. \n", "\n", "To create the `insurance_claims.flow` file execute the code cell below" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "claims_flow_template_file = \"insurance_claims_flow_template\"\n", "\n", "# Updates the S3 bucket and prefix in the template\n", "with open(claims_flow_template_file, 'r') as f:\n", " variables = {'bucket': bucket, 'prefix': prefix}\n", " template = string.Template(f.read())\n", " claims_flow = template.safe_substitute(variables)\n", " claims_flow = json.loads(claims_flow)\n", "\n", "# Creates the .flow file\n", "with open('insurance_claims.flow', 'w') as f:\n", " json.dump(claims_flow, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Open the `insurance_claim.flow` file in SageMaker Studio.\n", "\n", "
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Generate Data Wrangler `.flow` file" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We have provided a Data Wrangler flow template named `insurance_claims_flow_template` that we can use to create the `.flow` file. The template applies a number of transformations to the features in both `claims.csv` and `customers.csv`, and finally joins the two files to produce a single training dataset.\n", "\n", "To create the `insurance_claims.flow` file, execute the code cell below." ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "claims_flow_template_file = \"insurance_claims_flow_template\"\n", "\n", "# Substitute the S3 bucket and prefix into the template\n", "with open(claims_flow_template_file, 'r') as f:\n", "    variables = {'bucket': bucket, 'prefix': prefix}\n", "    template = string.Template(f.read())\n", "    claims_flow = template.safe_substitute(variables)\n", "    claims_flow = json.loads(claims_flow)\n", "\n", "# Write the .flow file\n", "with open('insurance_claims.flow', 'w') as f:\n", "    json.dump(claims_flow, f)" ] },
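{ "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, the cell below is a minimal sketch that summarizes the generated flow. It assumes the template follows the usual Data Wrangler flow schema, in which the flow document carries a top-level `nodes` list and each node records a `type` and an `operator`; if the template differs, the loop simply prints nothing." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: summarize the generated flow (a sketch, assuming the usual Data Wrangler\n", "# flow schema with a top-level \"nodes\" list where each node has a \"type\" and an \"operator\").\n", "# If the template differs, this prints nothing.\n", "for node in claims_flow.get('nodes', []):\n", "    print(node.get('type'), '-', node.get('operator', ''))" ] },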
\n", "\n", "The flow should look as shown below\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Alternatively\n", "\n", "You can also create this `.flow` file manually using the SageMaker Studio's Data Wrangler UI. Visit the [get started](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html) documentation to learn how to create a data flow using SageMaker Data Wrangler." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Upload the `.flow` file to S3\n", "\n", "Next we will upload the flow file to the S3 bucket. The executable python script we generated earlier will make use of this `.flow` file to perform transformations." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stored 'ins_claim_flow_uri' (str)\n" ] } ], "source": [ "import time\n", "import uuid\n", "\n", "# unique flow export ID\n", "flow_export_id = f\"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}\"\n", "flow_export_name = f\"flow-{flow_export_id}\"\n", "\n", "s3_client.upload_file(Filename='insurance_claims.flow', Bucket=bucket, Key=f'{prefix}/flow/{flow_export_name}.flow')\n", "ins_claim_flow_uri=f\"s3://{bucket}/{prefix}/flow/{flow_export_name}.flow\"\n", "%store ins_claim_flow_uri" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }