{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Upload sample data and setup SageMaker Data Wrangler data flow\n", "\n", "This notebook uploads the sample data files provided in the `./data` directory to the default Amazon SageMaker S3 bucket. You can also generate a new Data Wrangler `.flow` file using the provided template.\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import required dependencies and initialize variables\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using AWS Region: us-east-2\n" ] }, { "data": { "text/plain": [ "'sagemaker-us-east-2-716469146435'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "import time\n", "import boto3\n", "import string\n", "import sagemaker\n", "\n", "region = sagemaker.Session().boto_region_name\n", "print(\"Using AWS Region: {}\".format(region))\n", "\n", "boto3.setup_default_session(region_name=region)\n", "\n", "s3_client = boto3.client('s3', region_name=region)\n", "# Sagemaker session\n", "sess = sagemaker.Session()\n", "\n", "# You can configure this with your own bucket name, e.g.\n", "# bucket = \"my-bucket\"\n", "bucket = sess.default_bucket()\n", "prefix = \"data-wrangler-pipeline\"\n", "bucket" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Upload sample data to S3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have provided two sample data files `claims.csv` and `customers.csv` in the `/data` directory. These contain synthetically generated insurance claim data which we will use to train an XGBoost model. The purpose of the model is to identify if an insurance claim is fraudulent or legitimate.\n", "\n", "To begin with, we will upload both the files to the default SageMaker bucket." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "s3_client.upload_file(Filename='data/claims.csv', Bucket=bucket, Key=f'{prefix}/claims.csv')\n", "s3_client.upload_file(Filename='data/customers.csv', Bucket=bucket, Key=f'{prefix}/customers.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Generate Data Wrangler `.flow` file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have provided a convenient Data Wrangler flow file template named `insurance_claims_flow_template` using which we can create the `.flow` file. This template has a number of transformations that are applied to the features available in both the `claims.csv` and `customers.csv` files, and finally it also joins the two file to generate a single training CSV dataset. \n", "\n", "To create the `insurance_claims.flow` file execute the code cell below" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "claims_flow_template_file = \"insurance_claims_flow_template\"\n", "\n", "# Updates the S3 bucket and prefix in the template\n", "with open(claims_flow_template_file, 'r') as f:\n", " variables = {'bucket': bucket, 'prefix': prefix}\n", " template = string.Template(f.read())\n", " claims_flow = template.safe_substitute(variables)\n", " claims_flow = json.loads(claims_flow)\n", "\n", "# Creates the .flow file\n", "with open('insurance_claims.flow', 'w') as f:\n", " json.dump(claims_flow, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Open the `insurance_claim.flow` file in SageMaker Studio.\n", "\n", "