# Upload sample data and set up a SageMaker Data Wrangler data flow

This notebook uploads the sample data files provided in the `./data` directory to the default Amazon SageMaker S3 bucket. It also generates a new Data Wrangler `.flow` file from the provided template and uploads that flow file to S3.

---

Import required dependencies and initialize variables


In [2]:
import json
import time
import boto3
import string
import sagemaker

# SageMaker session (created once and reused below)
sess = sagemaker.Session()
region = sess.boto_region_name
print("Using AWS Region: {}".format(region))

boto3.setup_default_session(region_name=region)

s3_client = boto3.client('s3', region_name=region)

# You can configure this with your own bucket name, e.g.
# bucket = "my-bucket"
bucket = sess.default_bucket()
prefix = "data-wrangler-pipeline"
bucket

Using AWS Region: us-east-2


'sagemaker-us-east-2-716469146435'
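
The bucket name printed above follows the pattern `sagemaker-{region}-{account_id}`, which is what `default_bucket()` returns when no custom default bucket is configured. As a minimal sketch, the helper below rebuilds that name from its two parts; `expected_default_bucket` is a hypothetical name and not part of the SageMaker SDK. It can help when checking that you are working in the account and Region you expect.

```python
def expected_default_bucket(region: str, account_id: str) -> str:
    """Build the default SageMaker bucket name for a Region and account.

    Assumption: no custom default bucket has been configured for the session.
    """
    return f"sagemaker-{region}-{account_id}"


# Values taken from the cell output above
print(expected_default_bucket("us-east-2", "716469146435"))
```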

---
# Upload sample data to S3

We have provided two sample data files, `claims.csv` and `customers.csv`, in the `./data` directory. They contain synthetically generated insurance claim data, which we will use to train an XGBoost model. The purpose of the model is to predict whether an insurance claim is fraudulent or legitimate.

To begin, we upload both files to the default SageMaker bucket.

In [3]:
s3_client.upload_file(Filename='data/claims.csv', Bucket=bucket, Key=f'{prefix}/claims.csv')
s3_client.upload_file(Filename='data/customers.csv', Bucket=bucket, Key=f'{prefix}/customers.csv')
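
Each upload above writes to the key `{prefix}/{file name}` in the bucket. The sketch below wraps that pattern in a helper that returns the resulting S3 URIs, so the local path, key, and URI stay in sync. `upload_files` is a hypothetical name introduced for illustration; it accepts any object that exposes boto3's `upload_file(Filename=..., Bucket=..., Key=...)` method.

```python
from pathlib import Path


def upload_files(client, local_paths, bucket, prefix):
    """Upload each local file to s3://{bucket}/{prefix}/{file name}.

    `client` is expected to behave like boto3.client('s3').
    Returns the list of S3 URIs that were written.
    """
    uris = []
    for path in local_paths:
        # Keep only the file name so 'data/claims.csv' maps to '{prefix}/claims.csv'
        key = f"{prefix}/{Path(path).name}"
        client.upload_file(Filename=str(path), Bucket=bucket, Key=key)
        uris.append(f"s3://{bucket}/{key}")
    return uris
```

With the variables defined earlier in this notebook, the call would be `upload_files(s3_client, ['data/claims.csv', 'data/customers.csv'], bucket, prefix)`.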

---
# Generate Data Wrangler `.flow` file

We have provided a Data Wrangler flow file template named `insurance_claims_flow_template`, from which we can create the `.flow` file. The template applies a number of transformations to the features in both `claims.csv` and `customers.csv`, and then joins the two files to generate a single training CSV dataset.

To create the `insurance_claims.flow` file, execute the code cell below.

In [5]:
claims_flow_template_file = "insurance_claims_flow_template"

# Updates the S3 bucket and prefix in the template
with open(claims_flow_template_file, 'r') as f:
    variables   = {'bucket': bucket, 'prefix': prefix}
    template    = string.Template(f.read())
    claims_flow = template.safe_substitute(variables)
    claims_flow = json.loads(claims_flow)

# Creates the .flow file
with open('insurance_claims.flow', 'w') as f:
    json.dump(claims_flow, f)
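
`string.Template.safe_substitute` replaces `$name` and `${name}` placeholders and leaves any unknown placeholder in place instead of raising `KeyError`. The sketch below illustrates this on a small made-up JSON snippet. The `${bucket}`, `${prefix}`, and `${owner}` placeholders and the `s3Uri` field are assumptions for illustration and may not match what the provided template contains. It also shows one way to detect placeholders that were left unresolved.

```python
import json
import re
import string

# Hypothetical template fragment; the real template may be structured differently
template_text = '{"s3Uri": "s3://${bucket}/${prefix}/claims.csv", "owner": "${owner}"}'

variables = {"bucket": "my-bucket", "prefix": "data-wrangler-pipeline"}
rendered = string.Template(template_text).safe_substitute(variables)

# safe_substitute leaves unknown placeholders untouched, so check for leftovers
unresolved = re.findall(r"\$\{(\w+)\}", rendered)

flow = json.loads(rendered)
print(flow["s3Uri"])
print("Unresolved placeholders:", unresolved)
```

Because the substitution never fails on a missing key, a check like this is a cheap way to catch a misspelled variable name before the flow file is written.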

Open the `insurance_claims.flow` file in SageMaker Studio.

<div class="alert alert-warning"> ⚠️ <strong> NOTE: </strong>
    The Data Wrangler UI is only available in the SageMaker Studio environment. If you are using SageMaker Classic notebooks, you will not be able to view the Data Wrangler UI, but you can still use the flow file programmatically.
</div>

The flow should look like the image below:

<img src="images/flow.png" width="800"/>

# Alternatively

You can also create this `.flow` file manually using the Data Wrangler UI in SageMaker Studio. See the [get started](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html) documentation to learn how to create a data flow with SageMaker Data Wrangler.

---
# Upload the `.flow` file to S3

Next, we upload the flow file to the S3 bucket. The executable Python script we generated earlier uses this `.flow` file to perform the transformations.

In [6]:
import time
import uuid

# unique flow export ID
flow_export_id = f"{time.strftime('%d-%H-%M-%S', time.gmtime())}-{str(uuid.uuid4())[:8]}"
flow_export_name = f"flow-{flow_export_id}"

s3_client.upload_file(Filename='insurance_claims.flow', Bucket=bucket, Key=f'{prefix}/flow/{flow_export_name}.flow')
ins_claim_flow_uri = f"s3://{bucket}/{prefix}/flow/{flow_export_name}.flow"
%store ins_claim_flow_uri

Stored 'ins_claim_flow_uri' (str)
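
The export ID combines a UTC timestamp (day, hour, minute, second) with the first eight hexadecimal characters of a random UUID, so repeated runs write to different keys. A minimal sketch of the same naming scheme as a reusable function follows. `make_flow_uri` is a hypothetical helper name, and the bucket and prefix values in the example are placeholders.

```python
import time
import uuid


def make_flow_uri(bucket, prefix):
    """Return (flow_name, s3_uri) using the naming scheme from the cell above."""
    timestamp = time.strftime("%d-%H-%M-%S", time.gmtime())
    suffix = str(uuid.uuid4())[:8]  # first 8 hex characters of a random UUID
    flow_name = f"flow-{timestamp}-{suffix}"
    return flow_name, f"s3://{bucket}/{prefix}/flow/{flow_name}.flow"


name, uri = make_flow_uri("my-bucket", "data-wrangler-pipeline")
print(name)
print(uri)
```

The timestamp has no month or year component, so it is the UUID suffix that keeps names distinct across months. In a later notebook, `%store -r ins_claim_flow_uri` restores the stored URI.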
