# Get started with SageMaker Processing


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

---


This notebook corresponds to the section "Preprocessing Data With The Built-In Scikit-Learn Container" in the blog post [Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/). 
It shows a lightweight example of using SageMaker Processing to split a dataset into train, test, and validation sets, which are then written back to S3.

## Runtime

This notebook takes approximately 5 minutes to run.

## Contents

1. [Prepare resources](#Prepare-resources)
1. [Download data](#Download-data)
1. [Prepare Processing script](#Prepare-Processing-script)
1. [Run Processing job](#Run-Processing-job)
1. [Conclusion](#Conclusion)

## Prepare resources

First, upgrade the SageMaker Python SDK. Then create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements (instance type and instance count).

In [None]:
!pip install -U sagemaker

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = sagemaker.Session().boto_region_name
role = get_execution_role()
sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1", role=role, instance_type="ml.m5.xlarge", instance_count=1
)

## Download data

Download the raw data from a public S3 bucket and save a local copy as dataset.csv, which the Processing job later uses as its input. This example uses the [Census-Income (KDD) Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29) from the UCI Machine Learning Repository.

> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

In [None]:
import pandas as pd

s3 = boto3.client("s3")
s3.download_file(
    "sagemaker-sample-data-{}".format(region),
    "processing/census/census-income.csv",
    "census-income.csv",
)
df = pd.read_csv("census-income.csv")
df.to_csv("dataset.csv")
df.head()

## Prepare Processing script

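Write the Python script that will be run by SageMaker Processing. This script reads the single data file that SageMaker copies from S3 into the container; splits the rows into train, test, and validation sets; and then writes the three output files to local directories, which SageMaker uploads to S3 when the job finishes.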

In [None]:
%%writefile preprocessing.py
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# SageMaker copies the input file from S3 into this directory inside the container.
input_data_path = os.path.join("/opt/ml/processing/input", "dataset.csv")
df = pd.read_csv(input_data_path)
print("Shape of data is:", df.shape)

# Hold out 20% of the rows for test, then 20% of the remainder for validation.
train, test = train_test_split(df, test_size=0.2)
train, validation = train_test_split(train, test_size=0.2)

# Create each output directory. exist_ok=True tolerates directories that already exist,
# so one pre-existing directory does not prevent the others from being created.
for split in ("train", "validation", "test"):
    os.makedirs(os.path.join("/opt/ml/processing/output", split), exist_ok=True)
print("Output directories are ready")

try:
    train.to_csv("/opt/ml/processing/output/train/train.csv")
    validation.to_csv("/opt/ml/processing/output/validation/validation.csv")
    test.to_csv("/opt/ml/processing/output/test/test.csv")
    print("Wrote files successfully")
except Exception as e:
    print("Failed to write the files")
    print(e)
    # Re-raise so the job is marked as failed instead of completing with missing outputs.
    raise

print("Completed running the processing job")
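
The two chained 20% splits in the script leave roughly 64% of the rows for training, 16% for validation, and 20% for test. The sketch below works through that arithmetic in plain Python. It assumes that scikit-learn rounds a fractional test_size up to a whole number of rows and leaves the remainder on the training side; split_sizes is a hypothetical helper for illustration, not part of this notebook or of scikit-learn.

```python
import math
from fractions import Fraction


def split_sizes(n_rows, test_size=Fraction(1, 5)):
    """Return (train, validation, test) row counts for two chained splits.

    Hypothetical helper for illustration only. Assumes the held-out count is
    ceil(test_size * rows), with the remaining rows kept on the training side.
    A Fraction is used so the arithmetic is exact.
    """
    n_test = math.ceil(test_size * n_rows)
    remaining = n_rows - n_test
    n_validation = math.ceil(test_size * remaining)
    n_train = remaining - n_validation
    return n_train, n_validation, n_test


train_rows, validation_rows, test_rows = split_sizes(1000)
print(train_rows, validation_rows, test_rows)  # 640 160 200
```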

## Run Processing job

Run the Processing job, specifying the script name, the input file, and the output directories whose contents are uploaded to S3.

In [None]:
%%capture output

from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
    code="preprocessing.py",
    # arguments = ["arg1", "arg2"], # Arguments can optionally be specified here
    inputs=[ProcessingInput(source="dataset.csv", destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output/train"),
        ProcessingOutput(source="/opt/ml/processing/output/validation"),
        ProcessingOutput(source="/opt/ml/processing/output/test"),
    ],
)
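
The commented-out arguments parameter in the call above passes a list of command-line strings to the script. A minimal sketch of how preprocessing.py could read such strings with argparse follows. The flag names --test-size and --validation-size are hypothetical; nothing in this notebook defines them, and the script as written ignores any arguments.

```python
import argparse


def parse_args(argv=None):
    """Parse split fractions from command-line strings (hypothetical flags)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--test-size", type=float, default=0.2)
    parser.add_argument("--validation-size", type=float, default=0.2)
    return parser.parse_args(argv)


# Mirrors what the script would receive if run() were given
# arguments=["--test-size", "0.3"].
args = parse_args(["--test-size", "0.3"])
print(args.test_size, args.validation_size)  # 0.3 0.2
```

Inside the script, the parsed values could then replace the hard-coded 0.2 passed to train_test_split.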

Get the Processing job logs and retrieve the job name.

In [None]:
print(output)
# run() stores the job it started in latest_job, which avoids parsing the log text.
job_name = sklearn_processor.latest_job.job_name

Confirm that the output dataset files were written to S3.

In [None]:
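s3_client = boto3.client("s3")
default_bucket = sagemaker.Session().default_bucket()
# List only the objects written by this job. This assumes the SDK's default output
# location, <job name>/output/, because no destination was set on the outputs.
response = s3_client.list_objects_v2(Bucket=default_bucket, Prefix=job_name + "/output/")
for obj in response.get("Contents", []):
    print("s3://" + default_bucket + "/" + obj["Key"])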
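
Because the ProcessingOutput objects in this notebook set no destination, the SDK chooses the S3 locations. The sketch below reconstructs where the three files would be expected to land, assuming a default layout of s3://<default bucket>/<job name>/output/output-<n>/ with outputs numbered in the order they were declared. That layout is an assumption about the SDK's behavior and may differ across SDK versions or when a default bucket prefix is configured. The bucket and job names are hypothetical placeholders, and expected_output_uris is an illustrative helper, not an SDK function.

```python
def expected_output_uris(bucket, job_name, file_names):
    """Build the S3 URIs expected under the assumed default output layout."""
    return [
        "s3://{}/{}/output/output-{}/{}".format(bucket, job_name, index, file_name)
        for index, file_name in enumerate(file_names, start=1)
    ]


# Placeholder bucket and job name, for illustration only.
uris = expected_output_uris(
    "example-bucket", "example-job", ["train.csv", "validation.csv", "test.csv"]
)
for uri in uris:
    print(uri)
```

Comparing these expected URIs with the keys actually listed in the bucket is a quick way to confirm that each split was uploaded.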

## Conclusion

In this notebook, we read a dataset from S3 and processed it into train, test, and validation sets using a SageMaker Processing job. You can extend this example for preprocessing your own datasets in preparation for machine learning or other applications.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)
