{ "cells": [ { "cell_type": "markdown", "id": "a8f18b23", "metadata": { "papermill": { "duration": 0.006395, "end_time": "2022-04-18T00:08:55.010149", "exception": false, "start_time": "2022-04-18T00:08:55.003754", "status": "completed" }, "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "# Get started with SageMaker Processing\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "a8f18b23", "metadata": { "papermill": { "duration": 0.006395, "end_time": "2022-04-18T00:08:55.010149", "exception": false, "start_time": "2022-04-18T00:08:55.003754", "status": "completed" }, "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "\n", "This notebook corresponds to the section \"Preprocessing Data With The Built-In Scikit-Learn Container\" in the blog post [Amazon SageMaker Processing \u2013 Fully Managed Data Processing and Model Evaluation](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/). \n", "It shows a lightweight example of using SageMaker Processing to create train, test, and validation datasets and write them back to S3.\n", "\n", "## Runtime\n", "\n", "This notebook takes approximately 5 minutes to run.\n", "\n", "## Contents\n", "\n", "1. [Prepare resources](#Prepare-resources)\n", "1. [Download data](#Download-data)\n", "1. [Prepare Processing script](#Prepare-Processing-script)\n", "1. [Run Processing job](#Run-Processing-job)\n", "1. [Conclusion](#Conclusion)" ] }, { "cell_type": "markdown", "id": "3cf7028a", "metadata": { "papermill": { "duration": 0.006333, "end_time": "2022-04-18T00:08:55.022942", "exception": false, "start_time": "2022-04-18T00:08:55.016609", "status": "completed" }, "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "## Prepare resources\n", "\n", "First, let\u2019s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements." ] }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "!pip install -U sagemaker" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 2, "id": "862f8d1f", "metadata": { "execution": { "iopub.execute_input": "2022-04-18T00:08:55.039310Z", "iopub.status.busy": "2022-04-18T00:08:55.038857Z", "iopub.status.idle": "2022-04-18T00:08:56.057474Z", "shell.execute_reply": "2022-04-18T00:08:56.057892Z" }, "papermill": { "duration": 1.028712, "end_time": "2022-04-18T00:08:56.058050", "exception": false, "start_time": "2022-04-18T00:08:55.029338", "status": "completed" }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "\n", "region = sagemaker.Session().boto_region_name\n", "role = get_execution_role()\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version=\"1.2-1\", role=role, instance_type=\"ml.m5.xlarge\", instance_count=1\n", ")" ] }, { "cell_type": "markdown", "id": "b35ea4ea", "metadata": { "papermill": { "duration": 0.006588, "end_time": "2022-04-18T00:08:56.071404", "exception": false, "start_time": "2022-04-18T00:08:56.064816", "status": "completed" }, "pycharm": { "name": "#%% md\n" }, "tags": [] }, "source": [ "## Download data\n", "\n", "Read in the raw data from a public S3 bucket. 
This example uses the [Census-Income (KDD) Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29) from the UCI Machine Learning Repository.\n", "\n", "> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science." ] }, { "cell_type": "code", "execution_count": 3, "id": "6eaf6050", "metadata": { "execution": { "iopub.execute_input": "2022-04-18T00:08:56.096003Z", "iopub.status.busy": "2022-04-18T00:08:56.095500Z", "iopub.status.idle": "2022-04-18T00:09:00.816015Z", "shell.execute_reply": "2022-04-18T00:09:00.815586Z" }, "papermill": { "duration": 4.738175, "end_time": "2022-04-18T00:09:00.816126", "exception": false, "start_time": "2022-04-18T00:08:56.077951", "status": "completed" }, "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "<table>\n", " <thead>\n",
" <tr><th></th><th>age</th><th>class of worker</th><th>detailed industry recode</th><th>detailed occupation recode</th><th>education</th><th>wage per hour</th><th>enroll in edu inst last wk</th><th>marital stat</th><th>major industry code</th><th>major occupation code</th><th>...</th><th>country of birth father</th><th>country of birth mother</th><th>country of birth self</th><th>citizenship</th><th>own business or self employed</th><th>fill inc questionnaire for veteran's admin</th><th>veterans benefits</th><th>weeks worked in year</th><th>year</th><th>income</th></tr>\n",
" </thead>\n", " <tbody>\n",
" <tr><th>0</th><td>73</td><td>Not in universe</td><td>0</td><td>0</td><td>High school graduate</td><td>0</td><td>Not in universe</td><td>Widowed</td><td>Not in universe or children</td><td>Not in universe</td><td>...</td><td>United-States</td><td>United-States</td><td>United-States</td><td>Native- Born in the United States</td><td>0</td><td>Not in universe</td><td>2</td><td>0</td><td>95</td><td>- 50000.</td></tr>\n",
" <tr><th>1</th><td>58</td><td>Self-employed-not incorporated</td><td>4</td><td>34</td><td>Some college but no degree</td><td>0</td><td>Not in universe</td><td>Divorced</td><td>Construction</td><td>Precision production craft &amp; repair</td><td>...</td><td>United-States</td><td>United-States</td><td>United-States</td><td>Native- Born in the United States</td><td>0</td><td>Not in universe</td><td>2</td><td>52</td><td>94</td><td>- 50000.</td></tr>\n",
" <tr><th>2</th><td>18</td><td>Not in universe</td><td>0</td><td>0</td><td>10th grade</td><td>0</td><td>High school</td><td>Never married</td><td>Not in universe or children</td><td>Not in universe</td><td>...</td><td>Vietnam</td><td>Vietnam</td><td>Vietnam</td><td>Foreign born- Not a citizen of U S</td><td>0</td><td>Not in universe</td><td>2</td><td>0</td><td>95</td><td>- 50000.</td></tr>\n",
" <tr><th>3</th><td>9</td><td>Not in universe</td><td>0</td><td>0</td><td>Children</td><td>0</td><td>Not in universe</td><td>Never married</td><td>Not in universe or children</td><td>Not in universe</td><td>...</td><td>United-States</td><td>United-States</td><td>United-States</td><td>Native- Born in the United States</td><td>0</td><td>Not in universe</td><td>0</td><td>0</td><td>94</td><td>- 50000.</td></tr>\n",
" <tr><th>4</th><td>10</td><td>Not in universe</td><td>0</td><td>0</td><td>Children</td><td>0</td><td>Not in universe</td><td>Never married</td><td>Not in universe or children</td><td>Not in universe</td><td>...</td><td>United-States</td><td>United-States</td><td>United-States</td><td>Native- Born in the United States</td><td>0</td><td>Not in universe</td><td>0</td><td>0</td><td>94</td><td>- 50000.</td></tr>\n",
" </tbody>\n", "</table>\n", "<p>5 rows \u00d7 42 columns</p>\n", "