{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Get started with SageMaker Processing\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "\n", "This notebook corresponds to the section \"Preprocessing Data With The Built-In Scikit-Learn Container\" in the blog post [Amazon SageMaker Processing – Fully Managed Data Processing and Model Evaluation](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/). \n", "It shows a lightweight example of using SageMaker Processing to create train, test, and validation datasets. SageMaker Processing is used to create these datasets, which then are written back to S3.\n", "\n", "## Runtime\n", "\n", "This notebook takes approximately 5 minutes to run.\n", "\n", "## Contents\n", "\n", "1. [Prepare resources](#Prepare-resources)\n", "1. [Download data](#Download-data)\n", "1. [Prepare Processing script](#Prepare-Processing-script)\n", "1. [Run Processing job](#Run-Processing-job)\n", "1. [Conclusion](#Conclusion)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Prepare resources\n", "\n", "First, let’s create an SKLearnProcessor object, passing the scikit-learn version we want to use, as well as our managed infrastructure requirements." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!pip install -U sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "\n", "region = sagemaker.Session().boto_region_name\n", "role = get_execution_role()\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version=\"1.2-1\", role=role, instance_type=\"ml.m5.xlarge\", instance_count=1\n", ")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Download data\n", "\n", "Read in the raw data from a public S3 bucket. This example uses the [Census-Income (KDD) Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29) from the UCI Machine Learning Repository.\n", "\n", "> Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "s3 = boto3.client(\"s3\")\n", "s3.download_file(\n", " \"sagemaker-sample-data-{}\".format(region),\n", " \"processing/census/census-income.csv\",\n", " \"census-income.csv\",\n", ")\n", "df = pd.read_csv(\"census-income.csv\")\n", "df.to_csv(\"dataset.csv\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Prepare Processing script\n", "\n", "Write the Python script that will be run by SageMaker Processing. This script reads the single data file from S3; splits the rows into train, test, and validation sets; and then writes the three output files to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "%%writefile preprocessing.py\n", "import pandas as pd\n", "import os\n", "from sklearn.model_selection import train_test_split\n", "\n", "input_data_path = os.path.join(\"/opt/ml/processing/input\", \"dataset.csv\")\n", "df = pd.read_csv(input_data_path)\n", "print(\"Shape of data is:\", df.shape)\n", "train, test = train_test_split(df, test_size=0.2)\n", "train, validation = train_test_split(train, test_size=0.2)\n", "\n", "try:\n", " os.makedirs(\"/opt/ml/processing/output/train\")\n", " os.makedirs(\"/opt/ml/processing/output/validation\")\n", " os.makedirs(\"/opt/ml/processing/output/test\")\n", " print(\"Successfully created directories\")\n", "except Exception as e:\n", " # if the Processing call already creates these directories (or directory otherwise cannot be created)\n", " print(e)\n", " print(\"Could not make directories\")\n", " pass\n", "\n", "try:\n", " train.to_csv(\"/opt/ml/processing/output/train/train.csv\")\n", " validation.to_csv(\"/opt/ml/processing/output/validation/validation.csv\")\n", " test.to_csv(\"/opt/ml/processing/output/test/test.csv\")\n", " print(\"Wrote files successfully\")\n", "except Exception as e:\n", " print(\"Failed to write the files\")\n", " print(e)\n", " pass\n", "\n", "print(\"Completed running the processing job\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Run Processing job" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Run the Processing job, specifying the script name, input file, and output files." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "%%capture output\n", "\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", "sklearn_processor.run(\n", " code=\"preprocessing.py\",\n", " # arguments = [\"arg1\", \"arg2\"], # Arguments can optionally be specified here\n", " inputs=[ProcessingInput(source=\"dataset.csv\", destination=\"/opt/ml/processing/input\")],\n", " outputs=[\n", " ProcessingOutput(source=\"/opt/ml/processing/output/train\"),\n", " ProcessingOutput(source=\"/opt/ml/processing/output/validation\"),\n", " ProcessingOutput(source=\"/opt/ml/processing/output/test\"),\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Get the Processing job logs and retrieve the job name." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "print(output)\n", "job_name = str(output).split(\"\\n\")[1].split(\" \")[-1]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Confirm that the output dataset files were written to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "import boto3\n", "\n", "s3_client = boto3.client(\"s3\")\n", "default_bucket = sagemaker.Session().default_bucket()\n", "for i in range(1, 4):\n", " prefix = s3_client.list_objects(Bucket=default_bucket, Prefix=\"sagemaker-scikit-learn\")[\n", " \"Contents\"\n", " ][-i][\"Key\"]\n", " print(\"s3://\" + default_bucket + \"/\" + prefix)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Conclusion\n", "\n", "In this notebook, we read a dataset from S3 and processed it into train, test, and validation sets using a SageMaker Processing job. You can extend this example for preprocessing your own datasets in preparation for machine learning or other applications." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker_processing|basic_sagemaker_data_processing|basic_sagemaker_processing.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }