{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bring your own pipe-mode algorithm to Amazon SageMaker\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_**Create a Docker container for training SageMaker algorithms using Pipe-mode**_\n", "\n", "---\n", "\n", "## Contents\n", "\n", "1. [Overview](#Overview)\n", "1. [Preparation](#Preparation)\n", "    1. [Permissions](#Permissions)\n", "1. [Code](#Code)\n", "    1. [train.py](#train.py)\n", "    1. [Dockerfile](#Dockerfile)\n", "1. [Customize](#Customize)\n", "1. [Train](#Train)\n", "1. [Conclusion](#Conclusion)\n", "\n", "\n", "---\n", "## Overview\n", "\n", "SageMaker Training supports two mechanisms for transferring training data to a training algorithm: File-mode and Pipe-mode.\n", "\n", "In File-mode, training data is downloaded to an encrypted EBS volume before training starts. Once the download completes, the training algorithm trains by reading the downloaded training data files.\n", "\n", "In Pipe-mode, on the other hand, the input data is streamed to the algorithm while it is training. This offers a few significant advantages over File-mode:\n", "\n", "\n", "* In File-mode, training startup time is proportional to the size of the input data. In Pipe-mode, the startup delay is constant, independent of the size of the input data. 
This translates to much faster training startup for training jobs with large GB/PB-scale training datasets.\n", "* You do not need to allocate (and pay for) a large disk volume to be able to download the dataset.\n", "* Throughput on IO-bound Pipe-mode algorithms can be several times higher than on equivalent File-mode algorithms.\n", "\n", "However, these advantages come at a cost: a more complicated programming model than simply reading from files on a disk. This notebook aims to clarify what you need to do in order to use Pipe-mode in your custom training algorithm.\n", "\n", "\n", "---\n", "## Preparation\n", "\n", "_This notebook was created and tested on an ml.t2.medium notebook instance._\n", "\n", "Let's start by specifying:\n", "\n", "- S3 URIs `s3_training_input` and `s3_model_output` that you want to use for training input and model data respectively. These should be within the same region as the Notebook Instance, training, and hosting. Since the \"algorithm\" you're building here doesn't require any specific data format, feel free to point `s3_training_input` to any S3 dataset you have. The bigger the dataset, the better it tests raw IO throughput. For this example, the California Housing dataset will be copied over to your S3 bucket.\n", "- The `training_instance_type` to use for training. More powerful instance types have more CPU and bandwidth, which results in higher throughput.\n", "- The IAM role ARN used to give training access to your data.\n", "\n", "The California Housing dataset was originally published in:\n", "\n", "> Pace, R. Kelley, and Ronald Barry. \\\"Sparse spatial autoregressions.\\\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", "\n", "### Permissions\n", "\n", "Running this notebook requires permissions in addition to the normal `AmazonSageMakerFullAccess` permissions. This is because you'll be creating a new repository in Amazon ECR. 
The easiest way to add these permissions is simply to add the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions will be available immediately." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true }, "outputs": [], "source": [ "import boto3\n", "import pandas as pd\n", "import sagemaker\n", "\n", "from sklearn.datasets import fetch_california_housing\n", "\n", "# Get SageMaker session & default S3 bucket\n", "role = sagemaker.get_execution_role()\n", "sagemaker_session = sagemaker.Session()\n", "region = sagemaker_session.boto_region_name\n", "s3 = sagemaker_session.boto_session.resource(\"s3\")\n", "bucket = sagemaker_session.default_bucket()  # replace with your own bucket name if you have one" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Helper functions to upload data to S3\n", "def write_to_s3(filename, bucket, prefix):\n", "    # Objects are stored under <prefix>/<name without extension>/<filename>\n", "    filename_key = filename.split(\".\")[0]\n", "    key = \"{}/{}/{}\".format(prefix, filename_key, filename)\n", "    return s3.Bucket(bucket).upload_file(filename, key)\n", "\n", "\n", "def upload_to_s3(bucket, prefix, filename):\n", "    # Mirror the key layout used by write_to_s3 so the logged URL is accurate\n", "    filename_key = filename.split(\".\")[0]\n", "    url = \"s3://{}/{}/{}/{}\".format(bucket, prefix, filename_key, filename)\n", "    print(\"Writing data to {}\".format(url))\n", "    write_to_s3(filename, bucket, prefix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have a larger dataset you want to try, here is the place to swap in your dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = \"california_housing.csv\"\n", "# Load the dataset from sklearn.datasets and write it to a local CSV file\n", "tabular_data = fetch_california_housing()\n", "tabular_data_full = pd.DataFrame(tabular_data.data, columns=tabular_data.feature_names)\n", "tabular_data_full[\"target\"] = tabular_data.target\n", "tabular_data_full.to_csv(filename, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload the dataset to your bucket. You'll find it under the `pipe_bring_your_own/training` prefix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prefix = \"pipe_bring_your_own/training\"\n", "training_data = \"s3://{}/{}\".format(bucket, prefix)\n", "print(\"Training data in {}\".format(training_data))\n", "upload_to_s3(bucket, prefix, filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Code\n", "\n", "For the purposes of this demo you're going to write an extremely simple \u201ctraining\u201d algorithm in Python. It will conform to the specifications required by SageMaker Training and will read data in Pipe-mode, but it will do nothing with the data other than read it and throw it away. This keeps the example focused on exactly what's needed to support Pipe-mode, without complicating the code with a real training algorithm.\n", "\n", "In Pipe-mode, data is pre-fetched from S3 at high concurrency and throughput and streamed into Unix Named Pipes (aka FIFOs), with one FIFO per Channel per epoch. The algorithm must open the FIFO for reading, read through to end-of-file (or optionally abort mid-stream), and close its end of the file descriptor when done. 
It can then optionally wait for the next epoch's FIFO to be created and commence reading, iterating through epochs until it has achieved its completion criteria.\n", "\n", "For this example, you'll need two supporting files:\n", "\n", "### train.py\n", "\n", "`train.py` simply iterates through 5 epochs on the `training` Channel. Each epoch involves reading the training data stream from a FIFO named `/opt/ml/input/data/training_${epoch}`. At the end of the epoch the code moves on to the next epoch, waits for the new epoch's FIFO to be created, and continues.\n", "\n", "A lot of the code in `train.py` is merely boilerplate that deals with printing log messages, trapping termination signals, and so on. The cell below displays the file, including the main loop that reads each epoch's data through its corresponding FIFO:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize train.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dockerfile\n", "You can use any of the preconfigured Docker containers that SageMaker provides, or build one from scratch. This example uses the [PyTorch - AWS Deep Learning Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), then adds `train.py`, and finally runs `train.py` when the entrypoint is launched. To learn more about bring-your-own-container training options, see the [Amazon SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%cat Dockerfile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customize\n", "\n", "To fetch the PyTorch AWS Deep Learning Container (DLC), first log in to ECR." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sh\n", "REGION=$(aws configure get region)\n", "account=$(aws sts get-caller-identity --query Account --output text)\n", "# Log in to the AWS Deep Learning Containers registry to pull the base image\n", "aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com\n", "# Log in to your own account's registry to push the custom image\n", "aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${REGION}.amazonaws.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, build your custom Docker container, tagging it with the name \"pipe_bring_your_own\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sh\n", "docker build -t pipe_bring_your_own . --build-arg region=$(aws configure get region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the container built, you can now tag it with the full name you will need when calling it for training (`ecr_image`). Then upload your custom container to ECR." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "account = !aws sts get-caller-identity --query Account --output text\n", "algorithm_name = \"pipe_bring_your_own\"\n", "ecr_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account[0], region, algorithm_name)\n", "print('ecr_image: {}'.format(ecr_image))\n", "\n", "# Create the ECR repository only if it does not already exist\n", "ecr_client = boto3.client('ecr')\n", "try:\n", "    response = ecr_client.describe_repositories(\n", "        repositoryNames=[\n", "            algorithm_name,\n", "        ],\n", "    )\n", "    print(\"Repo exists...\")\n", "except ecr_client.exceptions.RepositoryNotFoundException:\n", "    create_repo = ecr_client.create_repository(repositoryName=algorithm_name)\n", "    print(\"Created repo...\")\n", "\n", "!docker tag {algorithm_name} {ecr_image}\n", "!docker push {ecr_image}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train\n", "\n", "Now, you will use the `Estimator` class and pass in the information needed to run the training container in SageMaker.\n", "Note that `input_mode` is the parameter that selects Pipe-mode for this training run. Also note that `base_job_name` doesn't allow underscores, which is why it uses dashes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.estimator import Estimator\n", "\n", "estimator = Estimator(\n", "    image_uri=ecr_image,\n", "    role=role,\n", "    base_job_name=\"pipe-bring-your-own-test\",\n", "    instance_count=1,\n", "    instance_type=\"ml.c4.xlarge\",\n", "    input_mode=\"Pipe\",\n", ")\n", "\n", "# Start training\n", "estimator.fit(training_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the throughput reported in the training logs above. 
By way of comparison, a File-mode algorithm will achieve at most approximately 150MB/s on a high-end `ml.c5.18xlarge` and approximately 75MB/s on a `ml.m4.xlarge`.\n", "\n", "---\n", "## Conclusion\n", "There are a few situations where Pipe-mode may not be the optimum choice for training, in which case you should stick to using File-mode:\n", "\n", "* If your algorithm needs to backtrack or skip ahead within an epoch. This is simply not possible in Pipe-mode since the underlying FIFO cannot support `lseek()` operations.\n", "* If your training dataset is small enough to fit in memory and you need to run multiple epochs. In this case it may be quicker and easier just to load it all into memory and iterate.\n", "* If your training dataset is not easily parseable from a streaming source.\n", "\n", "In all other scenarios, if you have an IO-bound training algorithm, switching to Pipe-mode may give you a significant throughput boost and will reduce the size of the disk volume required. This should both save you time and reduce training costs.\n", "\n", "You can read more about building your own training algorithms in the [SageMaker Training documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This us-east-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/advanced_functionality|pipe_bring_your_own|pipe_bring_your_own.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 4 }