{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Custom Framework Container

" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook demonstrates how to build and use a simple custom Docker container for training with Amazon SageMaker that uses the sagemaker-training-toolkit library to define framework containers.\n", "A framework container is similar to a script-mode container, but it additionally loads a Python framework module that configures the framework and then runs the user-provided module.\n", "\n", "Reference documentation is available at https://github.com/aws/sagemaker-training-toolkit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by defining some variables, such as the current execution role, the ECR repository to which we will push the custom Docker container, and a default Amazon S3 bucket to be used by Amazon SageMaker." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "ecr_namespace = \"sagemaker-training-containers/\"\n", "prefix = \"framework-container\"\n", "\n", "ecr_repository_name = ecr_namespace + prefix\n", "role = get_execution_role()\n", "account_id = role.split(\":\")[4]\n", "region = boto3.Session().region_name\n", "sagemaker_session = sagemaker.session.Session()\n", "bucket = sagemaker_session.default_bucket()\n", "\n", "print(account_id)\n", "print(region)\n", "print(role)\n", "print(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the Dockerfile, which defines the instructions for building our custom framework container:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ../docker/Dockerfile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At a high level, the Dockerfile specifies the following operations for building this container:\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Training module

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When looking at the Dockerfile above, you might be asking yourself what the custom_framework_training-1.0.0.tar.gz package is.\n", "When building a framework container, sagemaker-training-toolkit allows you to specify a framework module that is run first and then invokes a user-provided module.\n", "\n", "The advantage of this approach is that you can use the framework module to configure the framework of choice or apply any settings related to the libraries installed in the environment, and then run the user module (we will see how shortly).\n", "\n", "Our framework module is part of a Python package, which you can find in the folder ../package/, distributed as a .tar.gz by the Python setuptools library (https://setuptools.readthedocs.io/en/latest/).\n", "\n", "Setuptools uses a setup.py file to build the package. The content of this file is as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ../package/setup.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This build script looks at the packages under the local src/ path and specifies the dependency on sagemaker-training. The training module contains the following code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize ../package/src/custom_framework_training/training.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The idea here is that we will use the entry_point.run() function of the sagemaker-training-toolkit library to execute the user-provided module.\n", "You might want to set additional framework-level configurations (e.g. parameter servers) before calling the user module." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

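To make the division of labor concrete, the following is a minimal, standard-library-only sketch of what a framework module does conceptually: apply framework-level settings, then launch the user script with the hyperparameters as command-line arguments. It is not the sagemaker-training-toolkit implementation. The function name run_user_script, the extra_env parameter and the --key value argument layout are assumptions made for illustration.

```python
import os
import subprocess
import sys


def run_user_script(script_path, hyperparameters, extra_env=None):
    # Hypothetical helper: a stand-in for a framework module, not toolkit code.
    # Step 1: framework-level configuration, modeled here as environment variables.
    env = dict(os.environ)
    env.update(extra_env or {})

    # Step 2: turn hyperparameters into command-line arguments (--key value).
    args = []
    for key, value in sorted(hyperparameters.items()):
        args.extend(['--' + key, str(value)])

    # Step 3: run the user-provided script in a child Python process
    # and return whatever it printed to standard output.
    completed = subprocess.run(
        [sys.executable, script_path] + args,
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return completed.stdout
```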
Build and push the container

\n", "We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pygmentize ../scripts/build_and_push.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

--------------------------------------------------------------------------------------------------------------------

\n", "First, the script runs setup.py to create the training package, which is copied under ../docker/code/.\n", "\n", "Then it builds the Docker container, creates the repository if it does not exist, and finally pushes the container to the ECR repository. The build takes a few minutes the first time; Docker then caches the build outputs and reuses them in subsequent builds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
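The sequence performed by the script can be sketched as a list of shell commands. The contents of build_and_push.sh are not reproduced here, so the specific commands and flags below (python setup.py sdist, the aws ecr and docker invocations) and the helper name build_and_push_commands are assumptions about a typical implementation, not a copy of the script. The helper only builds the command strings; it does not run anything, and it leaves out copying the package under ../docker/code/.

```python
def build_and_push_commands(account_id, region, repository_name, tag='latest'):
    # Hypothetical helper: returns shell command strings, it does not run them.
    registry = '{0}.dkr.ecr.{1}.amazonaws.com'.format(account_id, region)
    image_uri = '{0}/{1}:{2}'.format(registry, repository_name, tag)
    return [
        # 1. Build the training package as a source distribution (.tar.gz).
        'python setup.py sdist',
        # 2. Build the Docker image from the Dockerfile in the current folder.
        'docker build -t {0} .'.format(repository_name),
        # 3. Create the ECR repository only if describing it fails.
        (
            'aws ecr describe-repositories --region {0} --repository-names {1} || '
            'aws ecr create-repository --region {0} --repository-name {1}'
        ).format(region, repository_name),
        # 4. Authenticate Docker against the registry.
        (
            'aws ecr get-login-password --region {0} | '
            'docker login --username AWS --password-stdin {1}'
        ).format(region, registry),
        # 5. Tag the local image with the full ECR URI and push it.
        'docker tag {0} {1}'.format(repository_name, image_uri),
        'docker push {0}'.format(image_uri),
    ]


for command in build_and_push_commands(
    '123456789012', 'us-west-2', 'sagemaker-training-containers/framework-container'
):
    print(command)
```

The account ID 123456789012 is a placeholder. The image URI in the last command has the same shape as the container_image_uri built later in this notebook.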

Training with Amazon SageMaker

\n", "\n", "Once we have pushed our container to Amazon ECR, we are ready to start training with Amazon SageMaker, which requires the ECR path of the training container image as a parameter when starting a training job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container_image_uri = \"{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest\".format(\n", "    account_id, region, ecr_repository_name\n", ")\n", "print(container_image_uri)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the purpose of this example is to explain how to build custom framework containers, we are not going to train a real model. The script that will be executed does not define any specific training logic; it just outputs the configurations injected by SageMaker and implements a dummy training loop. The training data is also dummy. Let's analyze the script first:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pygmentize source_dir/train.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the training code is implemented as a standard Python script, which the framework container code invokes as a module, passing hyperparameters as arguments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we upload some dummy data to Amazon S3 to define our S3-based training channels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! echo \"val1, val2, val3\" > dummy.csv\n", "print(sagemaker_session.upload_data(\"dummy.csv\", bucket, prefix + \"/train\"))\n", "print(sagemaker_session.upload_data(\"dummy.csv\", bucket, prefix + \"/val\"))\n", "! 
rm dummy.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Framework containers can dynamically run user-provided code loaded from Amazon S3, so we need to:\n", "\n", "1. Package the source files in a gzipped tar archive\n", "2. Upload the archive to Amazon S3\n", "\n", "Note: these steps are executed automatically by the Amazon SageMaker Python SDK when using framework estimators for MXNet, TensorFlow, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import tarfile\n", "import tempfile\n", "\n", "\n", "def create_tar_file(source_files, target=None):\n", "    # Write to the given target, or to a temporary file if none is given\n", "    if target:\n", "        filename = target\n", "    else:\n", "        _, filename = tempfile.mkstemp()\n", "\n", "    with tarfile.open(filename, mode=\"w:gz\") as t:\n", "        for sf in source_files:\n", "            # Add each file at the root of the archive, dropping its directory path\n", "            t.add(sf, arcname=os.path.basename(sf))\n", "    return filename\n", "\n", "\n", "create_tar_file([\"source_dir/train.py\", \"source_dir/utils.py\"], \"sourcedir.tar.gz\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sources = sagemaker_session.upload_data(\"sourcedir.tar.gz\", bucket, prefix + \"/code\")\n", "print(sources)\n", "! rm sourcedir.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When starting the training job, we need to let the sagemaker-training-toolkit library know where the sources are stored in Amazon S3 and which module to invoke. These parameters are specified through the following reserved hyperparameters (which are injected automatically when using framework estimators of the Amazon SageMaker Python SDK):\n", "\n", "* sagemaker_program: the script to be invoked (train.py in this example)\n", "* sagemaker_submit_directory: the Amazon S3 location of the sources archive\n", "\n", "Finally, we can execute the training job by calling the fit() method of the generic Estimator object defined in the Amazon SageMaker Python SDK (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py). 
This corresponds to calling the CreateTrainingJob() API (https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTrainingJob.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import json\n", "\n", "# JSON encode hyperparameters.\n", "def json_encode_hyperparameters(hyperparameters):\n", "    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}\n", "\n", "\n", "hyperparameters = json_encode_hyperparameters(\n", "    {\n", "        \"sagemaker_program\": \"train.py\",\n", "        \"sagemaker_submit_directory\": sources,\n", "        \"hp1\": \"value1\",\n", "        \"hp2\": 300,\n", "        \"hp3\": 0.001,\n", "    }\n", ")\n", "\n", "est = sagemaker.estimator.Estimator(\n", "    container_image_uri,\n", "    role,\n", "    train_instance_count=1,\n", "    train_instance_type=\"local\",\n", "    base_job_name=prefix,\n", "    hyperparameters=hyperparameters,\n", ")\n", "\n", "train_config = sagemaker.session.s3_input(\n", "    \"s3://{0}/{1}/train/\".format(bucket, prefix), content_type=\"text/csv\"\n", ")\n", "val_config = sagemaker.session.s3_input(\n", "    \"s3://{0}/{1}/val/\".format(bucket, prefix), content_type=\"text/csv\"\n", ")\n", "\n", "est.fit({\"train\": train_config, \"validation\": val_config})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
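The CreateTrainingJob API accepts hyperparameters only as string-to-string pairs, which is why the values are JSON-encoded before being passed to the estimator. The sketch below repeats the encoding helper and adds a hypothetical decode_hyperparameters function (an illustration, not part of any SDK) to show that the encoding round-trips and keeps numbers distinct from strings.

```python
import json


def json_encode_hyperparameters(hyperparameters):
    # Every value becomes a JSON string, as in the training cell above.
    return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}


def decode_hyperparameters(encoded):
    # Hypothetical inverse, for illustration only.
    return {k: json.loads(v) for (k, v) in encoded.items()}


original = {'sagemaker_program': 'train.py', 'hp1': 'value1', 'hp2': 300, 'hp3': 0.001}
encoded = json_encode_hyperparameters(original)

# All encoded values are strings, including the numeric ones.
print(all(isinstance(v, str) for v in encoded.values()))
# Decoding restores the original types and values.
print(decode_hyperparameters(encoded) == original)
```

Note that string values gain surrounding quotation marks when encoded, so a script that reads them needs to JSON-decode them (or strip the quotes) to get the plain value back.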

Training with a custom SDK framework estimator

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you have seen, in the previous steps we had to upload our code to Amazon S3 and then inject reserved hyperparameters to execute training. To simplify this, you can define a custom framework estimator using the Amazon SageMaker Python SDK and run training with that class, which takes care of these tasks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.estimator import Framework\n", "\n", "\n", "class CustomFramework(Framework):\n", "    def __init__(\n", "        self,\n", "        entry_point,\n", "        source_dir=None,\n", "        hyperparameters=None,\n", "        py_version=\"py3\",\n", "        framework_version=None,\n", "        image_name=None,\n", "        distributions=None,\n", "        **kwargs,\n", "    ):\n", "        super(CustomFramework, self).__init__(\n", "            entry_point, source_dir, hyperparameters, image_name=image_name, **kwargs\n", "        )\n", "\n", "    def _configure_distribution(self, distributions):\n", "        return\n", "\n", "    def create_model(\n", "        self,\n", "        model_server_workers=None,\n", "        role=None,\n", "        vpc_config_override=None,\n", "        entry_point=None,\n", "        source_dir=None,\n", "        dependencies=None,\n", "        image_name=None,\n", "        **kwargs,\n", "    ):\n", "        return None\n", "\n", "\n", "import sagemaker\n", "\n", "est = CustomFramework(\n", "    image_name=container_image_uri,\n", "    role=role,\n", "    entry_point=\"train.py\",\n", "    source_dir=\"source_dir/\",\n", "    train_instance_count=1,\n", "    train_instance_type=\"local\",  # we use local mode\n", "    # train_instance_type='ml.m5.xlarge',\n", "    base_job_name=prefix,\n", "    hyperparameters={\"hp1\": \"value1\", \"hp2\": \"300\", \"hp3\": \"0.001\"},\n", ")\n", "\n", "train_config = sagemaker.session.s3_input(\n", "    \"s3://{0}/{1}/train/\".format(bucket, prefix), content_type=\"text/csv\"\n", ")\n", "val_config = sagemaker.session.s3_input(\n", "    \"s3://{0}/{1}/val/\".format(bucket, prefix), 
content_type=\"text/csv\"\n", ")\n", "\n", "est.fit({\"train\": train_config, \"validation\": val_config})" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/advanced_functionality|custom-training-containers|framework-container|notebook|framework-container.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 4 }