{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using TensorFlow Scripts in SageMaker - Quickstart\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Starting with TensorFlow version 1.11, you can use SageMaker's TensorFlow containers to train TensorFlow scripts the same way you would train outside SageMaker. This feature is named **Script Mode**. \n", "\n", "This example uses \n", "[Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow). \n", "You can use the same technique for other scripts or repositories, including \n", "[TensorFlow Model Zoo](https://github.com/tensorflow/models) and \n", "[TensorFlow benchmark scripts](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get the data\n", "For training data, we use plain text versions of Sherlock Holmes stories.\n", "Let's create a folder named **sherlock** to store our dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "data_dir = os.path.join(os.getcwd(), \"sherlock\")\n", "\n", "os.makedirs(data_dir, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to download the dataset to this folder:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget https://sherlock-holm.es/stories/plain-text/cnus.txt --force-directories --output-document=sherlock/input.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the training script\n", "\n", "For the training script, we use the Git integration of the SageMaker Python SDK. With it, you can specify a training script stored in a GitHub, CodeCommit, or other Git repository as the entry point for the estimator, so you don't have to download the scripts locally. If you do so, the source directory and any dependencies must be in the same repo.\n", "\n", "To use Git integration, pass a dict `git_config` as a parameter when you create the `TensorFlow` Estimator object. In the `git_config` parameter, you specify the fields `repo`, `branch` and `commit` to locate the specific repo you want to use. If authentication is required to access the repo, you can specify the fields `2FA_enabled`, `username`, `password` and `token` accordingly.\n", "\n", "The scripts we want to use for this example are stored in the GitHub repo \n", "[https://github.com/awslabs/amazon-sagemaker-examples/tree/training-scripts](https://github.com/awslabs/amazon-sagemaker-examples/tree/training-scripts), \n", "under the branch `training-scripts`. It is a public repo so we don't need authentication to access it. 
Let's specify the `git_config` argument here: \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "git_config = {\n", "    \"repo\": \"https://github.com/awslabs/amazon-sagemaker-examples.git\",\n", "    \"branch\": \"training-scripts\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we did not specify `commit` in `git_config` here, so the latest commit of the specified repo and branch will be used by default. \n", "\n", "The scripts we will use are under the `char-rnn-tensorflow` directory in the repo. The directory also includes a [README.md](https://github.com/awslabs/amazon-sagemaker-examples/blob/training-scripts/README.md#basic-usage) with an overview of the project, requirements, and basic usage:\n", "\n", "> #### **Basic Usage**\n", "> _To train with default parameters on the tinyshakespeare corpus, run **python train.py**. \n", "To access all the parameters use **python train.py --help.**_\n", "\n", "[train.py](https://github.com/awslabs/amazon-sagemaker-examples/blob/training-scripts/char-rnn-tensorflow/train.py#L11) uses the [argparse](https://docs.python.org/3/library/argparse.html) library and defines, among others, the following arguments:\n", "\n", "```python\n", "parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n", "# Data and model checkpoints directories\n", "parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare', help='data directory containing input.txt with training examples')\n", "parser.add_argument('--save_dir', type=str, default='save', help='directory to store checkpointed models')\n", "...\n", "args = parser.parse_args()\n", "\n", "```\n", "When SageMaker training finishes, it deletes all data generated inside the container with the exception of the directories `/opt/ml/model` and `/opt/ml/output`. 
To ensure that model data is not lost during training, training scripts are invoked in SageMaker with an additional argument `--model_dir`. The training script should save the model data that results from the training job to this directory.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training script executes in the container as shown below:\n", "\n", "```bash\n", "python train.py --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test locally using SageMaker Python SDK TensorFlow Estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the SageMaker Python SDK [`TensorFlow`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/README.rst#training-with-tensorflow) estimator to easily train locally and in SageMaker. \n", "\n", "Let's start by setting the training script arguments `--num_epochs` and `--data_dir` as hyperparameters. We don't need to provide `--model_dir`, because SageMaker adds it when it invokes the script:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\"num_epochs\": 1, \"data_dir\": \"/opt/ml/input/data/training\"}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker's managed training or hosting environments. Just change your estimator's `instance_type` to `local` or `local_gpu`. For more information, see: https://github.com/aws/sagemaker-python-sdk#local-mode.\n", "\n", "To use this feature, you need to install docker-compose (and nvidia-docker if training with a GPU). Running the following script installs docker-compose or nvidia-docker-compose and configures the notebook environment for you.\n", "\n", "Note that you can only run a single local notebook at a time." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!/bin/bash ./setup.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To train locally, set the variable `train_instance_type`, which is passed to the estimator as `instance_type`, to [local](https://github.com/aws/sagemaker-python-sdk#local-mode):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_instance_type = \"local\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create the `TensorFlow` Estimator, passing the `git_config` argument. Note that we are using Git integration here, so `source_dir` should be a relative path inside the Git repo; otherwise it should be a relative or absolute local path. The `TensorFlow` Estimator is created as follows: \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import sagemaker\n", "from sagemaker.tensorflow import TensorFlow\n", "\n", "\n", "estimator = TensorFlow(\n", "    entry_point=\"train.py\",\n", "    source_dir=\"char-rnn-tensorflow\",\n", "    git_config=git_config,\n", "    instance_type=train_instance_type,\n", "    instance_count=1,\n", "    hyperparameters=hyperparameters,\n", "    role=sagemaker.get_execution_role(),  # Passes to the container the AWS role that you are using on this notebook\n", "    framework_version=\"1.15.2\",\n", "    py_version=\"py3\",\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start a training job, we call `estimator.fit(inputs)`, where `inputs` is a dictionary whose keys, named **channels**, \n", "map to the locations of the data. `estimator.fit(inputs)` downloads the TensorFlow container (Python 3, CPU version) locally and simulates a SageMaker training job. 
\n", "When training starts, the TensorFlow container executes **train.py**, passing `hyperparameters` and `model_dir` as script arguments, executing the example as follows:\n", "```bash\n", "python -m train --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs = {\"training\": f\"file://{data_dir}\"}\n", "\n", "estimator.fit(inputs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explain the values of `--data_dir` and `--model_dir` in more detail:\n", "\n", "- **/opt/ml/input/data/training** is the directory inside the container where the training data is downloaded. The data is downloaded to this folder because `training` is the channel name defined in ```estimator.fit({'training': inputs})```. See [training data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-trainingdata) for more information. \n", "\n", "- **/opt/ml/model** is the directory to use for saving models, checkpoints, or any other data. Any data saved in this folder is saved in the S3 bucket defined for training. 
See [model data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-envvariables) for more information.\n", "\n", "### Reading additional information from the container\n", "\n", "Often, a user script needs additional information from the container that is not available in ```hyperparameters```.\n", "SageMaker containers expose this information as **environment variables** that are available inside the script.\n", "\n", "For example, the script above can read the location of the `training` channel provided in the training job request by using the environment variable `SM_CHANNEL_TRAINING` as the default value for the `--data_dir` argument:\n", "\n", "```python\n", "if __name__ == '__main__':\n", "    parser = argparse.ArgumentParser()\n", "    # reads the location of the training input channel from the environment variable\n", "    parser.add_argument('--data_dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])\n", "```\n", "\n", "Script mode displays the list of available environment variables in the training logs. You can find the [entire list here](https://github.com/aws/sagemaker-containers/blob/master/README.rst#list-of-provided-environment-variables-by-sagemaker-containers)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training in SageMaker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you test the training job locally, upload the dataset to an S3 bucket so SageMaker can access the data during training:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "\n", "inputs = sagemaker.Session().upload_data(path=\"sherlock\", key_prefix=\"datasets/sherlock\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned variable `inputs` is a string with an S3 location that SageMaker Training has permission\n", "to read data from." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To train in SageMaker:\n", "- change the estimator argument `instance_type` to any SageMaker ML instance type available for training.\n", "- set the `training` channel to an S3 location." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator = TensorFlow(\n", "    entry_point=\"train.py\",\n", "    source_dir=\"char-rnn-tensorflow\",\n", "    git_config=git_config,\n", "    instance_type=\"ml.c4.xlarge\",  # Executes training in a ml.c4.xlarge instance\n", "    instance_count=1,\n", "    hyperparameters=hyperparameters,\n", "    role=sagemaker.get_execution_role(),\n", "    framework_version=\"1.15.2\",\n", "    py_version=\"py3\",\n", ")\n", "\n", "\n", "estimator.fit({\"training\": inputs})" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This us-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-python-sdk|tensorflow_script_mode_quickstart|tensorflow_script_mode_quickstart.ipynb)\n" ] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow2_p38", "language": "python", "name": "conda_tensorflow2_p38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }