{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Build a Custom Training Container and Debug Training Jobs with Amazon SageMaker Debugger"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Amazon SageMaker Debugger enables you to debug your model through its built-in rules and tools (`smdebug` hook and core features) to store and retrieve output tensors in Amazon Simple Storage Service (S3). \n",
"To run your customized machine learning/deep learning (ML/DL) models, build a customized training container and push it to Amazon Elastic Container Registry (ECR). \n",
"You can use SageMaker Debugger with training jobs that run on Amazon EC2 instances and take advantage of its built-in functionality.\n",
"\n",
"You can bring your own model customized with state-of-the-art ML/DL frameworks, such as TensorFlow, PyTorch, MXNet, and XGBoost. \n",
"You can also use your Docker base image or AWS Deep Learning Container base images to build a custom training container.\n",
"To run and debug your training script using SageMaker Debugger, you need to register the Debugger hook to the script.\n",
"Using the `smdebug` trial feature, you can retrieve the output tensors and visualize them for analysis.\n",
"\n",
"By monitoring the output tensors, the Debugger rules detect training issues and set the rule job status to `IssuesFound`. \n",
"The rule job status also reports the step or epoch at which the training job started having the issue. \n",
"You can send this status to Amazon CloudWatch and AWS Lambda to stop the training job when a Debugger rule triggers the `IssuesFound` status.\n",
"\n",
"The workflow is as follows:\n",
"\n",
"- [Step 1: Prepare prerequisites](#step1)\n",
"- [Step 2: Prepare a Dockerfile and register the Debugger hook to your training script](#step2)\n",
"- [Step 3: Create a Docker image, build the Docker training container, and push to Amazon ECR](#step3)\n",
"- [Step 4: Use Amazon SageMaker to set the Debugger hook and rule configuration](#step4)\n",
"- [Step 5: Define a SageMaker Estimator object with Debugger and initiate a training job](#step5)\n",
"- [Step 6: Retrieve output tensors using the smdebug trials class](#step6)\n",
"- [Step 7: Analyze the training job using the smdebug trial methods and rule job status](#step7)\n",
"\n",
"**Important:** You can run this notebook only on SageMaker Notebook instances. You cannot run it in SageMaker Studio, because Studio does not support building Docker containers."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 1: Prepare prerequisites\n",
"\n",
"### Install the SageMaker Python SDK v1 and the smdebug library\n",
"\n",
"This notebook runs on version 1.72.0 of the SageMaker Python SDK (v1) and the latest version of the `smdebug` client library. If you want to use a different version, specify the version number for installation. For example, `pip install sagemaker==x.xx.0`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"!{sys.executable} -m pip install \"sagemaker==1.72.0\" smdebug"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [Optional Step] Restart the kernel to apply the update"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If you are using **Jupyter Notebook**, the previous cell automatically installs and updates the libraries. If you are using **JupyterLab**, you have to manually choose the \"Restart Kernel\" under the **Kernel** tab in the top menu bar.\n",
"\n",
"Check the SageMaker Python SDK version by running the following cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"\n",
"sagemaker.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 2: Prepare a Dockerfile and register the Debugger hook to your training script\n",
"\n",
"Put your **Dockerfile** and training script (**tf_keras_resnet_byoc.py** in this case) in the **docker** folder. Specify the location of the training script in the `COPY` and `ENV` lines of the **Dockerfile**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare a Dockerfile\n",
"\n",
"The following cell prints the **Dockerfile** in the **docker** folder. You must install the `sagemaker-training` and `smdebug` libraries to fully access the SageMaker Debugger features."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[34mFROM\u001b[39;49;00m \u001b[33mtensorflow/tensorflow:2.2.0rc2-py3-jupyter\u001b[39;49;00m\r\n",
"\r\n",
"\u001b[37m# Install Amazon SageMaker training toolkit and smdebug libraries\u001b[39;49;00m\r\n",
"\u001b[34mRUN\u001b[39;49;00m pip install sagemaker-training\r\n",
"\u001b[34mRUN\u001b[39;49;00m pip install smdebug\r\n",
"\r\n",
"\u001b[37m# Copies the training code inside the container\u001b[39;49;00m\r\n",
"\u001b[34mCOPY\u001b[39;49;00m tf_keras_resnet_byoc.py /opt/ml/code/tf_keras_resnet_byoc.py\r\n",
"\r\n",
"\u001b[37m# Defines train.py as script entrypoint\u001b[39;49;00m\r\n",
"\u001b[34mENV\u001b[39;49;00m SAGEMAKER_PROGRAM tf_keras_resnet_byoc.py\r\n"
]
}
],
"source": [
"! pygmentize docker/Dockerfile"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prepare a training script\n",
"\n",
"The following cell prints an example training script **tf_keras_resnet_byoc.py** in the **docker** folder. To register the Debugger hook, you need to use the Debugger client library `smdebug`. \n",
"\n",
"In the `main` function, a Keras hook is registered after the line where the `model` object is defined and before the line where the `model.compile()` function is called. \n",
"\n",
"In the `train` function, you pass the Keras hook and set it as a Keras callback for the `model.fit()` function. The `hook.save_scalar()` method saves scalar parameters for the mini-batch settings, such as the number of epochs, the batch size, and the number of steps per epoch in training and validation modes."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\"\"\"\u001b[39;49;00m\r\n",
"\u001b[33mThis script is a ResNet training script which uses Tensorflow's Keras interface, and provides an example of how to use SageMaker Debugger when you use your own custom container in SageMaker or your own script outside SageMaker.\u001b[39;49;00m\r\n",
"\u001b[33mIt has been orchestrated with SageMaker Debugger hooks to allow saving tensors during training.\u001b[39;49;00m\r\n",
"\u001b[33mThese hooks have been instrumented to read from a JSON configuration that SageMaker puts in the training container.\u001b[39;49;00m\r\n",
"\u001b[33mConfiguration provided to the SageMaker python SDK when creating a job will be passed on to the hook.\u001b[39;49;00m\r\n",
"\u001b[33mThis allows you to use the same script with different configurations across different runs.\u001b[39;49;00m\r\n",
"\u001b[33m\u001b[39;49;00m\r\n",
"\u001b[33mIf you use an official SageMaker Framework container (i.e. AWS Deep Learning Container), you do not have to orchestrate your script as below. Hooks are automatically added in those environments. This experience is called a \"zero script change\". For more information, see https://github.com/awslabs/sagemaker-debugger/blob/master/docs/sagemaker.md#zero-script-change. An example of the same is provided at https://github.com/awslabs/amazon-sagemaker-examples/sagemaker-debugger/tensorflow2/tensorflow2_zero_code_change.\u001b[39;49;00m\r\n",
"\u001b[33m\"\"\"\u001b[39;49;00m\r\n",
"\r\n",
"\u001b[37m# Standard Library\u001b[39;49;00m\r\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36margparse\u001b[39;49;00m\r\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mrandom\u001b[39;49;00m\r\n",
"\r\n",
"\u001b[37m# Third Party\u001b[39;49;00m\r\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mnumpy\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mnp\u001b[39;49;00m\r\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mcompat\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mv2\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36mtf\u001b[39;49;00m\r\n",
"\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mkeras\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mapplications\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mresnet50\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m ResNet50\r\n",
"\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mkeras\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mdatasets\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m cifar10\r\n",
"\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36mtensorflow\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mkeras\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mutils\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m to_categorical\r\n",
"\r\n",
"\u001b[37m# smdebug modification: Import smdebug support for Tensorflow\u001b[39;49;00m\r\n",
"\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36msmdebug\u001b[39;49;00m\u001b[04m\u001b[36m.\u001b[39;49;00m\u001b[04m\u001b[36mtensorflow\u001b[39;49;00m \u001b[34mas\u001b[39;49;00m \u001b[04m\u001b[36msmd\u001b[39;49;00m\r\n",
"\r\n",
"\r\n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mtrain\u001b[39;49;00m(batch_size, epoch, model, hook):\r\n",
" (X_train, y_train), (X_valid, y_valid) = cifar10.load_data()\r\n",
"\r\n",
" Y_train = to_categorical(y_train, \u001b[34m10\u001b[39;49;00m)\r\n",
" Y_valid = to_categorical(y_valid, \u001b[34m10\u001b[39;49;00m)\r\n",
"\r\n",
" X_train = X_train.astype(\u001b[33m'\u001b[39;49;00m\u001b[33mfloat32\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\r\n",
" X_valid = X_valid.astype(\u001b[33m'\u001b[39;49;00m\u001b[33mfloat32\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\r\n",
"\r\n",
" mean_image = np.mean(X_train, axis=\u001b[34m0\u001b[39;49;00m)\r\n",
" X_train -= mean_image\r\n",
" X_valid -= mean_image\r\n",
" X_train /= \u001b[34m128.\u001b[39;49;00m\r\n",
" X_valid /= \u001b[34m128.\u001b[39;49;00m\r\n",
" \r\n",
" \u001b[37m# register hook to save the following scalar values\u001b[39;49;00m\r\n",
" hook.save_scalar(\u001b[33m\"\u001b[39;49;00m\u001b[33mepoch\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, epoch)\r\n",
" hook.save_scalar(\u001b[33m\"\u001b[39;49;00m\u001b[33mbatch_size\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, batch_size)\r\n",
" hook.save_scalar(\u001b[33m\"\u001b[39;49;00m\u001b[33mtrain_steps_per_epoch\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mlen\u001b[39;49;00m(X_train)/batch_size)\r\n",
" hook.save_scalar(\u001b[33m\"\u001b[39;49;00m\u001b[33mvalid_steps_per_epoch\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mlen\u001b[39;49;00m(X_valid)/batch_size)\r\n",
" \r\n",
" model.fit(X_train, Y_train,\r\n",
" batch_size=batch_size,\r\n",
" epochs=epoch,\r\n",
" validation_data=(X_valid, Y_valid),\r\n",
" shuffle=\u001b[34mFalse\u001b[39;49;00m,\r\n",
" \u001b[37m# smdebug modification: Pass the hook as a Keras callback\u001b[39;49;00m\r\n",
" callbacks=[hook])\r\n",
"\r\n",
"\r\n",
"\u001b[34mdef\u001b[39;49;00m \u001b[32mmain\u001b[39;49;00m():\r\n",
" parser = argparse.ArgumentParser(description=\u001b[33m\"\u001b[39;49;00m\u001b[33mTrain resnet50 cifar10\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\r\n",
" parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--batch_size\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=\u001b[34m50\u001b[39;49;00m)\r\n",
" parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--epoch\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mint\u001b[39;49;00m, default=\u001b[34m15\u001b[39;49;00m)\r\n",
" parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--model_dir\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mstr\u001b[39;49;00m, default=\u001b[33m\"\u001b[39;49;00m\u001b[33m./model_keras_resnet\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\r\n",
" parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--lr\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mfloat\u001b[39;49;00m, default=\u001b[34m0.001\u001b[39;49;00m)\r\n",
" parser.add_argument(\u001b[33m\"\u001b[39;49;00m\u001b[33m--random_seed\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m, \u001b[36mtype\u001b[39;49;00m=\u001b[36mbool\u001b[39;49;00m, default=\u001b[34mFalse\u001b[39;49;00m)\r\n",
" \r\n",
" args = parser.parse_args()\r\n",
"\r\n",
" \u001b[34mif\u001b[39;49;00m args.random_seed:\r\n",
" tf.random.set_seed(\u001b[34m2\u001b[39;49;00m)\r\n",
" np.random.seed(\u001b[34m2\u001b[39;49;00m)\r\n",
" random.seed(\u001b[34m12\u001b[39;49;00m)\r\n",
"\r\n",
" \r\n",
" mirrored_strategy = tf.distribute.MirroredStrategy()\r\n",
" \u001b[34mwith\u001b[39;49;00m mirrored_strategy.scope():\r\n",
" \r\n",
" model = ResNet50(weights=\u001b[34mNone\u001b[39;49;00m, input_shape=(\u001b[34m32\u001b[39;49;00m,\u001b[34m32\u001b[39;49;00m,\u001b[34m3\u001b[39;49;00m), classes=\u001b[34m10\u001b[39;49;00m)\r\n",
"\r\n",
" \u001b[37m# smdebug modification:\u001b[39;49;00m\r\n",
" \u001b[37m# Create hook from the configuration provided through sagemaker python sdk.\u001b[39;49;00m\r\n",
" \u001b[37m# This configuration is provided in the form of a JSON file.\u001b[39;49;00m\r\n",
" \u001b[37m# Default JSON configuration file:\u001b[39;49;00m\r\n",
" \u001b[37m# {\u001b[39;49;00m\r\n",
" \u001b[37m# \"LocalPath\": \u001b[39;49;00m\r\n",
" \u001b[37m# }\"\u001b[39;49;00m\r\n",
" \u001b[37m# Alternatively, you could pass custom debugger configuration (using DebuggerHookConfig)\u001b[39;49;00m\r\n",
" \u001b[37m# through SageMaker Estimator. For more information, https://github.com/aws/sagemaker-python-sdk/blob/master/doc/amazon_sagemaker_debugger.rst\u001b[39;49;00m\r\n",
" hook = smd.KerasHook.create_from_json_file()\r\n",
"\r\n",
" opt = tf.keras.optimizers.Adam(learning_rate=args.lr)\r\n",
" model.compile(loss=\u001b[33m'\u001b[39;49;00m\u001b[33mcategorical_crossentropy\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m,\r\n",
" optimizer=opt,\r\n",
" metrics=[\u001b[33m'\u001b[39;49;00m\u001b[33maccuracy\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m])\r\n",
"\r\n",
" \u001b[37m# start the training.\u001b[39;49;00m\r\n",
" train(args.batch_size, args.epoch, model, hook)\r\n",
"\r\n",
"\u001b[34mif\u001b[39;49;00m \u001b[31m__name__\u001b[39;49;00m == \u001b[33m\"\u001b[39;49;00m\u001b[33m__main__\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m:\r\n",
" main()\r\n"
]
}
],
"source": [
"! pygmentize docker/tf_keras_resnet_byoc.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 3: Create a Docker image, build the Docker training container, and push to Amazon ECR"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a Docker image\n",
"\n",
"The AWS SDK for Python (Boto3) can look up your account ID and Region, which you need to construct the Docker image URI."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"\n",
"account_id = boto3.client(\"sts\").get_caller_identity().get(\"Account\")\n",
"ecr_repository = \"sagemaker-debugger-mnist-byoc-tf2\"\n",
"tag = \":latest\"\n",
"\n",
"region = boto3.session.Session().region_name\n",
"\n",
"uri_suffix = \"amazonaws.com\"\n",
"if region in [\"cn-north-1\", \"cn-northwest-1\"]:\n",
"    uri_suffix = \"amazonaws.com.cn\"\n",
"byoc_image_uri = \"{}.dkr.ecr.{}.{}/{}\".format(account_id, region, uri_suffix, ecr_repository + tag)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the image URI address."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"byoc_image_uri"
]
},
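{
"cell_type": "markdown",
"metadata": {},
"source": [
"The image URI built above follows the pattern `<account-id>.dkr.ecr.<region>.<suffix>/<repository>:<tag>`. As a minimal sketch, the same string assembly can be wrapped in a function so that it can be checked without AWS credentials. The function name `build_ecr_image_uri` is a hypothetical helper for illustration and is not part of Boto3 or the SageMaker Python SDK; the account ID is a placeholder.\n",
"\n",
"```python\n",
"def build_ecr_image_uri(account_id, region, repository, tag=\"latest\"):\n",
"    # Hypothetical helper: the China Regions use a different DNS suffix.\n",
"    china_regions = (\"cn-north-1\", \"cn-northwest-1\")\n",
"    suffix = \"amazonaws.com.cn\" if region in china_regions else \"amazonaws.com\"\n",
"    return \"{}.dkr.ecr.{}.{}/{}:{}\".format(account_id, region, suffix, repository, tag)\n",
"\n",
"\n",
"# 111122223333 is a placeholder account ID, not a real account.\n",
"example_uri = build_ecr_image_uri(\n",
"    \"111122223333\", \"us-east-1\", \"sagemaker-debugger-mnist-byoc-tf2\"\n",
")\n",
"print(example_uri)\n",
"```\n",
"\n",
"Passing the tag without the leading colon keeps the separator in one place, which may make the helper easier to reuse for other tags."
]
},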
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### [Optional Step] Log in to access the Deep Learning Containers image repository\n",
"\n",
"If you use one of the AWS Deep Learning Container base images, uncomment the following cell and run it to log in to the image repository."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ! aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Build the Docker container and push it to Amazon ECR\n",
"\n",
"The following code cell builds a Docker image from the Dockerfile, creates an Amazon ECR repository, and pushes the image to the ECR repository."
]
},
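{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Build the image from the Dockerfile in the docker folder\n",
"!docker build -t $ecr_repository docker\n",
"# Authenticate Docker to your ECR registry (`aws ecr get-login` was removed in AWS CLI v2)\n",
"!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.{uri_suffix}\n",
"!aws ecr create-repository --repository-name $ecr_repository\n",
"!docker tag {ecr_repository + tag} $byoc_image_uri\n",
"!docker push $byoc_image_uri"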
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If this returns a permission error, see [Get Started with Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/build-container-to-train-script-get-started.html#byoc-training-step5) in the Amazon SageMaker Developer Guide. Follow the note in Step 5 to attach the **AmazonEC2ContainerRegistryFullAccess** policy to your IAM role."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 4: Use Amazon SageMaker to set the Debugger hook and rule configuration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Define Debugger hook configuration\n",
"\n",
"Now you have the custom training container with the Debugger hook registered to your training script. In this section, you import the SageMaker Debugger classes `DebuggerHookConfig` and `CollectionConfig` to define the hook configuration. You can choose Debugger pre-configured tensor collections, adjust `save_interval` parameters, or configure custom collections.\n",
"\n",
"In the following notebook cell, the `hook_config` object is configured with the pre-configured tensor collection `losses`. This saves the tensor outputs to the default S3 bucket. At the end of this notebook, you retrieve the `loss` values to plot the overfitting problem that the example training job experiences."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"from sagemaker.debugger import DebuggerHookConfig, CollectionConfig\n",
"\n",
"sagemaker_session = sagemaker.Session()\n",
"\n",
"train_save_interval = 100\n",
"eval_save_interval = 10\n",
"\n",
"hook_config = DebuggerHookConfig(\n",
"    collection_configs=[\n",
"        CollectionConfig(\n",
"            name=\"losses\",\n",
"            parameters={\n",
"                \"train.save_interval\": str(train_save_interval),\n",
"                \"eval.save_interval\": str(eval_save_interval),\n",
"            },\n",
"        )\n",
"    ]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Select Debugger built-in rules\n",
"\n",
"The following cell shows how to directly use the Debugger built-in rules. The maximum number of rules you can run in parallel is 20."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.debugger import Rule, rule_configs\n",
"\n",
"rules = [\n",
"    Rule.sagemaker(rule_configs.vanishing_gradient()),\n",
"    Rule.sagemaker(rule_configs.overfit()),\n",
"    Rule.sagemaker(rule_configs.overtraining()),\n",
"    Rule.sagemaker(rule_configs.saturated_activation()),\n",
"    Rule.sagemaker(rule_configs.weight_update_ratio()),\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 5: Define a SageMaker Estimator object with Debugger and initiate a training job\n",
"\n",
"Construct a SageMaker Estimator using the image URI of the custom training container you created in **Step 3**.\n",
"\n",
"**Note:** This example uses the SageMaker Python SDK v1. If you want to use the SageMaker Python SDK v2, you need to change the parameter names: for example, `image_name`, `train_instance_count`, and `train_instance_type` become `image_uri`, `instance_count`, and `instance_type`. You can find the SageMaker Estimator parameters at [Get Started with Custom Training Containers](https://docs.aws.amazon.com/sagemaker/latest/dg/build-container-to-train-script-get-started.html#byoc-training-step5) in the Amazon SageMaker Developer Guide or at [the SageMaker Estimator API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) in the SageMaker Python SDK documentation (select one of the older documentation versions to see the v1 parameter names used here)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.estimator import Estimator\n",
"from sagemaker import get_execution_role\n",
"\n",
"role = get_execution_role()\n",
"\n",
"estimator = Estimator(\n",
"    image_name=byoc_image_uri,\n",
"    role=role,\n",
"    train_instance_count=1,\n",
"    train_instance_type=\"ml.p3.16xlarge\",\n",
"    # Debugger-specific parameters\n",
"    rules=rules,\n",
"    debugger_hook_config=hook_config,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initiate the training job in the background\n",
"\n",
"With the `wait=False` option, the `estimator.fit()` function will run the training job in the background. You can proceed to the next cells. If you want to see logs in real time, go to the [CloudWatch console](https://console.aws.amazon.com/cloudwatch/home), choose **Log Groups** in the left navigation pane, and choose **/aws/sagemaker/TrainingJobs** for training job logs and **/aws/sagemaker/ProcessingJobs** for Debugger rule job logs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"estimator.fit(wait=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print the training job name\n",
"\n",
"The following cell prints the name of the training job running in the background and retrieves its description."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"job_name = estimator.latest_training_job.name\n",
"print(\"Training job name: {}\".format(job_name))\n",
"\n",
"client = estimator.sagemaker_session.sagemaker_client\n",
"\n",
"description = client.describe_training_job(TrainingJobName=job_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Output the current job status\n",
"\n",
"The following cell tracks the status of the training job until the `SecondaryStatus` changes to `Training`. While training, Debugger collects output tensors from the training job and monitors the training job with the rules. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"# Poll until training starts; also stop on terminal states so a failed job does not loop forever\n",
"if description[\"TrainingJobStatus\"] != \"Completed\":\n",
"    while description[\"SecondaryStatus\"] not in {\"Training\", \"Completed\", \"Failed\", \"Stopped\"}:\n",
"        description = client.describe_training_job(TrainingJobName=job_name)\n",
"        primary_status = description[\"TrainingJobStatus\"]\n",
"        secondary_status = description[\"SecondaryStatus\"]\n",
"        print(\n",
"            \"Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]\".format(\n",
"                primary_status, secondary_status\n",
"            )\n",
"        )\n",
"        time.sleep(15)"
]
},
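{
"cell_type": "markdown",
"metadata": {},
"source": [
"The polling cell above waits on `SecondaryStatus` without an upper bound on the waiting time. The following is a minimal sketch of the same idea with an explicit timeout. The function `wait_for_secondary_status` and its parameters are hypothetical names introduced for illustration; the `describe` argument is assumed to be any zero-argument callable that returns a dictionary shaped like the `describe_training_job` response. A stand-in callable is used here so that the sketch runs without an AWS account.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"\n",
"def wait_for_secondary_status(describe, targets, poll_seconds=15, timeout_seconds=3600):\n",
"    # Poll `describe()` until its SecondaryStatus is in `targets` or the timeout expires.\n",
"    deadline = time.monotonic() + timeout_seconds\n",
"    while True:\n",
"        status = describe()[\"SecondaryStatus\"]\n",
"        if status in targets:\n",
"            return status\n",
"        if time.monotonic() >= deadline:\n",
"            raise TimeoutError(\"Timed out; last status: {}\".format(status))\n",
"        time.sleep(poll_seconds)\n",
"\n",
"\n",
"# Stand-in for the real API call, so the sketch runs offline.\n",
"fake_statuses = iter([\"Starting\", \"Downloading\", \"Training\"])\n",
"result = wait_for_secondary_status(\n",
"    lambda: {\"SecondaryStatus\": next(fake_statuses)},\n",
"    targets={\"Training\", \"Completed\", \"Failed\", \"Stopped\"},\n",
"    poll_seconds=0,\n",
")\n",
"print(result)\n",
"```\n",
"\n",
"With the objects defined in this notebook, the callable could be `lambda: client.describe_training_job(TrainingJobName=job_name)`."
]
},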
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 6: Retrieve output tensors using the smdebug trials class\n",
"\n",
"### Call the latest Debugger artifact and create a smdebug trial"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following smdebug `trial` object reads the output tensors once they become available in the default S3 bucket. You can use the `estimator.latest_job_debugger_artifacts_path()` method to automatically look up the S3 path that the running training job writes its Debugger output to. \n",
"\n",
"Once the tensors are available in the default S3 bucket, you can plot the loss curve in the next sections."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from smdebug.trials import create_trial\n",
"\n",
"trial = create_trial(estimator.latest_job_debugger_artifacts_path())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If you want to revisit tensor data from a previous training job that has already finished, you can retrieve it by passing the exact S3 location to `create_trial`. The S3 path has a form similar to the following sample: `trial = create_trial(\"s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-mnist-byoc-tf2-2020-08-27-05-49-34-037/debug-output\")`. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print the hyperparameter configuration saved as scalar values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trial.tensor_names(regex=\"scalar\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Print the size of the `steps` list to check the training progress"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from smdebug.core.modes import ModeKeys\n",
"\n",
"len(trial.tensor(\"loss\").steps(mode=ModeKeys.TRAIN))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(trial.tensor(\"loss\").steps(mode=ModeKeys.EVAL))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Step 7: Analyze the training job using the smdebug `trial` methods and the Debugger rule job status"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot training and validation loss curves in real time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell retrieves the `loss` tensor from training and evaluation mode and plots the loss curves. \n",
"\n",
"In this notebook example, the dataset is CIFAR-10, which consists of 50,000 32x32 color training images and 10,000 test images, labeled over 10 categories. (See the [TensorFlow Keras Datasets cifar10 load data documentation](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/cifar10/load_data) for more details.) In the Debugger configuration step (Step 4), the save interval was set to 100 for training mode and 10 for evaluation mode. Since the training script uses its default batch size of 50, there are 1,000 training steps and 200 validation steps in each epoch. \n",
"\n",
"The following cell reads the mini-batch parameters saved by `smdebug`, computes the average loss in each epoch, and renders the loss curves in a single plot. \n",
"\n",
"As the training job proceeds, you can observe that the validation loss curve starts deviating from the training loss curve, which is a clear indication of an overfitting problem."
]
},
{
"cell_type": "code",
"execution_count": 400,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Retrieve the loss tensors collected in training mode\n",
"y = []\n",
"for step in trial.tensor(\"loss\").steps(mode=ModeKeys.TRAIN):\n",
"    y.append(trial.tensor(\"loss\").value(step, mode=ModeKeys.TRAIN)[0])\n",
"y = np.asarray(y)\n",
"\n",
"# Retrieve the loss tensors collected in evaluation mode\n",
"y_val = []\n",
"for step in trial.tensor(\"loss\").steps(mode=ModeKeys.EVAL):\n",
"    y_val.append(trial.tensor(\"loss\").value(step, mode=ModeKeys.EVAL)[0])\n",
"y_val = np.asarray(y_val)\n",
"\n",
"# Number of saved loss values per epoch in each mode\n",
"train_save_points = int(\n",
"    trial.tensor(\"scalar/train_steps_per_epoch\").value(0)[0] / train_save_interval\n",
")\n",
"val_save_points = int(trial.tensor(\"scalar/valid_steps_per_epoch\").value(0)[0] / eval_save_interval)\n",
"\n",
"# Average the saved training loss values within each epoch\n",
"y_mean = []\n",
"x_epoch = []\n",
"for e in range(int(trial.tensor(\"scalar/epoch\").value(0)[0])):\n",
"    ei = e * train_save_points\n",
"    ef = (e + 1) * train_save_points  # slice end is exclusive\n",
"    y_mean.append(np.mean(y[ei:ef]))\n",
"    x_epoch.append(e)\n",
"\n",
"# Average the saved validation loss values within each epoch\n",
"y_val_mean = []\n",
"for e in range(int(trial.tensor(\"scalar/epoch\").value(0)[0])):\n",
"    ei = e * val_save_points\n",
"    ef = (e + 1) * val_save_points  # slice end is exclusive\n",
"    y_val_mean.append(np.mean(y_val[ei:ef]))\n",
"\n",
"plt.plot(x_epoch, y_mean, label=\"Training Loss\")\n",
"plt.plot(x_epoch, y_val_mean, label=\"Validation Loss\")\n",
"\n",
"plt.legend(bbox_to_anchor=(1.04, 1), loc=\"upper left\")\n",
"plt.xlabel(\"Epoch\")\n",
"plt.ylabel(\"Loss\")\n",
"plt.show()"
]
},
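{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-epoch bookkeeping used for the plot can be checked with plain arithmetic. The following sketch assumes the CIFAR-10 split sizes (50,000 training and 10,000 test images), the training script's default batch size of 50, and the save intervals from Step 4. The helper `save_points_per_epoch` is a hypothetical name introduced here for illustration and is not part of `smdebug`.\n",
"\n",
"```python\n",
"def save_points_per_epoch(num_samples, batch_size, save_interval):\n",
"    # Number of mini-batch steps needed to pass over the data once.\n",
"    steps_per_epoch = num_samples // batch_size\n",
"    # Number of loss values saved per epoch at the given save interval.\n",
"    return steps_per_epoch, steps_per_epoch // save_interval\n",
"\n",
"\n",
"train_steps, train_points = save_points_per_epoch(50000, 50, 100)\n",
"valid_steps, valid_points = save_points_per_epoch(10000, 50, 10)\n",
"print(train_steps, train_points)  # 1000 10\n",
"print(valid_steps, valid_points)  # 200 20\n",
"```\n",
"\n",
"Under these assumptions, each point on the training curve averages 10 saved loss values and each point on the validation curve averages 20."
]
},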
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check the rule job summary\n",
"\n",
"The following cell returns the Debugger rule job summary. In this example notebook, we used five built-in rules: `VanishingGradient`, `Overfit`, `Overtraining`, `SaturatedActivation`, and `WeightUpdateRatio`. For more information about what each rule evaluates on the ongoing training job, see the [List of Debugger built-in rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) documentation in the Amazon SageMaker Developer Guide. Define the following `rule_status` object to retrieve Debugger rule job summaries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rule_status = estimator.latest_training_job.rule_job_summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following cells, you can print the Debugger rule job summaries and the latest logs. The outputs are in the following format:\n",
"\n",
"```\n",
"{'RuleConfigurationName': 'Overfit',\n",
" 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:111122223333:processing-job/sagemaker-debugger-mnist-b-overfit-e841d0bf',\n",
" 'RuleEvaluationStatus': 'IssuesFound',\n",
" 'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule Overfit at step 7200 resulted in the condition being met\\n',\n",
" 'LastModifiedTime': datetime.datetime(2020, 8, 27, 18, 17, 4, 789000, tzinfo=tzlocal())}\n",
"```\n",
"\n",
"The `Overfit` rule job summary above is an actual output example of the training job in this notebook. It changes `RuleEvaluationStatus` to the `IssuesFound` status when it reaches the global step 7200 (in the 6th epoch). The `Overfit` rule determines whether the training job has an overfitting issue based on its criteria. By default, the rule reports an overfitting issue when the training loss and the validation loss deviate by at least 10 percent.\n",
"\n",
"Another issue that the training job has is the `WeightUpdateRatio` issue at the global step 500 in the first epoch, as shown in the following log.\n",
"\n",
"```\n",
"{'RuleConfigurationName': 'WeightUpdateRatio',\n",
" 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:111122223333:processing-job/sagemaker-debugger-mnist-b-weightupdateratio-e9c353fe',\n",
" 'RuleEvaluationStatus': 'IssuesFound',\n",
" 'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule WeightUpdateRatio at step 500 resulted in the condition being met\\n',\n",
" 'LastModifiedTime': datetime.datetime(2020, 8, 27, 18, 17, 4, 789000, tzinfo=tzlocal())}\n",
"```\n",
"\n",
"This rule monitors the weight update ratio between two consecutive global steps and determines if it is too small (less than 0.00000001) or too large (above 10). In other words, this rule can identify whether the weight parameters are updated abnormally during the forward and backward pass in each step, which prevents the model from starting to converge and improve.\n",
"\n",
"Taken together, the two issues indicate that the model is not well set up to improve from the early stage of training. \n",
"\n",
"Run the following cells to track the rule job summaries.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`VanishingGradient` rule job summary**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rule_status[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`Overfit` rule job summary**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rule_status[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`Overtraining` rule job summary**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rule_status[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`SaturatedActivation` rule job summary**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rule_status[3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**`WeightUpdateRatio` rule job summary**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rule_status[4]"
]
},
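{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells above read each summary by its position in the `rule_status` list, which depends on the order in which the rules were defined. As a small sketch, the summaries can instead be looked up by rule name. The helper names `summaries_by_rule` and `rules_with_issues` are hypothetical, and the sample list below is trimmed, illustrative data in the output format shown earlier rather than real job output.\n",
"\n",
"```python\n",
"def summaries_by_rule(rule_status):\n",
"    # Index the rule job summaries by their rule configuration name.\n",
"    return {summary[\"RuleConfigurationName\"]: summary for summary in rule_status}\n",
"\n",
"\n",
"def rules_with_issues(rule_status):\n",
"    # Names of the rules whose evaluation status is IssuesFound, in list order.\n",
"    return [\n",
"        summary[\"RuleConfigurationName\"]\n",
"        for summary in rule_status\n",
"        if summary[\"RuleEvaluationStatus\"] == \"IssuesFound\"\n",
"    ]\n",
"\n",
"\n",
"# Illustrative sample only; real summaries contain more keys.\n",
"sample_status = [\n",
"    {\"RuleConfigurationName\": \"VanishingGradient\", \"RuleEvaluationStatus\": \"NoIssuesFound\"},\n",
"    {\"RuleConfigurationName\": \"Overfit\", \"RuleEvaluationStatus\": \"IssuesFound\"},\n",
"    {\"RuleConfigurationName\": \"WeightUpdateRatio\", \"RuleEvaluationStatus\": \"IssuesFound\"},\n",
"]\n",
"print(rules_with_issues(sample_status))\n",
"print(summaries_by_rule(sample_status)[\"Overfit\"][\"RuleEvaluationStatus\"])\n",
"```\n",
"\n",
"In this notebook, the same helpers could be applied to the `rule_status` object returned by `rule_job_summary()`."
]
},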
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Notebook Summary and Other Applications\n",
"\n",
"This notebook showed how you can gain insights into training jobs by using SageMaker Debugger for any of your models running in a customized training container. The AWS cloud infrastructure, the SageMaker ecosystem, and the SageMaker Debugger tools make the debugging process more convenient and transparent. The Debugger rule's `RuleEvaluationStatus` can be further connected to Amazon CloudWatch Events and AWS Lambda functions to take automatic actions, such as stopping training jobs once issues are detected. A sample notebook that combines Debugger, CloudWatch, and Lambda is provided at [Amazon SageMaker Debugger - Reacting to CloudWatch Events from Rules](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/tensorflow_action_on_rule/tf-mnist-stop-training-job.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}