{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "de7e4dca-3653-4bf6-80a0-d964492d1d91",
"metadata": {},
"source": [
"# Track an experiment while training a Pytorch model with a SageMaker Training Job\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c8b50aaf",
"metadata": {},
"source": [
"---\n",
"\n",
"This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
"\n",
"\n",
"\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fd487ee3",
"metadata": {},
"source": [
"\n",
"This notebook shows how you can use the SageMaker SDK to track a Machine Learning experiment using a Pytorch model trained in a SageMaker Training Job with Script mode, where you will provide the model script file.\n",
"\n",
"We introduce two concepts in this notebook -\n",
"\n",
"* *Experiment:* An experiment is a collection of runs. When you initialize a run in your training loop, you include the name of the experiment that the run belongs to. Experiment names must be unique within your AWS account. \n",
"* *Run:* A run consists of all the inputs, parameters, configurations, and results for one iteration of model training. Initialize an experiment run for tracking a training job with Run(). \n",
"\n",
"\n",
"To execute this notebook in SageMaker Studio, you should select the `PyTorch 1.12 Python 3.8 CPU Optimizer` image.\n",
"\n",
"\n",
"You can track artifacts for experiments, including datasets, algorithms, hyperparameters and metrics. Experiments executed on SageMaker such as SageMaker training jobs are automatically tracked and any existen SageMaker experiment on your AWS account is automatically migrated to the new UI version.\n",
"\n",
"In this notebook we will demonstrate the capabilities through an MNIST handwritten digits classification example. The notebook is organized as follow:\n",
"\n",
"1. Train a Convolutional Neural Network (CNN) Model and log the model training metrics\n",
"1. Tune the hyperparameters that configures the number of hidden channels and the optimized in the model. Track teh parameter's configuration, resulting model loss and accuracy and automatically plot a confusion matrix using the Experiments capabilities of the SageMaker SDK.\n",
"1. Analyse your model results and plot graphs comparing your model different runs generated from the tunning step 3.\n",
"\n",
"## Runtime\n",
"This notebook takes approximately 45 minutes to run.\n",
"\n",
"## Contents\n",
"1. [Install modules](#Install-modules)\n",
"1. [Setup](#Setup)\n",
"1. [Create model training script](#Create-model-training-script)\n",
"1. [Train model with Run context](#Train-model-with-Run-context)\n",
"1. [Contact](#Contact)"
]
},
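{
"attachments": {},
"cell_type": "markdown",
"id": "b7c2e1a4-5d3f-4e8a-9c10-2f6a7d8e9b01",
"metadata": {},
"source": [
"As a rough illustration of the two concepts above, the sketch below opens a run inside an experiment and logs one parameter and one metric. The helper name `track_example_run` and the logged names and values are placeholders chosen for this sketch. It assumes the SageMaker Python SDK is installed and AWS credentials are configured, and it is not executed as part of this notebook.\n",
"\n",
"```python\n",
"def track_example_run(experiment_name, run_name):\n",
"    # Hypothetical helper for illustration only.\n",
"    # Imported lazily so that defining the helper does not require the SDK.\n",
"    from sagemaker.experiments.run import Run\n",
"\n",
"    # The run is active inside the context manager and closed on exit.\n",
"    with Run(experiment_name=experiment_name, run_name=run_name) as run:\n",
"        run.log_parameter(\"optimizer\", \"adam\")  # placeholder parameter\n",
"        run.log_metric(name=\"Test:accuracy\", value=98.5, step=1)  # placeholder value\n",
"```\n",
"\n",
"Calling `track_example_run(\"my-experiment\", \"my-run\")` would record the placeholder values under that experiment. The rest of this notebook uses the same `Run` context around a training job instead."
]
},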
{
"attachments": {},
"cell_type": "markdown",
"id": "1141d3f8-45ed-4a56-8651-8964446befac",
"metadata": {},
"source": [
"## Install modules\n",
"\n",
"Let's ensure we have the latest SageMaker SDK available, including the SageMaker Experiments functionality"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d538673-0c04-455a-83a4-157d72edd3c1",
"metadata": {},
"outputs": [],
"source": [
"import sys"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ba6534d0-316b-4227-af84-37349d39c81b",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# update boto3 and sagemaker to ensure latest SDK version\n",
"!{sys.executable} -m pip install --upgrade pip\n",
"!{sys.executable} -m pip install --upgrade boto3\n",
"!{sys.executable} -m pip install --upgrade sagemaker\n",
"!{sys.executable} -m pip install torch\n",
"!{sys.executable} -m pip install torchvision"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3368d208-aebb-4844-bf27-2b2e373ef3d2",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Import required libraries and set logging and experiment configuration\n",
"\n",
"SageMaker Experiments now provides the `Run` class that allows you to create a new experiment run. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "037c2813-b191-4420-b37b-9c6d1cbb8057",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"from sagemaker.pytorch import PyTorch\n",
"from sagemaker.experiments.run import Run\n",
"from sagemaker.session import Session\n",
"from sagemaker import get_execution_role\n",
"from sagemaker.utils import unique_name_from_base\n",
"\n",
"role = get_execution_role()\n",
"region = Session().boto_session.region_name\n",
"\n",
"\n",
"# set new experiment configuration\n",
"experiment_name = unique_name_from_base(\"training-job-experiment\")\n",
"run_name = \"experiment-run-example\"\n",
"print(experiment_name)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d9dc0054-d7dd-4ec8-b1e9-0b292fc7b1c0",
"metadata": {},
"source": [
"## Create model training script\n",
"Let's create `mnist.py`, the pytorch script file to train our model.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "49259885-530e-4675-bd72-e21934014e0f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!mkdir -p script"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20c6e08a-92d3-4819-a080-4858337813cf",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%writefile ./script/mnist.py\n",
"# ensure that the latest version of the SageMaker SDK is available\n",
"import os\n",
"\n",
"os.system(\"pip install -U sagemaker\")\n",
"\n",
"import argparse\n",
"import json\n",
"import logging\n",
"import sys\n",
"import time\n",
"from os.path import join\n",
"import boto3\n",
"import torch\n",
"from torchvision import datasets, transforms\n",
"from sagemaker.session import Session\n",
"from sagemaker.experiments.run import load_run\n",
"\n",
"logger = logging.getLogger(__name__)\n",
"logger.setLevel(logging.DEBUG)\n",
"logger.addHandler(logging.StreamHandler(sys.stdout))\n",
"\n",
"if \"SAGEMAKER_METRICS_DIRECTORY\" in os.environ:\n",
" log_file_handler = logging.FileHandler(\n",
" join(os.environ[\"SAGEMAKER_METRICS_DIRECTORY\"], \"metrics.json\")\n",
" )\n",
" formatter = logging.Formatter(\n",
" \"{'time':'%(asctime)s', 'name': '%(name)s', \\\n",
" 'level': '%(levelname)s', 'message': '%(message)s'}\",\n",
" style=\"%\",\n",
" )\n",
" log_file_handler.setFormatter(formatter)\n",
" logger.addHandler(log_file_handler)\n",
"\n",
"\n",
"# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py\n",
"class Net(torch.nn.Module):\n",
" def __init__(self, hidden_channels, kernel_size, drop_out):\n",
" super(Net, self).__init__()\n",
" self.conv1 = torch.nn.Conv2d(1, hidden_channels, kernel_size=kernel_size)\n",
" self.conv2 = torch.nn.Conv2d(hidden_channels, 20, kernel_size=kernel_size)\n",
" self.conv2_drop = torch.nn.Dropout2d(p=drop_out)\n",
" self.fc1 = torch.nn.Linear(320, 50)\n",
" self.fc2 = torch.nn.Linear(50, 10)\n",
"\n",
" def forward(self, x):\n",
" x = torch.nn.functional.relu(torch.nn.functional.max_pool2d(self.conv1(x), 2))\n",
" x = torch.nn.functional.relu(\n",
" torch.nn.functional.max_pool2d(self.conv2_drop(self.conv2(x)), 2)\n",
" )\n",
" x = x.view(-1, 320)\n",
" x = torch.nn.functional.relu(self.fc1(x))\n",
" x = torch.nn.functional.dropout(x, training=self.training)\n",
" x = self.fc2(x)\n",
" return torch.nn.functional.log_softmax(x, dim=1)\n",
"\n",
"\n",
"def log_performance(model, data_loader, device, epoch, run, metric_type=\"Test\"):\n",
" model.eval()\n",
" loss = 0\n",
" correct = 0\n",
" with torch.no_grad():\n",
" for data, target in data_loader:\n",
" data, target = data.to(device), target.to(device)\n",
" output = model(data)\n",
" loss += torch.nn.functional.nll_loss(\n",
" output, target, reduction=\"sum\"\n",
" ).item() # sum up batch loss\n",
" # get the index of the max log-probability\n",
" pred = output.max(1, keepdim=True)[1]\n",
" correct += pred.eq(target.view_as(pred)).sum().item()\n",
" loss /= len(data_loader.dataset)\n",
" accuracy = 100.0 * correct / len(data_loader.dataset)\n",
" # log metrics\n",
" run.log_metric(name=metric_type + \":loss\", value=loss, step=epoch)\n",
" run.log_metric(name=metric_type + \":accuracy\", value=accuracy, step=epoch)\n",
" logger.info(\n",
" \"{} Average loss: {:.4f}, {} Accuracy: {:.4f}%;\\n\".format(\n",
" metric_type, loss, metric_type, accuracy\n",
" )\n",
" )\n",
"\n",
"\n",
"def train_model(\n",
" run, train_set, test_set, data_dir=\"mnist_data\", optimizer=\"sgd\", epochs=10, hidden_channels=10\n",
"):\n",
" \"\"\"\n",
" Function that trains the CNN classifier to identify the MNIST digits.\n",
" Args:\n",
" run (sagemaker.experiments.run.Run): SageMaker Experiment run object\n",
" train_set (torchvision.datasets.mnist.MNIST): train dataset\n",
" test_set (torchvision.datasets.mnist.MNIST): test dataset\n",
" data_dir (str): local directory where the MNIST datasource is stored\n",
" optimizer (str): the optimization algorthm to use for training your CNN\n",
" available options are sgd and adam\n",
" epochs (int): number of complete pass of the training dataset through the algorithm\n",
" hidden_channels (int): number of hidden channels in your model\n",
" \"\"\"\n",
"\n",
" # log the parameters of your model\n",
" run.log_parameter(\"device\", \"cpu\")\n",
" run.log_parameters(\n",
" {\n",
" \"data_dir\": data_dir,\n",
" \"optimizer\": optimizer,\n",
" \"epochs\": epochs,\n",
" \"hidden_channels\": hidden_channels,\n",
" }\n",
" )\n",
"\n",
" # train the model on the CPU (no GPU)\n",
" device = torch.device(\"cpu\")\n",
"\n",
" # set the seed for generating random numbers\n",
" torch.manual_seed(42)\n",
"\n",
" train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)\n",
" test_loader = torch.utils.data.DataLoader(test_set, batch_size=1000, shuffle=True)\n",
" logger.info(\n",
" \"Processes {}/{} ({:.0f}%) of train data\".format(\n",
" len(train_loader.sampler),\n",
" len(train_loader.dataset),\n",
" 100.0 * len(train_loader.sampler) / len(train_loader.dataset),\n",
" )\n",
" )\n",
"\n",
" logger.info(\n",
" \"Processes {}/{} ({:.0f}%) of test data\".format(\n",
" len(test_loader.sampler),\n",
" len(test_loader.dataset),\n",
" 100.0 * len(test_loader.sampler) / len(test_loader.dataset),\n",
" )\n",
" )\n",
" model = Net(hidden_channels, kernel_size=5, drop_out=0.5).to(device)\n",
" model = torch.nn.DataParallel(model)\n",
" momentum = 0.5\n",
" lr = 0.01\n",
" log_interval = 100\n",
" if optimizer == \"sgd\":\n",
" optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)\n",
" else:\n",
" optimizer = torch.optim.Adam(model.parameters(), lr=lr)\n",
"\n",
" for epoch in range(1, epochs + 1):\n",
" print(\"Training Epoch:\", epoch)\n",
" model.train()\n",
" for batch_idx, (data, target) in enumerate(train_loader, 1):\n",
" data, target = data.to(device), target.to(device)\n",
" optimizer.zero_grad()\n",
" output = model(data)\n",
" loss = torch.nn.functional.nll_loss(output, target)\n",
" loss.backward()\n",
" optimizer.step()\n",
" if batch_idx % log_interval == 0:\n",
" logger.info(\n",
" \"Train Epoch: {} [{}/{} ({:.0f}%)], Train Loss: {:.6f};\".format(\n",
" epoch,\n",
" batch_idx * len(data),\n",
" len(train_loader.sampler),\n",
" 100.0 * batch_idx / len(train_loader),\n",
" loss.item(),\n",
" )\n",
" )\n",
" log_performance(model, train_loader, device, epoch, run, \"Train\")\n",
" log_performance(model, test_loader, device, epoch, run, \"Test\")\n",
" # log confusion matrix\n",
" with torch.no_grad():\n",
" for data, target in test_loader:\n",
" data, target = data.to(device), target.to(device)\n",
" output = model(data)\n",
" pred = output.max(1, keepdim=True)[1]\n",
" run.log_confusion_matrix(target, pred, \"Confusion-Matrix-Test-Data\")\n",
" return model\n",
"\n",
"\n",
"def model_fn(model_dir):\n",
" device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"\n",
" hidden_channels = int(os.environ.get(\"hidden_channels\", \"5\"))\n",
" kernel_size = int(os.environ.get(\"kernel_size\", \"5\"))\n",
" dropout = float(os.environ.get(\"dropout\", \"0.5\"))\n",
" model = torch.nn.DataParallel(Net(hidden_channels, kernel_size, dropout))\n",
" with open(os.path.join(model_dir, \"model.pth\"), \"rb\") as f:\n",
" model.load_state_dict(torch.load(f))\n",
" return model.to(device)\n",
"\n",
"\n",
"def save_model(model, model_dir, run):\n",
" logger.info(\"Saving the model.\")\n",
" path = os.path.join(model_dir, \"model.pth\")\n",
" # recommended way from http://pytorch.org/docs/master/notes/serialization.html\n",
" torch.save(model.cpu().state_dict(), path)\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
" parser = argparse.ArgumentParser()\n",
"\n",
" parser.add_argument(\n",
" \"--epochs\",\n",
" type=int,\n",
" default=10,\n",
" metavar=\"N\",\n",
" help=\"number of epochs to train (default: 10)\",\n",
" )\n",
" parser.add_argument(\"--optimizer\", type=str, default=\"sgd\", help=\"optimizer for training.\")\n",
" parser.add_argument(\n",
" \"--hidden_channels\",\n",
" type=int,\n",
" default=10,\n",
" help=\"number of channels in hidden conv layer\",\n",
" )\n",
" parser.add_argument(\"--region\", type=str, default=\"us-east-2\", help=\"SageMaker Region\")\n",
"\n",
" # Container environment\n",
" parser.add_argument(\"--hosts\", type=list, default=json.loads(os.environ[\"SM_HOSTS\"]))\n",
" parser.add_argument(\"--current-host\", type=str, default=os.environ[\"SM_CURRENT_HOST\"])\n",
" parser.add_argument(\"--model-dir\", type=str, default=os.environ[\"SM_MODEL_DIR\"])\n",
" parser.add_argument(\"--num-gpus\", type=int, default=os.environ[\"SM_NUM_GPUS\"])\n",
"\n",
" args = parser.parse_args()\n",
" # download the dataset\n",
" # this will not only download data to ./mnist folder, but also load and transform (normalize) them\n",
" datasets.MNIST.urls = [\n",
" f\"https://sagemaker-example-files-prod-{args.region}.s3.amazonaws.com/datasets/image/MNIST/train-images-idx3-ubyte.gz\",\n",
" f\"https://sagemaker-example-files-prod-{args.region}s.s3.amazonaws.com/datasets/image/MNIST/train-labels-idx1-ubyte.gz\",\n",
" f\"https://sagemaker-example-files-prod-{args.region}.s3.amazonaws.com/datasets/image/MNIST/t10k-images-idx3-ubyte.gz\",\n",
" f\"https://sagemaker-example-files-prod-{args.region}.s3.amazonaws.com/datasets/image/MNIST/t10k-labels-idx1-ubyte.gz\",\n",
" ]\n",
" train_set = datasets.MNIST(\n",
" \"mnist_data\",\n",
" train=True,\n",
" transform=transforms.Compose(\n",
" [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]\n",
" ),\n",
" download=True,\n",
" )\n",
"\n",
" test_set = datasets.MNIST(\n",
" \"mnist_data\",\n",
" train=False,\n",
" transform=transforms.Compose(\n",
" [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]\n",
" ),\n",
" download=True,\n",
" )\n",
"\n",
" session = Session(boto3.session.Session(region_name=args.region))\n",
" with load_run(sagemaker_session=session) as run:\n",
" run.log_parameters(\n",
" {\"num_train_samples\": len(train_set.data), \"num_test_samples\": len(test_set.data)}\n",
" )\n",
" for f in os.listdir(train_set.raw_folder):\n",
" print(\"Logging\", train_set.raw_folder + \"/\" + f)\n",
" run.log_file(train_set.raw_folder + \"/\" + f, name=f, is_output=False)\n",
" model = train_model(\n",
" run,\n",
" train_set,\n",
" test_set,\n",
" data_dir=\"mnist_data\",\n",
" optimizer=args.optimizer,\n",
" epochs=args.epochs,\n",
" hidden_channels=args.hidden_channels,\n",
" )\n",
" save_model(model, args.model_dir, run)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0342b174-1d33-4b27-a7cd-28571f1a1507",
"metadata": {},
"source": [
"The cell above saves the `mnist.py` file to our script folder. The file implements the code necessary to train our PyTorch model in SageMaker, using the SageMaker PyTorch image. It uses the `load_run` function to automatically detect the experiment configuration and `run.log_parameter`, `run.log_parameters`, `run.log_file`, `run.log_metric` and `run.log_confusion_matrix` to track the model training"
]
},
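{
"attachments": {},
"cell_type": "markdown",
"id": "c4d9f2b6-7a1e-4b3c-8d25-6e0f1a2b3c47",
"metadata": {},
"source": [
"The `Net` class in the script hard-codes 320 input features for `fc1`. The sketch below reproduces that number with plain arithmetic, assuming 28x28 MNIST inputs, unpadded stride-1 convolutions, and 2x2 max pooling as in the script. The helper names are made up for this illustration and are not part of `mnist.py`.\n",
"\n",
"```python\n",
"def conv_output_size(size, kernel_size, stride=1, padding=0):\n",
"    # Spatial size after a square convolution without dilation.\n",
"    return (size + 2 * padding - kernel_size) // stride + 1\n",
"\n",
"\n",
"def pool_output_size(size, pool_size):\n",
"    # Spatial size after non-overlapping max pooling (floor division).\n",
"    return size // pool_size\n",
"\n",
"\n",
"def flattened_features(image_size=28, kernel_size=5, out_channels=20):\n",
"    size = conv_output_size(image_size, kernel_size)  # conv1: 28 -> 24\n",
"    size = pool_output_size(size, 2)  # pool: 24 -> 12\n",
"    size = conv_output_size(size, kernel_size)  # conv2: 12 -> 8\n",
"    size = pool_output_size(size, 2)  # pool: 8 -> 4\n",
"    return out_channels * size * size\n",
"\n",
"\n",
"print(flattened_features())  # 20 * 4 * 4 = 320\n",
"```\n",
"\n",
"Note that the result does not depend on `hidden_channels`, which only sets the number of channels between `conv1` and `conv2`. Changing `kernel_size` would change the flattened size, so the hard-coded 320 only holds for the 5x5 kernels used here."
]
},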
{
"attachments": {},
"cell_type": "markdown",
"id": "cc1913d9-fe5f-4cf1-aca6-c6f6c12bd21c",
"metadata": {},
"source": [
"## Train model with Run context\n",
"\n",
"Let's now train the model with passing the experiement run context to the training job"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "93f266e0-d73d-452c-a3ed-51b0fc48075d",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"%%time\n",
"# Start training job with experiment setting\n",
"with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=Session()) as run:\n",
" est = PyTorch(\n",
" entry_point=\"./script/mnist.py\",\n",
" role=role,\n",
" model_dir=False,\n",
" framework_version=\"1.12\",\n",
" py_version=\"py38\",\n",
" instance_type=\"ml.c5.xlarge\",\n",
" instance_count=1,\n",
" hyperparameters={\"epochs\": 10, \"hidden_channels\": 5, \"optimizer\": \"adam\", \"region\": region},\n",
" keep_alive_period_in_seconds=3600,\n",
" )\n",
"\n",
" est.fit()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "259c4fca-5128-4213-8065-cb68bd50b973",
"metadata": {},
"source": [
"Checking the SageMaker Experiments UI, you can observe the Experiment run, populated with the metrics and parameters logged. We can also see the automatically generated outputs for the model data\n",
"\n",
"\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9a8d7c57",
"metadata": {},
"source": [
"## Notebook CI Test Results\n",
"\n",
"This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
}
],
"metadata": {
"forced_instance_type": "ml.t3.medium",
"forced_lcc_arn": "",
"instance_type": "ml.t3.medium",
"kernelspec": {
"display_name": "Python 3 (PyTorch 1.12 Python 3.8 CPU Optimized)",
"language": "python",
"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/pytorch-1.12-cpu-py38"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}