{ "cells": [ { "cell_type": "markdown", "id": "1e63dc82", "metadata": {}, "source": [ "# 2D Semantic Segmentation on the Audi A2D2 Dataset\n", "In this notebook, we train a model on the 2D semantic segmentation annotations from the Audi A2D2 Dataset https://www.a2d2.audi/a2d2/en/dataset.html. The dataset can also be accessed from the the AWS Open Data Registry https://registry.opendata.aws/aev-a2d2/. \n", "\n", "We do the following: \n", "\n", " 1. We download the semantic segmentation dataset archive\n", " 1. We inspect and describe the data\n", " 1. We run local processing to produce a dataset manifest (list of all records), and split the data in training and validation sections. \n", " 1. We send the data to Amazon S3\n", " 1. We create a PyTorch script training a DeepLabV3 model, that we test locally for few iterations\n", " 1. We launch our training script on a remote, long-running machine with SageMaker Training API. \n", " 1. We show how to run bayesian parameter search to tune the metric of your choice (loss, accuracy, troughput...). To keep costs low, this is deactivated by default\n", " 1. We open a model checkpoint (collected in parallel to training by SageMaker Training) to check prediction quality\n", "\n", "The demo was created from a SageMaker ml.g4dn.16xlarge Notebook instance, with a Jupyter Kernel `conda_pytorch_latest_p37` (`torch 1.8.1+cu111`, `torchvision 0.9.1+cu111`). Feel free to use a different instance for the download and pre-processing step so that a GPU doesn't sit idle, and switch to a different instance type later. Note that the dataset download and extraction **do not run well on SageMaker Studio Notebooks**, whose storage is EFS based and struggles to handle the 80k+ files composing the dataset. Launching API calls (training job, tuning jobs) from Studio should run fine though.\n", "\n", "**IMPORTANT NOTES**\n", "\n", "* **This sample is written for single-GPU instances only. Using machines with more than 1 GPU or running the training code on more than 1 machines will not use all available hardware**\n", "\n", "* **Running this demo necessitates at least 400 Gb of local storage space**\n", "\n", "* **Running this demo on an ml.G4dn.16xlarge instance in region eu-west-1 takes approximately 50min of notebook uptime and approximately 12h of SageMaker Training job execution (excluding the bayesian parameter search, de-activated by default). This represents approximately 6 USD of notebook usage (if running on ml.g4dn.16xlarge) and 72 USD of training API**\n", "\n", "* **This demo uses non-AWS, open-source libraries including PyTorch, PIL, matplotlib, Torchvision. Use appropriate due diligence to verify if that use fits the software standards and compliance rules in place at your organization** \n", "\n", "* **This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if derivating this code for your own use-cases. In general it is recommend to isolate development from production environments. 
Read more in the AWS Well Architected Framework https://aws.amazon.com/architecture/well-architected/**" ] }, { "cell_type": "code", "execution_count": null, "id": "01d8ba82", "metadata": {}, "outputs": [], "source": [ "import json\n", "import multiprocessing as mp\n", "import os\n", "import time\n", "import uuid\n", "\n", "import boto3\n", "from PIL import Image\n", "import sagemaker\n", "from sagemaker import Session\n", "from sagemaker.pytorch import PyTorch\n", "from sagemaker import get_execution_role\n", "\n", "\n", "sess = Session()\n", "bucket = '' # SageMaker will use this bucket to store data, script and model checkpoints\n", "\n", "s3 = boto3.client('s3')" ] }, { "cell_type": "markdown", "id": "f334c36c", "metadata": {}, "source": [ "# 1. Dataset preparation" ] }, { "cell_type": "code", "execution_count": null, "id": "5f23036d", "metadata": {}, "outputs": [], "source": [ "# Data will be downloaded there, and new folders created. Feel free to customize\n", "work_dir = '/home/ec2-user/SageMaker'\n", "dataset_prefix = 'a2d2_images'\n", "data_dir = work_dir + '/' + dataset_prefix\n", "\n", "# locations used for local testing\n", "local_dataset_cache = work_dir + '/a2d2-tmp'\n", "local_checkpoint_location = work_dir + '/a2d2-checkpoints'" ] }, { "cell_type": "markdown", "id": "3cf8eee0", "metadata": {}, "source": [ "### Download files" ] }, { "cell_type": "code", "execution_count": null, "id": "f07c802c", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# Download images. This took 12min on a ml.g4dn.16xlarge instance in eu-west-1 region\n", "! aws s3 cp s3://aev-autonomous-driving-dataset/camera_lidar_semantic.tar $work_dir" ] }, { "cell_type": "code", "execution_count": null, "id": "e32d2e54", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# Download labels\n", "! aws s3 cp s3://aev-autonomous-driving-dataset/camera_lidar_semantic_instance.tar $work_dir" ] }, { "cell_type": "code", "execution_count": null, "id": "7407ff25", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# Download the README\n", "! aws s3 cp s3://aev-autonomous-driving-dataset/README-SemSeg.txt $work_dir" ] }, { "cell_type": "markdown", "id": "22b0d291", "metadata": {}, "source": [ "### Uncompress\n", "This takes about 20min" ] }, { "cell_type": "code", "execution_count": null, "id": "889444f4", "metadata": {}, "outputs": [], "source": [ "# We create a new folder dedicated to the A2D2 dataset\n", "print('Creating folder {}'.format(data_dir))\n", "\n", "try:\n", " os.mkdir(data_dir)\n", " \n", "except(FileExistsError):\n", " print('Directory already exists')" ] }, { "cell_type": "code", "execution_count": null, "id": "73c325a0", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "! tar -xf {work_dir}/camera_lidar_semantic.tar -C $data_dir" ] }, { "cell_type": "markdown", "id": "5d1cfbec", "metadata": {}, "source": [ "### Analyse dataset structure\n", "We check how labels and images are organized. 
Understanding this layout is necessary to build an appropriate `Dataset` class." ] }, { "cell_type": "code", "execution_count": null, "id": "319d359d", "metadata": {}, "outputs": [], "source": [ "# Frames are grouped in 23 sequences\n", "data_folder = 'camera_lidar_semantic'\n", "os.listdir(os.path.join(data_dir, data_folder))" ] }, { "cell_type": "code", "execution_count": null, "id": "64c81205", "metadata": {}, "outputs": [], "source": [ "# each sequence contains folders for labels, lidar and camera captures\n", "os.listdir(os.path.join(data_dir, data_folder, '20180925_112730'))" ] }, { "cell_type": "code", "execution_count": null, "id": "6049a6bb", "metadata": {}, "outputs": [], "source": [ "# each of those folders contains one or more sub-folders, one per camera that captured the data\n", "os.listdir(os.path.join(data_dir, data_folder, '20180925_112730/camera'))" ] }, { "cell_type": "code", "execution_count": null, "id": "6e23aa29", "metadata": {}, "outputs": [], "source": [ "# first 10 records of the front center camera capture of the 2018-09-25 11:27:30 sequence\n", "os.listdir(os.path.join(data_dir, data_folder, '20180925_112730/camera/cam_front_center'))[:10]" ] }, { "cell_type": "code", "execution_count": null, "id": "30ad9788", "metadata": {}, "outputs": [], "source": [ "# view one image\n", "\n", "image_id = '000074771'\n", "\n", "\n", "with Image.open(os.path.join(data_dir, data_folder, '20180925_112730/camera/cam_front_center/'\n", " + '20180925112730_camera_frontcenter_{}.png'.format(image_id))) as im:\n", "\n", " im.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "9d1afc84", "metadata": {}, "outputs": [], "source": [ "# view the associated label\n", "\n", "with Image.open(os.path.join(data_dir, data_folder, '20180925_112730/label/cam_front_center/'\n", " + '20180925112730_label_frontcenter_{}.png'.format(image_id))) as im:\n", " \n", " im.show()" ] }, { "cell_type": "markdown", "id": "eedb8950", "metadata": {}, "source": [ "### Anomalies to watch out for\n", "\n", "* As of October 2021, record `a2d2_images/camera_lidar_semantic/20180925_135056/label/cam_side_left/20180925135056_label_sideleft_000026512.png` returns a 4-channel image" ] }, { "cell_type": "markdown", "id": "24a9f52a", "metadata": {}, "source": [ "### Pre-process\n", "To simplify the ML process, we build a flat JSON manifest mapping each record ID to its image path and label path" ] }, { "cell_type": "code", "execution_count": null, "id": "3478a283", "metadata": {}, "outputs": [], "source": [ "root = os.path.join(data_dir, data_folder) # where we'll read images from\n", "relative = os.path.join(dataset_prefix, data_folder) # the image key prefix we'll use to write images in S3\n", "\n", "# we sort sequences so that the train-test split by sequence index is deterministic\n", "sequences = [s for s in os.listdir(root) if s.startswith('2018')]\n", "sequences.sort()\n", "print(sequences)" ] }, { "cell_type": "code", "execution_count": null, "id": "2fa545fb", "metadata": {}, "outputs": [], "source": [ "manifest = {}\n", "\n", "for s in sequences:\n", " cameras = os.listdir(root + '/{}/camera'.format(s))\n", " for c in cameras:\n", " \n", " images = [f for f in os.listdir(root + '/{}/camera/{}'.format(s, c))\n", " if f.endswith('.png')]\n", " \n", " for i in images:\n", " label_name = i.replace('camera', 'label')\n", " im_id = i[:i.find('_')] + '_' + i[i.rfind('.')-9:i.rfind('.')]\n", " image_path_local = root + '/{}/camera/{}/{}'.format(s, c, i)\n", " label_path_local = root + '/{}/label/{}/{}'.format(s, c, label_name)\n",
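 " # the relative paths below double as the S3 object keys used at upload time and as the paths stored in the manifest\n",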
 " image_path_manifest = relative + '/{}/camera/{}/{}'.format(s, c, i)\n", " label_path_manifest = relative + '/{}/label/{}/{}'.format(s, c, label_name)\n", " \n", " # create a record only if both the image file and the label file exist:\n", " if os.path.isfile(image_path_local) and os.path.isfile(label_path_local):\n", " manifest[im_id] = {}\n", " manifest[im_id]['sequence_id'] = s\n", " manifest[im_id]['image_name'] = i\n", " manifest[im_id]['label_name'] = label_name\n", " # remove the work-dir from the path so that the manifest stays small and generic\n", " manifest[im_id]['image_path'] = image_path_manifest\n", " manifest[im_id]['label_path'] = label_path_manifest\n", " else:\n", " print('issue with image {} : -------'.format(image_path_local))\n", " # report which of the two files is missing\n", " print('image file {} exists: {}'.format(image_path_local, os.path.isfile(image_path_local)))\n", " print('label file {} exists: {}'.format(label_path_local, os.path.isfile(label_path_local))) \n", " \n", "print(\"Created a dataset manifest with {} records\".format(len(manifest)))" ] }, { "cell_type": "markdown", "id": "3f6316bf", "metadata": {}, "source": [ "We then send the images to S3 with a multi-processing call. This should take 10-15min on a large G4 instance and results in 139 GB on S3. You can try to go faster by using more workers in `multiprocessing.Pool(workers)`, but be aware that too much concurrency may cause instability and crashes in your kernel and instance" ] }, { "cell_type": "code", "execution_count": null, "id": "f2caf42e", "metadata": {}, "outputs": [], "source": [ "def send_images_to_s3(image_id):\n", " \n", " s3.upload_file(Filename=work_dir + '/' + manifest[image_id]['image_path'],\n", " Bucket=bucket,\n", " Key=manifest[image_id]['image_path'])\n", " \n", " s3.upload_file(Filename=work_dir + '/' + manifest[image_id]['label_path'],\n", " Bucket=bucket,\n", " Key=manifest[image_id]['label_path'])" ] }, { "cell_type": "code", "execution_count": null, "id": "8f225aca", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "with mp.Pool(mp.cpu_count()) as pool:\n", " \n", " pool.map(send_images_to_s3, manifest.keys())" ] }, { "cell_type": "code", "execution_count": null, "id": "2fa39363", "metadata": {}, "outputs": [], "source": [ "# we also need to send the class list to S3\n", "s3.upload_file(\n", " Filename=root + '/' + 'class_list.json',\n", " Bucket=bucket,\n", " Key=dataset_prefix + '/metadata/class_list.json')" ] }, { "cell_type": "markdown", "id": "ba7a85b3", "metadata": {}, "source": [ "We split the dataset into a training and a validation manifest" ] }, { "cell_type": "code", "execution_count": null, "id": "482430cb", "metadata": {}, "outputs": [], "source": [ "split = 0.9\n", "train_sequences = sequences[:int(split*len(sequences))]\n", "val_sequences = sequences[int(split*len(sequences)):]" ] }, { "cell_type": "code", "execution_count": null, "id": "8195041c", "metadata": {}, "outputs": [], "source": [ "train_manifest = {k:manifest[k] for k in manifest.keys() if manifest[k]['sequence_id'] in train_sequences}\n", "val_manifest = {k:manifest[k] for k in manifest.keys() if manifest[k]['sequence_id'] in val_sequences}" ] }, { "cell_type": "code", "execution_count": null, "id": "6709b0c0", "metadata": {}, "outputs": [], "source": [ "print(\"training set contains {} records\".format(len(train_manifest)))\n", "print(\"validation set contains {} records\".format(len(val_manifest)))" ] }, { "cell_type": "code", "execution_count": null, "id": "2d434134", "metadata": {}, "outputs": [], 
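"source": [ "# Optional sanity check (illustrative addition): peek at one record to see the manifest schema.\n", "# The keys below come from the manifest-building loop above; the actual values differ per record.\n", "sample_id = next(iter(train_manifest))\n", "print(sample_id)\n", "print(json.dumps(train_manifest[sample_id], indent=2))" ] }, { "cell_type": "code", "execution_count": null, "id": "7c1e4f9a", "metadata": {}, "outputs": [], 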
"source": [ "with open(work_dir + \"/train_manifest.json\", \"w\") as file:\n", " json.dump(train_manifest, file)\n", " \n", "with open(work_dir + \"/val_manifest.json\", \"w\") as file:\n", " json.dump(val_manifest, file)" ] }, { "cell_type": "code", "execution_count": null, "id": "d0a47cea", "metadata": {}, "outputs": [], "source": [ "for file in ['train_manifest.json', 'val_manifest.json']:\n", "\n", " s3.upload_file(\n", " Filename=work_dir + '/' + file,\n", " Bucket=bucket,\n", " Key=dataset_prefix + '/metadata/{}'.format(file))\n", " \n", "train_path = 's3://{}/'.format(bucket) + dataset_prefix + '/metadata/'\n", "print('Training manifests sent to {}'.format(train_path))" ] }, { "cell_type": "markdown", "id": "b205e26b", "metadata": {}, "source": [ "# 2. Single-GPU training\n", "We create the training script as single Python file. \n", "To make training code scalable and portable, we create a custom PyTorch Dataset that reads images and segmentation masks directly from S3, and save in local cache in case of later re-use (eg if training with multiple epochs). That was we have a data pipeline that does not need to wait for all dataset to be downloaded locally, but that will read at low-latency after the first epoch.\n", "\n", "**Note** this DL training code is **far from state-of-the-art**. The goal of this sample is not to reach a good accuracy, but rather to show how to scale custom training jobs in Amazon SageMaker. \n", "\n", " * **Better accuracy** can likely be reached using data augmentation, learning rate scheduling, a better backbone, and adding the auxiliary DeepLabV3 loss. And why not a totally different segmentation model instead of DeepLabV3\n", " \n", " * **Better throughput** can likely be reached using a sequential-access dataset, that reads group of records, or the SageMaker Fast File Mode, that streams files upon read request. Also, although I'm configuring it below, I am not sure if float16 precision compute occur and if NVIDIA TensorCores are actually used. This would be an important step to make full use of the computational power of modern NVIDIA cards. Converting labels to grayscale should also help making the dataloading lighter. Offloading data loading to the GPU, for example using NVIDIA DALI, is another axis to explore to boost throughput." ] }, { "cell_type": "markdown", "id": "594072ac", "metadata": {}, "source": [ "## Run locally" ] }, { "cell_type": "code", "execution_count": null, "id": "9065c68a", "metadata": {}, "outputs": [], "source": [ "# create a folder to cache dataset as it is downloaded by the dataset class\n", "print('Local dataset cache created at {}'.format(local_dataset_cache))\n", "print('Local checkpoints will be stored at {}'.format(local_checkpoint_location))\n", "\n", "try:\n", " os.mkdir(local_dataset_cache)\n", " \n", "except(FileExistsError):\n", " print('{} already exists'.format(local_dataset_cache))\n", "\n", "try:\n", " os.mkdir(local_checkpoint_location)\n", " \n", "except(FileExistsError):\n", " print('{} already exists'.format(local_checkpoint_location))" ] }, { "cell_type": "markdown", "id": "bcbefe47", "metadata": {}, "source": [ "### Single-device code\n", "can be run in a Python process" ] }, { "cell_type": "code", "execution_count": null, "id": "72d83ec7", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# test on 20 iterations. \n", "# This takes a few minutes. You can see instance activity live using htop or nividia-smi in instance terminal\n", "! 
python a2d2_code/train.py --dataset $work_dir \\\n", " --cache $local_dataset_cache \\\n", " --height 604 \\\n", " --width 960 \\\n", " --checkpoint-dir $local_checkpoint_location \\\n", " --batch 12 \\\n", " --network deeplabv3_mobilenet_v3_large \\\n", " --workers 24 \\\n", " --log-freq 20 \\\n", " --prefetch 2 \\\n", " --bucket $bucket \\\n", " --eval-size 10 \\\n", " --iterations 20 \\\n", " --class-list a2d2_images/camera_lidar_semantic/class_list.json" ] }, { "cell_type": "markdown", "id": "f352b8c8", "metadata": {}, "source": [ "## Launch in SageMaker Training\n", "We use the SageMaker Python SDK to orchestrate SageMaker Training clusters. Note that if you don't want to learn yet another SDK, you can also do exactly the same thing with existing AWS SDKs, for example the AWS CLI (https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-training-job.html) and boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job)" ] }, { "cell_type": "code", "execution_count": null, "id": "d6cd13d9", "metadata": {}, "outputs": [], "source": [ "config = {\n", " 'bucket': bucket,\n", " 'cache': '/opt/ml/input/data/dataset',\n", " 'height': 604,\n", " 'width': 960,\n", " 'epochs': 10,\n", " 'batch': 12,\n", " 'prefetch': 1,\n", " 'workers': 40,\n", " 'eval-size': 36,\n", " 'lr': 0.183,\n", " 'momentum': 0.928,\n", " 'lr_warmup_ratio':0.1,\n", " 'lr_decay_per_epoch': 0.3,\n", " 'log-freq': 500}" ] }, { "cell_type": "code", "execution_count": null, "id": "9be69715", "metadata": {}, "outputs": [], "source": [ "# Training time of this job will be approximately 12h\n", "\n", "token = str(uuid.uuid4())[:10] # we create a unique token to avoid checkpoint collisions in S3\n", "\n", "job = PyTorch(\n", " entry_point='train.py',\n", " source_dir='a2d2_code',\n", " role=get_execution_role(),\n", " framework_version='1.8.1',\n", " instance_count=1,\n", " instance_type='ml.g4dn.16xlarge',\n", " base_job_name='A2D2-single-GPU-seg-training',\n", " py_version='py36',\n", " hyperparameters=config,\n", " checkpoint_s3_uri='s3://{}/{}/checkpoints'.format(bucket, token), # S3 destination of /opt/ml/checkpoints files\n", " output_path='s3://{}/{}'.format(bucket, token),\n", " code_location='s3://{}/{}/code'.format(bucket, token), # source_dir code will be staged in S3 there\n", " environment={\"SMDEBUG_LOG_LEVEL\":\"off\"}, # reduce verbosity of Debugger\n", " debugger_hook_config=False, # deactivate debugger to avoid warnings in model artifact\n", " disable_profiler=True, # keep running resources to a minimum to avoid permission errors\n", " metric_definitions=[\n", " {\"Name\": \"Train_loss\", \"Regex\": \"Training_loss: ([0-9.]+).*$\"},\n", " {\"Name\": \"Learning_rate\", \"Regex\": \"learning rate: ([0-9.]+).*$\"}, \n", " {\"Name\": \"Val_loss\", \"Regex\": \"Val_loss: ([0-9.]+).*$\"}, \n", " {\"Name\": \"Throughput\", \"Regex\": \"Throughput: ([0-9.]+).*$\"}\n", " ],\n", " tags=[{'Key': 'Project', 'Value': 'A2D2_segmentation'}]) # tag the job for experiment tracking" ] }, { "cell_type": "markdown", "id": "d394e7db", "metadata": {}, "source": [ "SageMaker-managed I/O uploads only dataset metadata (class_list and manifest). 
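\n",
"\n",
"To illustrate how such on-demand reading with a local cache can work, here is a minimal sketch of a map-style PyTorch dataset (the class and argument names below are invented for illustration; the class actually used by the training script is `A2D2_S3_dataset` from `a2d2_code/a2d2_utils.py`, which additionally handles the resizing transforms and the class list):\n",
"\n",
"```python\n",
"import json\n",
"import os\n",
"\n",
"import boto3\n",
"from torch.utils.data import Dataset\n",
"from torchvision.io import read_image\n",
"\n",
"\n",
"class S3CachedSegmentationDataset(Dataset):\n",
"    # Illustrative only: serves (image, label) pairs listed in a manifest,\n",
"    # downloading each file from S3 on first access and caching it locally.\n",
"\n",
"    def __init__(self, manifest_file, bucket, cache_dir):\n",
"        with open(manifest_file) as f:\n",
"            self.manifest = json.load(f)\n",
"        self.ids = sorted(self.manifest.keys())\n",
"        self.bucket = bucket\n",
"        self.cache_dir = cache_dir\n",
"        self.s3 = boto3.client('s3')\n",
"\n",
"    def _fetch(self, key):\n",
"        # download from S3 only if the file is not already in the local cache\n",
"        local_path = os.path.join(self.cache_dir, key)\n",
"        if not os.path.isfile(local_path):\n",
"            os.makedirs(os.path.dirname(local_path), exist_ok=True)\n",
"            self.s3.download_file(self.bucket, key, local_path)\n",
"        return local_path\n",
"\n",
"    def __len__(self):\n",
"        return len(self.ids)\n",
"\n",
"    def __getitem__(self, idx):\n",
"        record = self.manifest[self.ids[idx]]\n",
"        image = read_image(self._fetch(record['image_path']))\n",
"        label = read_image(self._fetch(record['label_path']))\n",
"        return image, label\n",
"```\n",
"\n",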
"The actual records and labels are fetched upon request directly from S3 or the local cache via our custom `Dataset`." ] }, { "cell_type": "code", "execution_count": null, "id": "09553b84", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# we do an asynchronous fit, so the job doesn't keep the client waiting. \n", "# closing and shutting down your notebook will not stop this job.\n", "# if you want to stop this SageMaker Training job, use an AWS SDK or the console\n", "job.fit({'dataset': train_path}, wait=False)" ] }, { "cell_type": "markdown", "id": "8814a39e", "metadata": {}, "source": [ "## Custom Metric Tuning" ] }, { "cell_type": "markdown", "id": "828e3d9f", "metadata": {}, "source": [ "SageMaker Automated Model Tuning is a serverless, managed, use-case-agnostic parameter search service. With SageMaker AMT (sometimes called HPO - Hyperparameter Optimization) you can tune any parameter declared in the container hyperparameter dictionary (continuous, integer or categorical) and you can tune for any metric (minimize or maximize) that you can regexp from your container or script logs. SageMaker AMT is not limited to model tuning. You can be creative with it, and for example tune jobs to minimize the training time or training cost. See https://aws.amazon.com/blogs/machine-learning/aerobotics-improves-training-speed-by-24-times-per-sample-with-amazon-sagemaker-and-tensorflow/ for a nice example. \n", "\n", "More info:\n", "\n", " * https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html\n", " * *Amazon SageMaker Automatic Model Tuning: Scalable Gradient-Free Optimization*, Perone et al. (https://arxiv.org/abs/2012.08489)" ] }, { "cell_type": "code", "execution_count": null, "id": "98e3d39b", "metadata": {}, "outputs": [], "source": [ "Tuning = False # set to True if you want to test the tuning below" ] }, { "cell_type": "code", "execution_count": null, "id": "89f9bc3e", "metadata": {}, "outputs": [], "source": [ "print(\"tuning cell set to {}\".format(Tuning))\n", "\n", "if Tuning:\n", "\n", " # we use the SageMaker Tuner\n", " from sagemaker.tuner import IntegerParameter, ContinuousParameter\n", " \n", " \n", " tuning_config = {\n", " 'bucket': bucket,\n", " 'cache': '/opt/ml/input/data/dataset',\n", " 'height': 604,\n", " 'width': 960,\n", " 'epochs': 5,\n", " 'prefetch': 1,\n", " 'workers': 40,\n", " 'eval-size': 36,\n", " 'log-freq': 500}\n", " \n", " \n", " tuning_estimator = PyTorch(\n", " entry_point='train.py',\n", " source_dir='a2d2_code',\n", " role=get_execution_role(),\n", " framework_version='1.8.1',\n", " instance_count=1,\n", " instance_type='ml.g4dn.16xlarge',\n", " py_version='py36',\n", " max_run=28800, # cap the max runtime at 8h per job\n", " hyperparameters=tuning_config,\n", " checkpoint_s3_uri='s3://{}/checkpoints'.format(bucket), # S3 destination of /opt/ml/checkpoints files\n", " output_path='s3://{}'.format(bucket),\n", " code_location='s3://{}/code'.format(bucket), # source_dir code will be staged in S3 there \n", " environment={\"SMDEBUG_LOG_LEVEL\":\"off\"}, # reduce verbosity of Debugger\n", " debugger_hook_config=False, # deactivate debugger to avoid warnings in model artifact\n", " disable_profiler=True, # keep running resources to a minimum to avoid permission errors\n", " metric_definitions=[ \n", " {\"Name\": \"Val_loss\", \"Regex\": \"Val_loss: ([0-9.]+).*$\"}, \n", " ],\n", " tags=[{'Key': 'Project', 'Value': 'A2D2_segmentation'}])\n", " \n", " \n", " # Define exploration boundaries\n", " hyperparameter_ranges = {\n", " 'lr': 
ContinuousParameter(0.001, 0.01),\n", " 'momentum': ContinuousParameter(0.8, 0.99),\n", " 'lr_warmup_ratio': ContinuousParameter(1, 10),\n", " 'lr_decay_per_epoch': ContinuousParameter(0.1, 0.8),\n", " 'batch': IntegerParameter(6, 12)\n", " }\n", " \n", " \n", " # create the Optimizer\n", " # you can tune for anything you can regexp from your logs\n", " # in this sample we minimize the validation loss\n", " Optimizer = sagemaker.tuner.HyperparameterTuner(\n", " estimator=tuning_estimator,\n", " hyperparameter_ranges=hyperparameter_ranges,\n", " base_tuning_job_name='Loss-tuner',\n", " objective_type='Minimize',\n", " objective_metric_name='Val_loss',\n", " strategy='Bayesian',\n", " early_stopping_type='Auto',\n", " metric_definitions=[\n", " {\"Name\": \"Val_loss\", \"Regex\": \"Val_loss: ([0-9.]+).*$\"}\n", " ], \n", " max_jobs=40,\n", " max_parallel_jobs=2)\n", " \n", " \n", " Optimizer.fit({'dataset': train_path}, wait=False)\n", " \n", " print(\"Tuning job launched\")" ] }, { "cell_type": "markdown", "id": "c988d3f9", "metadata": {}, "source": [ "# 3. Predict with trained model\n", "To test the trained model, we run inference on a couple of samples from the validation set. You can run this section on its own once you have a trained model" ] }, { "cell_type": "code", "execution_count": null, "id": "b07d91de", "metadata": {}, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "import torch\n", "from torchvision import models\n", "from torchvision.io import read_image\n", "from torchvision.datasets.vision import VisionDataset\n", "from torchvision.models.segmentation.deeplabv3 import DeepLabHead\n", "from torchvision.transforms import Resize\n", "from torchvision.transforms.functional import InterpolationMode" ] }, { "cell_type": "markdown", "id": "c1f7ce3e", "metadata": {}, "source": [ "### Bring a checkpoint from S3\n", "In the cell below we download a checkpoint produced by a training job, which could come either from the above-launched training job or from one of the jobs launched by the optional tuning step.\n", "\n", "To check the available checkpoints for a given training job, you can inspect the S3 URI returned under `CheckpointConfig` by the `boto3` `describe_training_job` call, or check the S3 output path shown in the \"Checkpoint configuration\" section of the training job detail page" ] }, 
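{ "cell_type": "code", "execution_count": null, "id": "5e8c2b7d", "metadata": {}, "outputs": [], "source": [ "# Illustrative: look up where a given training job writes its checkpoints (CheckpointConfig.S3Uri).\n", "# This assumes the `job` estimator from the training section was fit in this notebook session;\n", "# alternatively, paste a training job name from the SageMaker console as a plain string.\n", "sm_client = boto3.client('sagemaker')\n", "\n", "training_job_name = job.latest_training_job.name\n", "checkpoint_s3_uri = sm_client.describe_training_job(\n", "    TrainingJobName=training_job_name)['CheckpointConfig']['S3Uri']\n", "\n", "print(checkpoint_s3_uri)" ] }, 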
{ "cell_type": "code", "execution_count": null, "id": "fd592105", "metadata": {}, "outputs": [], "source": [ "# you need to wait around 15min until the first checkpoint shows up in Amazon S3\n", "# list the available checkpoints, then copy the one you want (adjust the S3 path and file name if needed,\n", "# e.g. if you use a checkpoint produced by the optional tuning step)\n", "! aws s3 ls s3://$bucket/$token/checkpoints/\n", "! aws s3 cp s3://$bucket/$token/checkpoints/final_model.pth $work_dir" ] }, { "cell_type": "code", "execution_count": null, "id": "98968e3b", "metadata": {}, "outputs": [], "source": [ "model = torch.load(os.path.join(work_dir, 'final_model.pth')) # replace with your model name if different" ] }, { "cell_type": "code", "execution_count": null, "id": "6c14396d", "metadata": {}, "outputs": [], "source": [ "model.eval()" ] }, { "cell_type": "markdown", "id": "c9b4182e", "metadata": {}, "source": [ "### Instantiate the dataset\n", "This is necessary for inference pre-processing (applying the same transforms to input and label as at training time)" ] }, { "cell_type": "code", "execution_count": null, "id": "d51b07ef", "metadata": {}, "outputs": [], "source": [ "height = 604\n", "width = 960\n", "\n", "from a2d2_code.a2d2_utils import A2D2_S3_dataset\n", "\n", "\n", "image_transform = Resize(\n", " (height, width),\n", " interpolation=InterpolationMode.BILINEAR)\n", "\n", "target_transform = Resize(\n", " (height, width),\n", " interpolation=InterpolationMode.NEAREST)\n", "\n", "train_data = A2D2_S3_dataset(\n", " manifest_file=work_dir + '/train_manifest.json',\n", " class_list=data_dir + '/camera_lidar_semantic/class_list.json',\n", " transform=image_transform,\n", " target_transform=target_transform,\n", " cache='/home/ec2-user/',\n", " height=height,\n", " width=width,\n", " s3_bucket=bucket)" ] }, { "cell_type": "markdown", "id": "f140dd57", "metadata": {}, "source": [ "We measure pixel accuracy on a couple of pictures. IoU is also relevant for segmentation; we leave that for a later iteration :)" ] }, { "cell_type": "code", "execution_count": null, "id": "90f94caf", "metadata": {}, "outputs": [], "source": [ "plt.rcParams[\"figure.figsize\"] = [15, 7]\n", "\n", "\n", "def pixel_acc(T1, T2):\n", " return (T1 == T2).sum()/(T1.size()[0]*T1.size()[1])\n", "\n", "\n", "# take the first 10 pictures from the val_manifest\n", "\n", "with open(work_dir + '/val_manifest.json') as file:\n", " val_manifest = json.load(file)\n", "\n", "pic_ids = list(val_manifest.keys())[:10]\n", " \n", " \n", "for pic_id in pic_ids:\n", " \n", " image_path = val_manifest[pic_id]['image_path']\n", " label_path = val_manifest[pic_id]['label_path']\n", " pic = image_transform(read_image(os.path.join(work_dir, image_path)))\n", " label = target_transform(read_image(os.path.join(work_dir, label_path)))\n", " \n", " mask = torch.zeros(height, width)\n", " for rgb, cid in train_data.rgb2ids.items():\n", " color_mask = label == torch.Tensor(rgb).reshape([3,1,1]) \n", " seg_mask = color_mask.sum(dim=0) == 3\n", " mask[seg_mask] = cid \n", " \n", " mask = mask.type(torch.int64) \n", " \n", " pred = model(torch.div(pic, 255).unsqueeze(0).to(\"cuda:0\"))[\"out\"]\n", " flat_pred = torch.argmax(pred, dim=1)[0]\n", " \n", " mask_np = mask.cpu().numpy()\n", " flat_pred_np = flat_pred.cpu().numpy()\n", " \n", " fig, (ax1, ax2) = plt.subplots(1, 2)\n", " fig.suptitle(pic_id)\n", " ax1.imshow(mask_np)\n", " ax2.imshow(flat_pred_np)\n", " \n", " print(\"Image {}: PIXEL ACCURACY: {}\".format(pic_id, pixel_acc(flat_pred.cuda(), mask.cuda())))\n", " \n", " print(\"*\"*20)" ] }, { "cell_type": "code", "execution_count": null, "id": "1e7b65c1", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "conda_pytorch_latest_p37", "language": "python", "name": "conda_pytorch_latest_p37" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 5 }