{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic SageMaker Processing Script\n", "\n", "This notebook corresponds to the section \"Preprocessing Data With The Built-In PyTorch Container\" in [this](https://aws.amazon.com/blogs/aws/amazon-sagemaker-processing-fully-managed-data-processing-and-model-evaluation/) blog post. ** BLOGPOST URL TO BE UPDATED ** \n", "\n", "It shows a very basic example of using SageMaker Processing to create train, test and validation datasets. SageMaker Processing creates these datasets, which are then written back to S3.\n", "\n", "In a nutshell, we will create a PyTorchProcessor object, passing the PyTorch version we want to use as well as our managed infrastructure requirements." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "In this notebook, we will use a dataset manifest to download images from the [COCO dataset](https://cocodataset.org/) for all animal classes. \n", "In order to simulate coming to SageMaker with your own dataset, we will use SageMaker Processing to structure and pre-process the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The dataset \n", "The COCO dataset contains images from Flickr that represent a real-world dataset which isn’t formatted or resized specifically for deep learning. This makes it a good dataset for this guide, because we want the guide to be as comprehensive as possible. \n", "\n", "### Downloading annotations \n", "The dataset annotation file contains information about each image in the dataset, such as the class, superclass, file name and URL to download the file. The annotations alone for the COCO dataset are about 242 MB." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "\n", "anno_url = \"http://images.cocodataset.org/annotations/annotations_trainval2017.zip\"\n", "urllib.request.urlretrieve(anno_url, \"coco-annotations.zip\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the data (for training your model)\n", "This section describes the work done in the separate script that will be executed in a container by SageMaker Processing.\n", "\n", "### Sample the data\n", "\n", "First, we will limit the scope of the dataset for the sake of this example by only using the images of animals in the COCO dataset.\n", "\n", "For the train and validation sets, the data we need for the image labels and the file paths are under different headings in the annotations. We have to extract each and combine them into a single annotation in subsequent steps.\n", "\n", "In order to make working with the data easier, we’ll select 250 images from each class at random. To make sure you get the same set of images for each run of this notebook, we’ll also set NumPy’s random seed to 0. This is a small fraction of the dataset, but it demonstrates how transfer learning can give you good results without needing very large datasets." ] },
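{ "cell_type": "markdown", "metadata": {}, "source": [ "As an illustration of the sampling step described above, here is a minimal sketch of drawing 250 images per class with a fixed NumPy seed. The names used here (`annotations`, a `category` key) are hypothetical placeholders; the actual implementation lives in `scripts/preprocessing.py`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Illustrative sketch only -- the real sampling logic lives in scripts/preprocessing.py.\n", "# `annotations` is assumed to be a list of dicts that each carry a \"category\" key.\n", "np.random.seed(0)  # fixed seed so every run samples the same images\n", "\n", "\n", "def sample_per_class(annotations, images_per_class=250):\n", "    \"\"\"Randomly pick up to images_per_class annotations for each category.\"\"\"\n", "    by_class = {}\n", "    for anno in annotations:\n", "        by_class.setdefault(anno[\"category\"], []).append(anno)\n", "    sampled = []\n", "    for items in by_class.values():\n", "        idx = np.random.choice(len(items), size=min(images_per_class, len(items)), replace=False)\n", "        sampled.extend(items[i] for i in idx)\n", "    return sampled" ] },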
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Proper folder structure\n", "\n", "Although most tools can accommodate data in any file structure with enough tinkering, it makes the most sense to use the sensible defaults that frameworks like MXNet, TensorFlow and PyTorch all share, to make data ingestion as smooth as possible. \n", "By default, most tools will look for image data in the file structure depicted below:" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "+-- train\n", "| +-- class_A\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- class_B\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "|\n", "+-- val\n", "| +-- class_A\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- class_B\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "|\n", "+-- test\n", "| +-- class_A\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- class_B\n", "| +-- filename.jpg\n", "| +-- filename.jpg\n", "| +-- filename.jpg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will notice that the COCO dataset does not come structured like this, so we must use the annotation data to restructure the folders of the COCO dataset to match the pattern above. Once the new directory structure is created, you can use your preferred framework’s data loading tool to gracefully load and define transformations for your image data. Many datasets may already be in this structure, in which case you can skip this guide.\n", "\n", "### Load annotation category labels\n", "Here we leverage the ___sample_annos___ and ___category_labels___ generated above.\n", "\n", "### Make train, validation and test splits \n", "We should divide our data into train, validation and test splits. A typical split ratio is 80/10/10. Our image classification algorithm will train on the first 80% (training), evaluate its performance at each epoch with the next 10% (validation), and we’ll report our model’s final accuracy using the last 10% (test). It’s important to shuffle the data randomly before splitting so that the class distribution among the splits is roughly proportional.\n", "\n", "### Make new folder structure and copy image files \n", "This new folder structure can then be read by the data loaders of SageMaker’s built-in algorithms, TensorFlow or PyTorch for easy loading of the image data into your framework of choice.\n", "\n", "## Define the Resize and Augmentation Transformations\n", "\n", "### Resizing the images\n", "Before going to the GPU for training, all image data must have the same dimensions for length, width and channel. Typically, algorithms use a square format, so the length and width are the same, and many pre-made datasets already have the images nicely cropped into squares. However, most real-world datasets will begin with images in many different dimensions and ratios. In order to prep our dataset for training, we will need to resize and crop the images if they aren’t already square.\n", "\n", "This transformation is deceptively simple. If we want to keep the images from looking squished or stretched, we need to crop them to squares, and we want to make sure the important object in the image doesn’t get cropped out. Unfortunately, there is no easy way to make sure each crop is optimal, so we typically choose a center crop, which works well most of the time.\n", "\n", "### Augmentation\n", "An easy way to improve training is to randomly augment the images to help our training algorithm generalize better. \n", "\n", "There are many augmentations to choose from, but keep in mind that the more we add to our augmentation function, the more processing will be required before we can send the image to the GPU for training. \n", "Also, it’s important to note that we don’t need to augment the validation data, because we want to generate a prediction on the image as it would normally be presented." ] },
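{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of what the resize and augmentation pipelines described above could look like with `torchvision.transforms`. The 224x224 target size and the specific augmentations are illustrative assumptions, not values taken from `scripts/preprocessing.py`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from torchvision import transforms\n", "\n", "# Sketch of the two pipelines discussed above; sizes and augmentations are example choices.\n", "train_transform = transforms.Compose([\n", "    transforms.RandomResizedCrop(224),  # resize and crop to a square\n", "    transforms.RandomHorizontalFlip(),  # simple augmentation to help generalization\n", "    transforms.ToTensor(),\n", "])\n", "\n", "# Validation/test images are only resized and center-cropped, never augmented.\n", "val_transform = transforms.Compose([\n", "    transforms.Resize(256),\n", "    transforms.CenterCrop(224),\n", "    transforms.ToTensor(),\n", "])" ] },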
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Create the PyTorch datasets and dataloaders\n", "\n", "### Datasets \n", "Datasets in PyTorch keep track of all the data in your dataset: where to find each item (its path), what class it belongs to and what transformations it gets. In this case, we’ll use PyTorch’s handy ImageFolder to easily generate the dataset from the directory structure created in the previous steps.\n", "\n", "### Dataloaders\n", "Dataloaders structure how the images get sent to the CPU and GPU for training. They include important hyperparameters such as: \n", "* batch_size: this tells the data loader how many images to send to the training algorithm at once for backpropagation. It therefore also controls the number of gradient updates that occur in one epoch for optimizers like SGD. \n", "* shuffle: this randomizes the order of your training data \n", "* num_workers: this defines how many parallel processes load and transform images before they are sent to the GPU for training. \n", "Adding more workers can therefore speed up training. However, too many workers will slow training down due to the overhead of trying to manage all the workers. Also, each worker will consume a considerable amount of RAM (depending on batch_size), and you cannot have more workers than CPU cores available on the EC2 instance used for training.\n", "\n" ] },
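{ "cell_type": "markdown", "metadata": {}, "source": [ "As a rough sketch, the datasets and dataloaders could be built from the folder structure above as follows. It reuses the `train_transform` and `val_transform` sketched earlier; the local paths and hyperparameter values are placeholders, not values taken from the processing script." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import torch\n", "from torchvision import datasets\n", "\n", "# Placeholder local paths pointing at the train/val folder structure described above.\n", "train_dataset = datasets.ImageFolder(\"data_structured/train\", transform=train_transform)\n", "val_dataset = datasets.ImageFolder(\"data_structured/val\", transform=val_transform)\n", "\n", "# batch_size, shuffle and num_workers are the hyperparameters discussed above.\n", "train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)\n", "val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)" ] },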
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Create the script containing your logic to use with Processing\n", "This script is executed by Amazon SageMaker Processing. In the `main` function, it does the core of the operations: it reads and parses the arguments passed as parameters, unpacks the annotations archive, samples the annotations, downloads the corresponding images, and organizes them into the train, validation and test folder structure described above. Remember to write the data locally in the final step so that SageMaker can copy it to S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pygmentize scripts/preprocessing.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the SageMaker Processor " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.pytorch.processing import PyTorchProcessor\n", "\n", "region = boto3.session.Session().region_name\n", "\n", "role = get_execution_role()\n", "pytorch_processor = PyTorchProcessor(\n", " framework_version=\"1.8\", role=role, instance_type=\"ml.m5.xlarge\", instance_count=1\n", "# framework_version=\"1.8\", role=role, instance_type=\"ml.g4dn.xlarge\", instance_count=1 \n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now Execute the Processing Job\n", "Before executing the script, SageMaker Processing automatically uploads the input data to S3 and copies it into the specified container directory, from where the script can access it.\n", "At the end of the script execution, SageMaker Processing automatically uploads the output data to S3 on your behalf." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", "pytorch_processor.run(\n", " code=\"preprocessing.py\",\n", " source_dir=\"scripts\",\n", " arguments=['Debug', 'Not used'],\n", " inputs=[ProcessingInput(source=\"coco-annotations.zip\", destination=\"/opt/ml/processing/input\")],\n", " outputs=[\n", " ProcessingOutput(source=\"/opt/ml/processing/tmp/data_structured\", output_name=\"data_structured\"),\n", " ProcessingOutput(source=\"/opt/ml/processing/output/train\", output_name=\"train\"),\n", " ProcessingOutput(source=\"/opt/ml/processing/output/val\", output_name=\"validation\"),\n", " ProcessingOutput(source=\"/opt/ml/processing/output/test\", output_name=\"test\"),\n", " ProcessingOutput(source=\"/opt/ml/processing/logs\", output_name=\"logs\"),\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now check the results of our processing job and list its outputs in S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "output_path = pytorch_processor.latest_job.outputs[1].destination\n", "\n", "!aws s3 ls --recursive $output_path" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }