{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d824ab39-fbc9-4076-9442-edca7af5bf3c",
   "metadata": {},
   "source": [
    "# Label your dataset with Amazon SageMaker Ground Truth and SM processing jobs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e47ba6f4-e051-4b10-8cf8-29d55969cc3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import json\n",
    "import numpy\n",
    "import os\n",
    "import sagemaker\n",
    "\n",
    "from sagemaker.sklearn.processing import SKLearnProcessor\n",
    "from sagemaker.processing import ProcessingOutput\n",
    "\n",
    "sm_client = boto3.client('sagemaker')\n",
    "s3_resource = boto3.resource('s3')\n",
    "sm_session = sagemaker.Session()\n",
    "role = sagemaker.get_execution_role()\n",
    "region = boto3.Session().region_name"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fed07009-225d-4291-92ab-4fe6352bb7cd",
   "metadata": {},
   "source": [
    "### Create a labeling job in Amazon SageMaker Ground Truth"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2878ad12-24f2-4c08-bbb9-812713eba973",
   "metadata": {},
   "source": [
    "To create your custom model on YOLOv5 you are going to need to label your custom dataset. To label an object detection dataset you may use Amazon SageMaker Ground Truth."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "132888d6-089f-4703-9a0e-4d7bec664e63",
   "metadata": {},
   "source": [
    "| ⚠️ WARNING: If you have already labeled an object detection dataset with Amazon SageMaker Ground Truth you can skip to the \"**Get Job Details**\" |\n",
    "| -- |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0334df93-d28e-47e7-8695-bf86354a4008",
   "metadata": {},
   "source": [
    "#### Create a Labeling Workforce\n",
    "\n",
    "Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html#create-workforce-labeling-job\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "873f1915-19a6-49ba-be92-a022c84738dc",
   "metadata": {},
   "source": [
    "#### Create your bounding box labeling job\n",
    "\n",
    "Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-create-labeling-job-console.html\n",
    "\n",
    "If using the AWS Console, you should create a labeling job with the following options:\n",
    "\n",
    "1. Job name: Set any unique name for the job name, for example \"Object-Detection-Example\".\n",
    "2. Leave the \"I want to specify a label attribute...\" option un-checked.\n",
    "3. Input data setup: Pick \"Automated data setup\".\n",
    "4. Input dataset location: Copy and paste the location of the single folder with your images in S3. Example: \"s3://mybucket/raw_images\".\n",
    "5. Output dataset location: Choose \"Same location as input dataset\".\n",
    "6. Data type: Choose \"Image\".\n",
    "7. IAM Role: Create a new role and give access to the S3 bucket where your images are located, or any S3 bucket.\n",
    "8. Now hit \"Complete data setup\" and wait for it to be ready.\n",
    "9. Task category: Choose \"Image\" and select \"Bounding box\", then hit \"Next\".\n",
    "10. Worker types: Select \"Private\" and choose your team for the \"Private teams\" option.\n",
    "11. For the Bounding box labeling tool: Enter a description and instructions, and for the \"Labels\" section add the relevant labels for your job. \n",
    "12. Finally choose \"Create\"."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4efa20cb-9e35-4a6b-be8d-8c0d3a910ad6",
   "metadata": {},
   "source": [
    "### Get Job Details and Labels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a174f40-80f4-46ac-b375-ca34324e9c1a",
   "metadata": {},
   "source": [
    "Once you have finished labeling your images, let's retrieve the information we need to create our processing job which will create the dataset in the format YOLOv5 expects"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ce05191-bec6-473f-86d4-0824ca4925e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "groundtruth_job_name = \"Object-Detection-Example\" ### <-- Replace with the name you used for your labeling job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a489d7f7-59da-456d-9a4e-d092527f5a9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "response = sm_client.describe_labeling_job(\n",
    "    LabelingJobName=groundtruth_job_name\n",
    ")\n",
    "\n",
    "labelingJobStatus = response[\"LabelingJobStatus\"]\n",
    "labelsListUri = response[\"LabelCategoryConfigS3Uri\"]\n",
    "\n",
    "print(\"Job Status: \",labelingJobStatus)\n",
    "print(\"Labels Uri: \", labelsListUri)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a42cd1a5-1e75-43d0-801d-993746eba937",
   "metadata": {},
   "source": [
    "### Get labels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "158d4393-0101-4b8c-b088-e32ee4fe5621",
   "metadata": {},
   "source": [
    "We need to retrieve the labels from the training job which are located in S3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e28cd619-5163-456d-b1e7-16320602f6f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_s3_path(s3_path):\n",
    "    path_parts=s3_path.replace(\"s3://\",\"\").split(\"/\")\n",
    "    bucket=path_parts.pop(0)\n",
    "    key=\"/\".join(path_parts)\n",
    "    return bucket, key\n",
    "\n",
    "def get_labels_list(labels_uri):\n",
    "    labels = []\n",
    "    bucket, key = split_s3_path(labels_uri)\n",
    "    s3_resource.meta.client.download_file(bucket, key, 'labels.json')\n",
    "    with open('labels.json') as f:\n",
    "        data = json.load(f)\n",
    "    for label in data[\"labels\"]:\n",
    "        labels.append(label[\"label\"])\n",
    "    return labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99536d91-042a-4f55-846c-b1c47060552e",
   "metadata": {},
   "outputs": [],
   "source": [
    "labels = get_labels_list(labelsListUri)\n",
    "print(\"Labels: \",labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "001766d4-67de-4c91-a1b8-249dd033814e",
   "metadata": {},
   "source": [
    "### Create a SageMaker Processing Job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf2d5fc5-e088-43bd-9296-77216c13078a",
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn_processor = SKLearnProcessor(\n",
    "    framework_version=\"1.0-1\",\n",
    "    instance_type=\"ml.c5.xlarge\",\n",
    "    env={'gt_job_name': groundtruth_job_name,\n",
    "        'region': region},\n",
    "    instance_count=1,\n",
    "    base_job_name=\"yolov5-process\",\n",
    "    role=role,\n",
    "    sagemaker_session = sm_session\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df47018a-2205-4985-bb3a-767c8d06a72e",
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn_processor.run(\n",
    "    outputs=[\n",
    "        ProcessingOutput(output_name=\"train\", source=\"/opt/ml/processing/output/train\")\n",
    "    ],\n",
    "    code=\"code/preprocessing.py\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6794675c-34c7-4318-b858-facc6bd8d05d",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_s3_uri = sklearn_processor.jobs[-1].describe()[\"ProcessingOutputConfig\"][\"Outputs\"][0][\"S3Output\"][\"S3Uri\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e97a7784-f95a-4f8d-874b-ba232537cb2f",
   "metadata": {},
   "source": [
    "| ⚠️ WARNING: These are the details you will need to train your models based on the labeling job you completed. |\n",
    "| -- |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f31131a-d18b-418d-9204-fb7eda2a3553",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Dataset S3 location: \", dataset_s3_uri)\n",
    "print(\"Labels: \", labels)"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}