{ "cells": [ { "cell_type": "markdown", "id": "4379a998-cad7-40fb-99bb-5a2948f7009d", "metadata": {}, "source": [ "# Label your dataset with Amazon SageMaker Ground Truth" ] }, { "cell_type": "code", "execution_count": 12, "id": "f47a5f73-c8c5-4ce0-bb5d-b996066e47c4", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import json\n", "import numpy\n", "import os\n", "import sagemaker\n", "\n", "from sklearn.model_selection import train_test_split\n", "\n", "sm_client = boto3.client('sagemaker')\n", "s3_resource = boto3.resource('s3')\n", "sm_session = sagemaker.Session()" ] }, { "cell_type": "markdown", "id": "792c33f6-78f7-42da-affe-b34bb34845e0", "metadata": {}, "source": [ "### Create a labeling job in Amazon SageMaker Ground Truth" ] }, { "cell_type": "markdown", "id": "7892e313-d4da-4487-a3b7-c7f28a1a0935", "metadata": {}, "source": [ "To create your custom model on YOLOv5 you are going to need to label your custom dataset. To label an object detection dataset you may use Amazon SageMaker Ground Truth." ] }, { "cell_type": "markdown", "id": "bbdda986-63f2-4514-b6eb-57f935771daa", "metadata": {}, "source": [ "| ⚠️ WARNING: If you have already labeled an object detection dataset with Amazon SageMaker Ground Truth you can skip to the \"**Get Job Details**\" |\n", "| -- |" ] }, { "cell_type": "markdown", "id": "d5523582-1a38-47cc-85d4-9c450ccedb87", "metadata": { "tags": [] }, "source": [ "#### Create a Labeling Workforce\n", "\n", "Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-create-private-console.html#create-workforce-labeling-job\n" ] }, { "cell_type": "markdown", "id": "94e7cd63-e965-44d6-9f5b-0be8e7b9e8ad", "metadata": { "tags": [] }, "source": [ "#### Create your bounding box labeling job\n", "\n", "Follow the steps in the SageMaker Ground Truth documentation here: https://docs.aws.amazon.com/sagemaker/latest/dg/sms-create-labeling-job-console.html\n", "\n", "If using the AWS Console, you should create a labeling job with the following options:\n", "\n", "1. Job name: Set any unique name for the job name, for example \"Object-Detection-Example\".\n", "2. Leave the \"I want to specify a label attribute...\" option un-checked.\n", "3. Input data setup: Pick \"Automated data setup\".\n", "4. Input dataset location: Copy and paste the location of the single folder with your images in S3. Example: \"s3://mybucket/raw_images\".\n", "5. Output dataset location: Choose \"Same location as input dataset\".\n", "6. Data type: Choose \"Image\".\n", "7. IAM Role: Create a new role and give access to the S3 bucket where your images are located, or any S3 bucket.\n", "8. Now hit \"Complete data setup\" and wait for it to be ready.\n", "9. Task category: Choose \"Image\" and select \"Bounding box\", then hit \"Next\".\n", "10. Worker types: Select \"Private\" and choose your team for the \"Private teams\" option.\n", "11. For the Bounding box labeling tool: Enter a description and instructions, and for the \"Labels\" section add the relevant labels for your job. \n", "12. Finally choose \"Create\"." 
] }, { "cell_type": "markdown", "id": "46aa584d-64c4-4e67-a576-59921a6c0f75", "metadata": {}, "source": [ "### Get Job Details" ] },
{ "cell_type": "markdown", "id": "a1703a57-9b6a-47b0-96e2-74eba601142a", "metadata": {}, "source": [ "Once you have finished labeling your images, let's retrieve the information we need to create our dataset in the format YOLOv5 expects." ] },
{ "cell_type": "code", "execution_count": 10, "id": "807670d9-3cee-47d5-9f30-0b74735f3367", "metadata": {}, "outputs": [], "source": [ "groundtruth_job_name = \"Object-Detection-Example\" ### <-- Replace with the name you used for your labeling job" ] },
{ "cell_type": "code", "execution_count": 11, "id": "810be406-7cc7-4098-b2c7-80eecd167523", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Job Status: Completed\n", "Manifest Uri: s3://buzecd-aiml-demos/ground-truth-tests/image-bounding-box/Output/Object-Detection-Example/manifests/output/output.manifest\n", "Labels Uri: s3://buzecd-aiml-demos/ground-truth-tests/image-bounding-box/Output/Object-Detection-Example/annotation-tool/data.json\n" ] } ], "source": [ "response = sm_client.describe_labeling_job(\n", "    LabelingJobName=groundtruth_job_name\n", ")\n", "\n", "labelingJobStatus = response[\"LabelingJobStatus\"]\n", "manifestUri = response[\"LabelingJobOutput\"][\"OutputDatasetS3Uri\"]\n", "labelsListUri = response[\"LabelCategoryConfigS3Uri\"]\n", "\n", "print(\"Job Status: \", labelingJobStatus)\n", "print(\"Manifest Uri: \", manifestUri)\n", "print(\"Labels Uri: \", labelsListUri)" ] },
{ "cell_type": "markdown", "id": "249a30a9-32ec-4100-87f9-d0029ecea96e", "metadata": {}, "source": [ "### Get labels" ] },
{ "cell_type": "markdown", "id": "3a93a9c8-4528-46f2-8b7a-85c7bb089134", "metadata": {}, "source": [ "We need to retrieve the labels from the labeling job, which are stored in S3."
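,
"\n",
"Ground Truth writes the label categories to a small JSON configuration file. As a reference, the next cell sketches the typical shape of that file (the label names shown are just an example); the helper functions that follow download the file and collect the value of every `label` entry." ] },
{ "cell_type": "code", "execution_count": null, "id": "5a6b7c8d-1111-4222-8333-944455556666", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: approximate shape of the label category configuration file (data.json)\n", "# written by SageMaker Ground Truth. Your label names (and the version string) may differ.\n", "example_label_config = {\n", "    'document-version': '2018-11-28',\n", "    'labels': [\n", "        {'label': 'Dog'},\n", "        {'label': 'Cat'},\n", "    ],\n", "}"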
] }, { "cell_type": "code", "execution_count": 6, "id": "acf2cf92-96b0-4707-9799-8aafe5dca589", "metadata": {}, "outputs": [], "source": [ "def split_s3_path(s3_path):\n", "    # Split an s3://bucket/key URI into its bucket and key parts\n", "    path_parts = s3_path.replace(\"s3://\", \"\").split(\"/\")\n", "    bucket = path_parts.pop(0)\n", "    key = \"/\".join(path_parts)\n", "    return bucket, key\n", "\n", "def get_labels_list(labels_uri):\n", "    # Download the label category configuration file and return the label names in order\n", "    labels = []\n", "    bucket, key = split_s3_path(labels_uri)\n", "    s3_resource.meta.client.download_file(bucket, key, 'labels.json')\n", "    with open('labels.json') as f:\n", "        data = json.load(f)\n", "    for label in data[\"labels\"]:\n", "        labels.append(label[\"label\"])\n", "    return labels" ] },
{ "cell_type": "code", "execution_count": 7, "id": "eacbf363-06fe-460f-b559-be889ac59078", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Labels: ['Dog', 'Cat']\n" ] } ], "source": [ "labels = get_labels_list(labelsListUri)\n", "print(\"Labels: \", labels)" ] },
{ "cell_type": "markdown", "id": "c40ef3ce-40d2-40cf-83c3-7412f37503ba", "metadata": {}, "source": [ "### Get manifest" ] },
{ "cell_type": "markdown", "id": "1f35b0b3-edbe-4e11-b8d7-1c6970581020", "metadata": {}, "source": [ "We need to retrieve the labeled output manifest file from the labeling job, which is stored in S3." ] },
{ "cell_type": "code", "execution_count": 8, "id": "eb5c5536-cbf7-4f3d-9e75-4d423f05a5b4", "metadata": {}, "outputs": [], "source": [ "def get_manifest_file(manifest_uri):\n", "    # Download the output manifest produced by the labeling job to the local directory\n", "    bucket, key = split_s3_path(manifest_uri)\n", "    s3_resource.meta.client.download_file(bucket, key, 'output.manifest')\n", "    return \"output.manifest\"" ] },
{ "cell_type": "code", "execution_count": 9, "id": "285684f2-26b9-4d03-8e62-76312459055c", "metadata": {}, "outputs": [], "source": [ "manifest = get_manifest_file(manifestUri)" ] },
{ "cell_type": "markdown", "id": "6f337cb0-f48d-4395-8928-0ddcf8a3ec8b", "metadata": {}, "source": [ "### Split manifest into training and validation" ] },
{ "cell_type": "markdown", "id": "cb8e4752-6c18-4fc4-986f-6e63f4532875", "metadata": {}, "source": [ "Now that we have our manifest, let's split our data into training and validation sets." ] },
{ "cell_type": "code", "execution_count": null, "id": "860b8bc9-0e41-4d49-ac49-e3d223704170", "metadata": {}, "outputs": [], "source": [ "with open(manifest) as file:\n", "    lines = file.readlines()\n", "    data = numpy.array(lines)\n", "    train_data, validation_data = train_test_split(data, test_size=0.2)\n", "\n", "print(\"The manifest contains {} labeled images.\".format(len(data)))\n", "print(\"{} will be used for training.\".format(len(train_data)))\n", "print(\"{} will be used for validation.\".format(len(validation_data)))" ] },
{ "cell_type": "markdown", "id": "50a0779c-6aa5-4037-8790-7653f90fe4de", "metadata": {}, "source": [ "### Create YOLOv5 Training and Validation datasets" ] },
{ "cell_type": "markdown", "id": "89948cf6-8e54-4ecb-8b4a-3d6c29dd19ae", "metadata": {}, "source": [ "Let's download the images and create the annotation files in the format YOLOv5 expects." ] },
{ "cell_type": "code", "execution_count": null, "id": "a63680b8-e8c9-4fbd-8944-403fb3479fd9", "metadata": {}, "outputs": [], "source": [ "dirs = [\"dataset/images/train\",\n", "        \"dataset/labels/train\",\n", "        \"dataset/images/validation\",\n", "        \"dataset/labels/validation\"]\n", "\n", "for directory in dirs:\n", "    !mkdir -p {directory}" ] },
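{ "cell_type": "markdown", "id": "7d8e9f0a-2222-4333-8444-a55566667777", "metadata": {}, "source": [ "Each line of the output manifest is a JSON object in the Ground Truth augmented manifest format, while YOLOv5 expects one `.txt` file per image containing one `class_id center_x center_y width height` row per box, with all coordinates normalized by the image size. The next cell is an illustrative sketch (all values are made up) of roughly what one manifest line looks like for a bounding box job; the conversion function below relies on its `source-ref`, `image_size` and `annotations` fields." ] },
{ "cell_type": "code", "execution_count": null, "id": "8e9f0a1b-3333-4444-8555-b66677778888", "metadata": {}, "outputs": [], "source": [ "# Illustrative only: rough shape of one line in output.manifest for a bounding box job.\n", "# The boxes live under an attribute named after the labeling job, and the companion\n", "# '<job-name>-metadata' attribute maps class ids back to label names.\n", "example_manifest_line = {\n", "    'source-ref': 's3://mybucket/raw_images/dog_001.jpg',\n", "    'Object-Detection-Example': {\n", "        'image_size': [{'width': 640, 'height': 480, 'depth': 3}],\n", "        'annotations': [\n", "            {'class_id': 0, 'left': 120, 'top': 80, 'width': 200, 'height': 150}\n", "        ]\n", "    },\n", "    'Object-Detection-Example-metadata': {\n", "        'class-map': {'0': 'Dog'},\n", "        'type': 'groundtruth/object-detection'\n", "    }\n", "}" ] },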
{ "cell_type": "code", "execution_count": null, "id": "05595658-d4fc-4fb2-8143-39167fb600d2", "metadata": {}, "outputs": [], "source": [ "def ground_truth_to_yolo(dataset, dataset_category):\n", "    print(\"Downloading images and creating labels for the {} dataset\".format(dataset_category))\n", "    for line in dataset:\n", "        line = json.loads(line)\n", "\n", "        # Build the local paths for the image and its YOLO label file\n", "        object_s3_uri = line[\"source-ref\"]\n", "        bucket, key = split_s3_path(object_s3_uri)\n", "        image_filename = object_s3_uri.split(\"/\")[-1]\n", "        txt_filename = '.'.join(image_filename.split(\".\")[:-1]) + \".txt\"\n", "        txt_path = \"dataset/labels/{}/{}\".format(dataset_category, txt_filename)\n", "\n", "        # Download the image from S3\n", "        s3_resource.meta.client.download_file(bucket, key, \"dataset/images/{}/{}\".format(dataset_category, image_filename))\n", "\n", "        # Write one 'class_id center_x center_y width height' row per box, normalized to [0, 1]\n", "        with open(txt_path, 'w') as target:\n", "            for annotation in line[groundtruth_job_name][\"annotations\"]:\n", "                class_id = annotation[\"class_id\"]\n", "                center_x = (annotation[\"left\"] + (annotation[\"width\"]/2)) / line[groundtruth_job_name][\"image_size\"][0][\"width\"]\n", "                center_y = (annotation[\"top\"] + (annotation[\"height\"]/2)) / line[groundtruth_job_name][\"image_size\"][0][\"height\"]\n", "                w = annotation[\"width\"] / line[groundtruth_job_name][\"image_size\"][0][\"width\"]\n", "                h = annotation[\"height\"] / line[groundtruth_job_name][\"image_size\"][0][\"height\"]\n", "                data = \"{} {} {} {} {}\\n\".format(class_id, center_x, center_y, w, h)\n", "                target.write(data)" ] },
{ "cell_type": "code", "execution_count": null, "id": "b802e890-1622-43a4-86e2-e7276a816382", "metadata": {}, "outputs": [], "source": [ "ground_truth_to_yolo(train_data, \"train\")\n", "ground_truth_to_yolo(validation_data, \"validation\")" ] },
{ "cell_type": "markdown", "id": "1a247340-635a-4fa1-b5c1-d884ea204091", "metadata": {}, "source": [ "### Validate the number of downloaded files" ] },
{ "cell_type": "code", "execution_count": null, "id": "409f2780-b839-4906-8426-d0bcd6acd2ff", "metadata": {}, "outputs": [], "source": [ "def count_files(dirs):\n", "    for directory in dirs:\n", "        number = len([1 for x in list(os.scandir(directory)) if x.is_file()])\n", "        print(\"There are {} elements in {}\".format(number, directory))\n", "\n", "count_files(dirs)" ] },
{ "cell_type": "code", "execution_count": null, "id": "06c8037a-ac58-4b96-add5-34ba671a3166", "metadata": {}, "outputs": [], "source": [ "# TODO: Show images with bounding boxes" ] },
{ "cell_type": "markdown", "id": "7257e0ed-27a5-4443-9628-d291bacf9aa3", "metadata": {}, "source": [ "### Upload the labeled dataset to S3" ] },
{ "cell_type": "markdown", "id": "fdd6a607-786f-4098-9162-c532e94fe6a3", "metadata": {}, "source": [ "Let's upload our dataset to S3; it will be used for the training job." ] },
{ "cell_type": "code", "execution_count": null, "id": "4a435d86-44be-4908-aef7-0f80a15d3834", "metadata": {}, "outputs": [], "source": [ "bucket = sm_session.default_bucket()\n", "# bucket = \"\"  # Uncomment and set this if you want to use a specific S3 bucket\n", "dataset_s3_uri = sm_session.upload_data(\"dataset\", bucket, \"yolov5dataset\")\n", "print(\"Dataset located in: \", dataset_s3_uri)" ] },
{ "cell_type": "markdown", "id": "6548b938-525f-450d-9382-9ce9595fcaa4", "metadata": {}, "source": [ "You have labeled your own custom dataset with Amazon SageMaker Ground Truth and split it into training and validation datasets in the format YOLOv5 expects. 
In the next modules you will use this dataset to train and deploy a custom YOLOv5 model." ] },
{ "cell_type": "markdown", "id": "944291de-7144-4585-8662-b0592ea2c98a", "metadata": {}, "source": [ "| ⚠️ WARNING: These are the details from the labeling job you completed; you will need them to train your models. |\n", "| -- |" ] },
{ "cell_type": "code", "execution_count": null, "id": "e7730b90-2fd6-4cdf-ba90-551cc28424bd", "metadata": {}, "outputs": [], "source": [ "print(\"Dataset S3 location: \", dataset_s3_uri)\n", "print(\"Labels: \", labels)" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:470317259841:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 5 }