{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CamVid Dataset Preprocessing\n", "\n", "This notebook preprocesses the [CamVid (Cambridge-driving Labeled Video Database)](http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/)\n", "dataset (made avialable for download on Kaggle at \n", "[https://www.kaggle.com/datasets/carlolepelaars/camvid](https://www.kaggle.com/datasets/carlolepelaars/camvid)).\n", "\n", "As a prerequisite for this notebook, please download the dataset from Kaggle and extract the `archive.zip` file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Module imports and Amazon SageMaker setup\n", "\n", "Here we import modules from the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/)\n", "that we need for preprocessing the data.\n", "\n", "We then create a SageMaker session, set the execution role,\n", "and define the S3 bucket and prefix where data and artifacts are stored." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "%reload_ext autoreload\n", "%autoreload 2\n", "%reload_ext dotenv\n", "%dotenv\n", "\n", "import sagemaker\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", "session = sagemaker.Session()\n", "# By default the bucket name follows the format \"sagemaker-{region}-{aws-account-id}\"\n", "bucket = session.default_bucket()\n", "# You can adapt the prefix to your liking. Just make sure to also update it in all other notebooks as well.\n", "prefix = 'enet-tensorflow-distributed'\n", "role = sagemaker.get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Paths\n", "\n", "Next we set the S3 URIs for the raw data and where the preprocessed data will be stored.\n", "This uses the bucket and prefix that were set in the previous cell.\n", "\n", "The `input_path` points to the raw dataset location in S3.\n", "\n", "🚨 Please upload the raw dataset to this location in S3 before continuing with this notebook. 🚨" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "input_path = f's3://{bucket}/{prefix}/raw-data/CamVid/'\n", "train_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/train/'\n", "train_labels_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/train_labels/'\n", "val_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/val/'\n", "val_labels_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/val_labels/'\n", "test_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/test/'\n", "test_labels_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/test_labels/'\n", "report_path = f's3://{bucket}/{prefix}/preprocessed-data/camvid/report/'\n", "preprocessing_report_path = f'{report_path}preprocessing_report.json'\n", "class_dict_path = f'{report_path}class_dict.json'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define Preprocessing Job\n", "\n", "The data preprocessing is done via the [`SKLearnProcessor`](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) which is defined in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sklearn_processor = SKLearnProcessor(\n", " framework_version='0.23-1',\n", " role=role,\n", " instance_type='ml.m5.large',\n", " instance_count=1, \n", " base_job_name='camvid-preprocessing'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run Processing Job\n", "\n", "The processing job is now run. The processing code is located at [`../datasets/camvid/preprocess.py`](../datasets/camvid/preprocess.py). The processing outputs are stored in S3 at the previously defined locations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sklearn_processor.run(\n", " code='../datasets/camvid/preprocess.py',\n", " inputs=[\n", " ProcessingInput(\n", " input_name='raw_dataset',\n", " source=input_path, \n", " destination='/opt/ml/processing/input',\n", " s3_input_mode='File',\n", " s3_data_distribution_type='ShardedByS3Key'\n", " )\n", " ],\n", " outputs=[\n", " ProcessingOutput(\n", " output_name='train',\n", " source='/opt/ml/processing/output/train',\n", " destination=train_path,\n", " ),\n", " ProcessingOutput(\n", " output_name='train_labels',\n", " source='/opt/ml/processing/output/train_labels',\n", " destination=train_labels_path,\n", " ),\n", " ProcessingOutput(\n", " output_name='val',\n", " source='/opt/ml/processing/output/val',\n", " destination=val_path,\n", " ),\n", " ProcessingOutput(\n", " output_name='val_labels',\n", " source='/opt/ml/processing/output/val_labels',\n", " destination=val_labels_path,\n", " ),\n", " ProcessingOutput(\n", " output_name='test',\n", " source='/opt/ml/processing/output/test',\n", " destination=test_path,\n", " ),\n", " ProcessingOutput(\n", " output_name='test_labels',\n", " source='/opt/ml/processing/output/test_labels',\n", " destination=test_labels_path,\n", " ),\n", " ProcessingOutput(\n", " output_name='report',\n", " source='/opt/ml/processing/report',\n", " destination=report_path,\n", " ),\n", " ]\n", ")" ] } ], "metadata": { "interpreter": { "hash": "23e9edbf101ec1229ee15d5e8950818b02dd17cfc3730b3f1ee235cf2fb9b8d3" }, "kernelspec": { "display_name": "Python 3.9.11 64-bit ('3.9.11')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.11" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }