{ "cells": [ { "cell_type": "markdown", "id": "8a8c2fdd", "metadata": {}, "source": [ "# **Amazon Lookout for Equipment** - Time series annotation with LabelStudio\n", "\n", "*Part 2 - Configuring the labeling task*" ] }, { "cell_type": "markdown", "id": "6e8d545c", "metadata": {}, "source": [ "## Initialization\n", "---" ] }, { "cell_type": "code", "execution_count": 1, "id": "84f821e5", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import json\n", "import matplotlib.pyplot as plt\n", "import os\n", "import pandas as pd\n", "import requests\n", "import subprocess\n", "\n", "from IPython.display import display, Markdown" ] }, { "cell_type": "code", "execution_count": 2, "id": "8c72cf91", "metadata": {}, "outputs": [], "source": [ "prop_cycle = plt.rcParams['axes.prop_cycle']\n", "colors = prop_cycle.by_key()['color']" ] }, { "cell_type": "markdown", "id": "9a6fecbf", "metadata": {}, "source": [ "### Collect parameters\n", "Run the following cell to collect the parameters needed to run the LabelStudio container. These variables were stored in the previous notebook. Run it to store them in the Jupyter environment:" ] }, { "cell_type": "code", "execution_count": 3, "id": "eec0455a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "('GenerateAnAPIToken', 'lookout-equipment', 'eu-west-1')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%store -r token\n", "%store -r notebook_name\n", "%store -r current_region\n", "\n", "token, notebook_name, current_region" ] }, { "cell_type": "markdown", "id": "b1352248", "metadata": {}, "source": [ "## Project creation\n", "---" ] }, { "cell_type": "markdown", "id": "65393ba7", "metadata": {}, "source": [ "### Labeling template configuration\n", "When labeling a dataset, you need to provide a template. LabelStudio will use it to generate the user interface for the labeler.\n", "\n", "To label time series data, we need to now how many different time series there are in the dataset. Let's open it as a first step. The following cell will open a synthetic dataset provided as an example with this repository:" ] }, { "cell_type": "code", "execution_count": 4, "id": "81288dd7", "metadata": {}, "outputs": [], "source": [ "fname = os.path.join('example', 'timeseries.csv')\n", "df = pd.read_csv(fname, nrows=2)\n", "channels = list(df.columns)[1:]\n", "channels_list = ','.join(channels)\n", "channel_fields = '\\n'.join([f'' for index, c in enumerate(channels)])" ] }, { "cell_type": "markdown", "id": "e365228f", "metadata": {}, "source": [ "Let's now build the labeling template. In LabelStudio, labeling templates are defined using an XML file:" ] }, { "cell_type": "code", "execution_count": 5, "id": "e5d6c957", "metadata": {}, "outputs": [], "source": [ "template = f\"\"\"\n", " \n", "\n", " {channel_fields}\n", " \n", " \n", " \n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "2dee9c0a", "metadata": {}, "source": [ "Use the following cell to push the example file to a location on Amazon S3 that this notebook has access to:" ] }, { "cell_type": "code", "execution_count": 6, "id": "125659ab", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "upload: example/timeseries.csv to s3://timeseries-labeling/data/synthetic/timeseries.csv\n" ] } ], "source": [ "# ROLE_ARN = sagemaker.get_execution_role()\n", "BUCKET = 'timeseries-labeling'\n", "PREFIX = 'data/synthetic'\n", "!aws s3 cp $fname s3://$BUCKET/$PREFIX/timeseries.csv" ] }, { "cell_type": "markdown", "id": "1e52bbc9", "metadata": {}, "source": [ "### Project creation\n", "Once running, a LabelStudio instance can be queried and manipulated through a set of API [**documented here**](https://labelstud.io/api). The following cell will create a new labeling project in the currently running LabelStudio instance:" ] }, { "cell_type": "code", "execution_count": 7, "id": "bf4783b2", "metadata": {}, "outputs": [], "source": [ "payload = {\n", " \"title\":\"Synthetic data labeling\",\n", " \"description\":\"Time series labeling job for synthetic data\",\n", " \"label_config\": template,\n", " \"is_published\":True\n", "}\n", "\n", "headers = {\n", " 'Content-Type': 'application/json',\n", " 'Authorization': f'Token {token}'\n", "}\n", "\n", "response = requests.post('http://localhost:8080/api/projects/', headers=headers, data=json.dumps(payload))\n", "project_id = response.json()['id']" ] }, { "cell_type": "markdown", "id": "9e942300", "metadata": {}, "source": [ "### S3 storage configuration\n", "The following part will configure Amazon S3 as a source to provide the time series to label. To configure this, LabelStudio requires temporary credentials to synchronize the data and initialize the tasks. Let's use the last credentials obtained while running this session in our notebook:" ] }, { "cell_type": "code", "execution_count": 8, "id": "62a0deab", "metadata": {}, "outputs": [], "source": [ "current_credentials = boto3.Session().get_credentials().get_frozen_credentials()\n", "\n", "ACCESS_KEY = getattr(current_credentials, 'access_key')\n", "SECRET_KEY = getattr(current_credentials, 'secret_key')\n", "SESSION_TOKEN = getattr(current_credentials, 'token')" ] }, { "cell_type": "markdown", "id": "4c57d56a", "metadata": {}, "source": [ "Using these credentials, we will create a new Amazon S3 data source in our LabelStudio instance:" ] }, { "cell_type": "code", "execution_count": 9, "id": "2dbb83e7", "metadata": {}, "outputs": [], "source": [ "payload = {\n", " \"presign\": True,\n", " \"title\": \"Time series data source\",\n", " \"bucket\": BUCKET,\n", " \"prefix\": PREFIX,\n", " \"regex_filter\": \".*csv\",\n", " \"use_blob_urls\": True,\n", " \"aws_access_key_id\": ACCESS_KEY,\n", " \"aws_secret_access_key\": SECRET_KEY,\n", " \"aws_session_token\": SESSION_TOKEN,\n", " \"region_name\": current_region,\n", " \"recursive_scan\": True,\n", " \"project\": project_id\n", "}\n", "\n", "response = requests.post('http://localhost:8080/api/storages/s3', headers=headers, data=json.dumps(payload))\n", "storage_id = response.json()['id']" ] }, { "cell_type": "markdown", "id": "84d0dc4f", "metadata": {}, "source": [ "Synchronizing allows LabelStudio to search for any `csv` file located under the provided data source:" ] }, { "cell_type": "code", "execution_count": 10, "id": "7a6310be", "metadata": {}, "outputs": [], "source": [ "payload = {\n", " \"project\": project_id\n", "}\n", "\n", "response = requests.post(f'http://localhost:8080/api/storages/s3/{storage_id}/sync', headers=headers, data=json.dumps(payload))\n", "task_id = response.json()['id']" ] }, { "cell_type": "markdown", "id": "f0f684a7", "metadata": {}, "source": [ "## Label your time series\n", "---" ] }, { "cell_type": "code", "execution_count": 11, "id": "70f2f600", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "[**Click here**](https://lookout-equipment.notebook.eu-west-1.sagemaker.aws/proxy/8080/) to open **LabelStudio** in a new tab" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(Markdown(f'[**Click here**](https://{notebook_name}.notebook.{current_region}.sagemaker.aws/proxy/8080/) to open **LabelStudio** in a new tab'))" ] }, { "cell_type": "markdown", "id": "f43480bb", "metadata": {}, "source": [ "When you click on the previous link you will open your LabelStudio instance into a new tab. You will then be given the opportunity to login. When you ran the LabelStudio Docker image in the previous notebook, you also initialized a user by defining a `username` and `password`. Use these credentials to log in:\n", "\n", "\"Login\"" ] }, { "cell_type": "markdown", "id": "aba92924", "metadata": {}, "source": [ "Once logged it, you should already see a project:\n", " \n", "\"Projects\n", "\n", "Click anywhere on this project to bring up the time series to annotate. Each time series dataset will appear as an individual task to label:\n", "\n", "\"Tasks\n", "\n", "Scroll down to the bottom of the time series view on the right and reduce the time period using the overview slider until the time series plot appear. You can then start labeling your data (check out the [**LabelStudio website**](https://labelstud.io/) for more details about the labeling process):\n", "\n", "\"Labeling\n", "\n", "Once you have a few labels done, scroll up and click on the `Submit` button. The annotations are saved in the local database from LabelStudio (you can also configure a target location on Amazon S3)." ] }, { "cell_type": "markdown", "id": "9cd006d6", "metadata": {}, "source": [ "## Collect your annotations\n", "---\n", "Use the following API call to get the labels from your previous labeling job:" ] }, { "cell_type": "code", "execution_count": 13, "id": "4e9e52d3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
startend
02021-04-05 23:40:002021-04-11 00:50:00
\n", "
" ], "text/plain": [ " start end\n", "0 2021-04-05 23:40:00 2021-04-11 00:50:00" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "payload = {\n", " \"id\": task_id\n", "}\n", "\n", "response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/annotations', headers=headers, data=json.dumps(payload))\n", "annotations_df = pd.DataFrame([result['value'] for result in response.json()[0]['result']])[['start', 'end']]\n", "annotations_df" ] }, { "cell_type": "markdown", "id": "3e1ab0e6", "metadata": {}, "source": [ "You can now save this dataframe as a CSV file ready to be used by Lookout for Equipment:" ] }, { "cell_type": "code", "execution_count": 14, "id": "1d5f116b", "metadata": {}, "outputs": [], "source": [ "annotations_df.to_csv('labels.csv', index=None, header=None)" ] }, { "cell_type": "markdown", "id": "7fdc603d", "metadata": {}, "source": [ "## Cleanup\n", "---\n", "If you want to stop LabelStudio, just go back into the first notebook and click on the `Interrupt the kernel` button in this notebooks toolbar.\n", "\n", "If you don't want to keep your labeling projects with your ongoing label work, you can safely delete the `/home/ec2-user/SageMaker/label-studio-data` folder where all the label data is stored.\n", "\n", "**Do not delete** this folder if you want to continue your labeling work later, or isn't done processing the labeling job outputs." ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }