# **Amazon Lookout for Equipment** - Time series annotation with LabelStudio

*Part 2 - Configuring the labeling task*

## Initialization
---

In [None]:
import boto3
import json
import matplotlib.pyplot as plt
import os
import pandas as pd
import requests
import sagemaker

from IPython.display import display, Markdown

In [None]:
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = prop_cycle.by_key()['color']

### Collect parameters
Run the following cell to collect the parameters needed to run the LabelStudio container. These variables were stored by the previous notebook, and the `%store -r` magic restores them into the current Jupyter session:

In [None]:
%store -r token
%store -r notebook_name
%store -r current_region

token, notebook_name, current_region

## Project creation
---

### Labeling template configuration
When labeling a dataset, you need to provide a template. LabelStudio will use it to generate the user interface for the labeler.

To label time series data, we need to know how many different time series there are in the dataset, so let's open it as a first step. The following cell reads the first rows of a synthetic dataset provided as an example with this repository and extracts the channel names from its columns (the first column holds the timestamps):

In [None]:
fname = os.path.join('example', 'timeseries.csv')
df = pd.read_csv(fname, nrows=2)
channels = list(df.columns)[1:]
channels_list = ','.join(channels)
channel_fields = '\n'.join([f'<Channel column="{c}" legend="{c}" strokeColor="{colors[index % len(colors)]}" displayFormat=",.1f" />' for index, c in enumerate(channels)])
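To make the generated markup concrete, here is a minimal sketch that applies the same pattern to two hypothetical channel names (`signal_1`, `signal_2`) and a hypothetical two-entry color list, instead of reading the example file and the matplotlib color cycle:

```python
# Hypothetical channel names and colors, standing in for the CSV columns
# and the matplotlib color cycle used in the cell above.
channels = ['signal_1', 'signal_2']
colors = ['#1f77b4', '#ff7f0e']

# One <Channel> tag per time series; colors wrap around if there are
# more channels than colors.
channel_fields = '\n'.join(
    f'<Channel column="{c}" legend="{c}" '
    f'strokeColor="{colors[index % len(colors)]}" displayFormat=",.1f" />'
    for index, c in enumerate(channels)
)

print(channel_fields)
```

Each printed line is one `<Channel>` tag that will be placed inside the labeling template.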

Let's now build the labeling template. In LabelStudio, labeling templates are defined using an XML file:

In [None]:
template = f"""<View>
    <TimeSeries name="ts" valueType="url" value="$csv"
                sep=","
                timeColumn="{df.columns[0]}"
                timeFormat="%Y-%m-%d %H:%M:%S"
                timeDisplayFormat="%Y-%m-%d %H:%M:%S"
                overviewChannels="{channels_list}">

        {channel_fields}
    </TimeSeries>
    <TimeSeriesLabels name="label" toName="ts">
        <Label value="Anomaly" background="red" />
    </TimeSeriesLabels>
</View>"""
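Before sending a template to LabelStudio, it can help to check that it is well-formed XML. The sketch below uses Python's standard `xml.etree.ElementTree` module on a shortened, hypothetical template with the same structure as the one above. The helper name `check_template` and the `Timestamp` column are illustrative assumptions:

```python
import xml.etree.ElementTree as ET

def check_template(xml_text):
    """Parse a labeling template and return the columns of its <Channel> tags.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    return [channel.get('column') for channel in root.iter('Channel')]

# Shortened, hypothetical template with the same structure as the real one
sample_template = """<View>
    <TimeSeries name="ts" valueType="url" value="$csv" timeColumn="Timestamp">
        <Channel column="signal_1" legend="signal_1" />
        <Channel column="signal_2" legend="signal_2" />
    </TimeSeries>
    <TimeSeriesLabels name="label" toName="ts">
        <Label value="Anomaly" background="red" />
    </TimeSeriesLabels>
</View>"""

print(check_template(sample_template))
```

In the notebook, you could call `check_template(template)` on the generated template. Column names containing characters such as `&`, `<` or `"` would need to be escaped before being inserted into the XML.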

Use the following cell to push the example file to a location on Amazon S3 that this notebook has access to:

In [None]:
BUCKET = '<<YOUR-BUCKET>>'
PREFIX = '<<YOUR-PREFIX>>'
!aws s3 cp $fname s3://$BUCKET/$PREFIX/timeseries.csv

To ensure your LabelStudio instance will have access to your data in Amazon S3, you need to configure [**cross-origin resource sharing**](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enabling-cors-examples.html) (CORS) for your bucket. CORS defines a way for client web applications (LabelStudio in our case) that are loaded in one domain to interact with resources in a different domain. To enable CORS on your bucket using the S3 console, follow the documentation linked above and use the following JSON document as the CORS configuration. Don't forget to replace `<<notebook_name>>` and `<<current_region>>` with their values (see above):

```json
[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "https://<<notebook_name>>.notebook.<<current_region>>.sagemaker.aws"
        ],
        "ExposeHeaders": [
            "x-amz-server-side-encryption",
            "x-amz-request-id",
            "x-amz-id-2"
        ],
        "MaxAgeSeconds": 3000
    }
]
```
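If you prefer to stay in the notebook, the same configuration can also be applied programmatically. The following is a sketch using the boto3 `put_bucket_cors` call. The helper names `build_cors_configuration` and `apply_cors` are hypothetical, and the sketch assumes that the notebook's role is allowed to change the bucket's CORS settings:

```python
def build_cors_configuration(notebook_name, region):
    """Build the CORS rules shown above as a boto3-style dictionary."""
    origin = f'https://{notebook_name}.notebook.{region}.sagemaker.aws'
    return {
        'CORSRules': [{
            'AllowedHeaders': ['*'],
            'AllowedMethods': ['GET'],
            'AllowedOrigins': [origin],
            'ExposeHeaders': [
                'x-amz-server-side-encryption',
                'x-amz-request-id',
                'x-amz-id-2',
            ],
            'MaxAgeSeconds': 3000,
        }]
    }

def apply_cors(bucket, notebook_name, region):
    """Apply the CORS rules to a bucket (hypothetical helper, needs AWS access)."""
    import boto3  # imported here so building the rules does not require boto3

    s3 = boto3.client('s3')
    s3.put_bucket_cors(
        Bucket=bucket,
        CORSConfiguration=build_cors_configuration(notebook_name, region),
    )

# Placeholder values, used only to show the resulting allowed origin
config = build_cors_configuration('my-notebook', 'eu-west-1')
print(config['CORSRules'][0]['AllowedOrigins'])
```

In this notebook, the call would be `apply_cors(BUCKET, notebook_name, current_region)`. Note that `put_bucket_cors` replaces any CORS configuration already attached to the bucket.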

### Project creation
Once running, a LabelStudio instance can be queried and manipulated through a set of APIs [**documented here**](https://labelstud.io/api). The following cell will create a new labeling project in the currently running LabelStudio instance:

In [None]:
payload = {
    "title":"Synthetic data labeling",
    "description":"Time series labeling job for synthetic data",
    "label_config": template,
    "is_published":True
}

headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Token {token}'
}

response = requests.post('http://localhost:8080/api/projects/', headers=headers, data=json.dumps(payload))
project_id = response.json()['id']
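The cell above assumes that the request succeeds. If the token is wrong or the template is rejected, the response will not contain an `id` and the cell fails with a `KeyError` that hides the real cause. A small, hypothetical helper (`extract_id` is not part of LabelStudio or `requests`) can surface the error message instead. It only relies on the `status_code`, `text` and `json()` members of a `requests` response:

```python
def extract_id(response):
    """Return the 'id' field of an API response, or raise with the error details."""
    if response.status_code >= 400:
        raise RuntimeError(
            f'LabelStudio API call failed ({response.status_code}): {response.text}'
        )
    return response.json()['id']
```

With this helper, the last line of the cell becomes `project_id = extract_id(response)`, and the same function can be reused for the other API calls in this notebook.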

### S3 storage configuration
The following part will configure Amazon S3 as a source to provide the time series to label. To configure this, LabelStudio requires temporary credentials to synchronize the data and initialize the tasks. Let's use the last credentials obtained while running this session in our notebook:

In [None]:
current_credentials = boto3.Session().get_credentials().get_frozen_credentials()

ACCESS_KEY    = current_credentials.access_key
SECRET_KEY    = current_credentials.secret_key
SESSION_TOKEN = current_credentials.token

Using these credentials, we will create a new Amazon S3 data source in our LabelStudio instance:

In [None]:
payload = {
    "presign": True,
    "title": "Time series data source",
    "bucket": BUCKET,
    "prefix": PREFIX + '/',
    "regex_filter": ".*csv",
    "use_blob_urls": True,
    "aws_access_key_id": ACCESS_KEY,
    "aws_secret_access_key": SECRET_KEY,
    "aws_session_token": SESSION_TOKEN,
    "region_name": current_region,
    "recursive_scan": True,
    "project": project_id
}

response = requests.post('http://localhost:8080/api/storages/s3', headers=headers, data=json.dumps(payload))
storage_id = response.json()['id']
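The `regex_filter` value deserves a closer look. The sketch below assumes the filter is applied to each object key with the semantics of Python's `re.match` (this is an assumption about LabelStudio's behavior, not something this notebook verifies) and compares the pattern used above with a stricter, hypothetical alternative on a few made-up keys:

```python
import re

# Made-up object keys, for illustration only
keys = [
    'my-prefix/timeseries.csv',
    'my-prefix/archive/old.csv',
    'my-prefix/notes.csv.bak',
    'my-prefix/readme.txt',
]

def matching_keys(pattern, keys):
    """Return the keys matched by the pattern, anchored at the start of the key."""
    regex = re.compile(pattern)
    return [key for key in keys if regex.match(key)]

loose = matching_keys(r'.*csv', keys)       # pattern used in the cell above
strict = matching_keys(r'.*\.csv$', keys)   # only keys ending in ".csv"

print(loose)
print(strict)
```

Under that assumption, `.*csv` also selects `notes.csv.bak`, while `.*\.csv$` only keeps keys that end with the `.csv` extension. For the example bucket layout used here, both patterns select the same file.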

Synchronizing allows LabelStudio to search for any `csv` file located under the provided data source:

In [None]:
payload = {
    "project": project_id
}

response = requests.post(f'http://localhost:8080/api/storages/s3/{storage_id}/sync', headers=headers, data=json.dumps(payload))
task_id = response.json()['id']

## Label your time series
---

In [None]:
display(Markdown(f'[**Click here**](https://{notebook_name}.notebook.{current_region}.sagemaker.aws/proxy/8080/) to open **LabelStudio** in a new tab'))

Clicking the previous link opens your LabelStudio instance in a new tab, where you will be asked to log in. When you ran the LabelStudio Docker image in the previous notebook, you also initialized a user by defining a `username` and `password`. Use these credentials to log in:

<img src="assets/label-studio-login.png" alt="Login" />

Once logged in, you should already see a project:
    
<img src="assets/label-studio-project.png" alt="Projects list" />

Click anywhere on this project to bring up the time series to annotate. Each time series dataset will appear as an individual task to label:

<img src="assets/label-studio-tasks.png" alt="Tasks list" />

Scroll down to the bottom of the time series view on the right and reduce the time period using the overview slider until the time series plot appears. You can then start labeling your data (check out the [**LabelStudio website**](https://labelstud.io/) for more details about the labeling process):

<img src="assets/label-studio-overview.png" alt="Labeling time series data" />

Once you have created a few labels, scroll up and click on the `Submit` button. The annotations are saved in LabelStudio's local database (you can also configure a target location on Amazon S3).

## Collect your annotations
---
Use the following API call to get the labels from your previous labeling job:

In [None]:
# The task ID is already part of the URL, so no request body is needed
response = requests.get(f'http://localhost:8080/api/tasks/{task_id}/annotations', headers=headers)
response.raise_for_status()

# Keep the start and end timestamps of each range labeled in the first annotation
annotations_df = pd.DataFrame([result['value'] for result in response.json()[0]['result']])[['start', 'end']]
annotations_df
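To clarify what the cell above extracts, here is a sketch that runs the same logic in plain Python on a hypothetical response body. The sample is trimmed to the fields used here and its exact shape is an assumption, since a real response contains more fields. The helper name `extract_ranges` is illustrative:

```python
# Hypothetical response body: a list of annotations, each with a list of results
sample_response = [
    {
        'id': 1,
        'result': [
            {
                'type': 'timeserieslabels',
                'value': {
                    'start': '2021-01-03 10:00:00',
                    'end': '2021-01-03 18:00:00',
                    'timeserieslabels': ['Anomaly'],
                },
            },
            {
                'type': 'timeserieslabels',
                'value': {
                    'start': '2021-01-07 02:00:00',
                    'end': '2021-01-07 05:30:00',
                    'timeserieslabels': ['Anomaly'],
                },
            },
        ],
    }
]

def extract_ranges(annotations):
    """Collect the (start, end) pair of every labeled range in every annotation."""
    ranges = []
    for annotation in annotations:
        for result in annotation['result']:
            value = result['value']
            ranges.append((value['start'], value['end']))
    return ranges

print(extract_ranges(sample_response))
```

Unlike the cell above, which only reads the first annotation (`response.json()[0]`), this sketch collects the ranges from every annotation attached to the task.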

You can now save this dataframe as a CSV file ready to be used by Lookout for Equipment:

In [None]:
annotations_df.to_csv('labels.csv', index=False, header=False)
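Before using the label file, it can be worth checking that every range is valid. The sketch below drops ranges whose end is not after their start, sorts the remaining ones and writes them as header-less CSV text. It assumes that the timestamps use the same `%Y-%m-%d %H:%M:%S` format as the labeling template, and the helper names `clean_ranges` and `to_csv_text` are hypothetical:

```python
import csv
import io
from datetime import datetime

# Assumed timestamp format: the same as `timeFormat` in the labeling template
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'

def clean_ranges(ranges):
    """Drop ranges whose end is not after their start and sort the rest by start."""
    parsed = [
        (datetime.strptime(start, TIME_FORMAT), datetime.strptime(end, TIME_FORMAT))
        for start, end in ranges
    ]
    return sorted((start, end) for start, end in parsed if end > start)

def to_csv_text(ranges):
    """Serialize the ranges as CSV text with no header and no index column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for start, end in ranges:
        writer.writerow([start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)])
    return buffer.getvalue()

# Made-up ranges: the first one is inverted and gets dropped
sample_ranges = [
    ('2021-01-05 00:00:00', '2021-01-04 00:00:00'),
    ('2021-01-03 10:00:00', '2021-01-03 18:00:00'),
    ('2021-01-01 00:00:00', '2021-01-02 00:00:00'),
]

print(to_csv_text(clean_ranges(sample_ranges)))
```

In the notebook, the input could be built from the dataframe with `list(annotations_df.itertuples(index=False, name=None))`.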

## Cleanup
---
If you want to stop LabelStudio, go back to the first notebook and click on the `Interrupt the kernel` button in that notebook's toolbar.

If you don't want to keep your labeling projects with your ongoing label work, you can safely delete the `/home/ec2-user/SageMaker/label-studio-data` folder where all the label data is stored.

**Do not delete** this folder if you want to continue your labeling work later, or if you have not finished processing the labeling job outputs.