{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon SageMaker Ground Truth Demonstration for Named Entity Recognition\n", "\n", "1. [Introduction](#introduction)\n", "2. [Run a Ground Truth labeling job](#run-a-ground-truth-labeling-job)\n", "    1. [Prepare the data](#prepare-the-data)\n", "    2. [Specify the categories](#specify-the-categories)\n", "    3. [Assign private team to test your task](#assign-private-team-to-test-your-task)\n", "    4. [Define pre-built lambda functions for use in the labeling job](#define-pre-built-lambda-functions-for-use-in-the-labeling-job)\n", "    5. [Submit the Ground Truth job request](#submit-the-ground-truth-job-request)\n", "    6. [Monitor job progress](#monitor-job-progress)\n", "    7. [View Task Results](#view-task-results)\n", "3. [Clean Up](#clean-up)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "This sample notebook takes you through an end-to-end workflow to demonstrate the functionality of SageMaker Ground Truth. We'll start with an unlabeled text data set and use SageMaker Ground Truth to acquire labels for entities such as Person, Organization, GeoLocation and Date. Before you begin, we highly recommend you start a Ground Truth labeling job through the AWS Console first to familiarize yourself with the workflow. The AWS Console offers less flexibility than the API, but is simple to use.\n", "\n", "#### Cost and runtime\n", "You can run this demo in two modes:\n", "1. Set `RUN_FULL_AL_DEMO = True` in the next cell to label all 100 downloaded texts. This should cost about \\$800 given the current [Ground Truth pricing scheme](https://aws.amazon.com/sagemaker/groundtruth/pricing/). The total end-to-end runtime should be about 2-4 days.\n", "2. Set `RUN_FULL_AL_DEMO = False` in the next cell to label only 10 texts. This should cost about \\$0.80. **This is the cheaper version of the demo. 
Some of the analysis plots might look awkward, but you should still be able to see good results on the 10 human-annotated texts.**\n", "\n", "#### Prerequisites\n", "To run this notebook, you can simply execute each cell one-by-one. To understand what's happening, you'll need:\n", "* An S3 bucket you can write to. The following cell uses the default bucket of your SageMaker session; change `BUCKET` there to use a different one. The bucket must be in the same region as this SageMaker Notebook instance. You can also change the `EXP_NAME` to any valid S3 prefix. All the files related to this experiment will be stored in that prefix of your bucket.\n", "* Familiarity with Python and [pandas](https://pandas.pydata.org/).\n", "* Basic familiarity with [Amazon S3](https://docs.aws.amazon.com/s3/index.html).\n", "* Basic understanding of [Amazon SageMaker](https://aws.amazon.com/sagemaker/).\n", "* Basic familiarity with the [AWS Command Line Interface (CLI)](https://aws.amazon.com/cli/). Set it up with credentials to access the AWS account you're running this notebook from. This should work out-of-the-box on SageMaker Jupyter Notebook instances.\n", "\n", "This notebook is only tested on a SageMaker notebook instance. The runtimes given are approximate; we used an `ml.m4.xlarge` instance in our tests. However, you can likely run it on a local instance by first executing the cell below on SageMaker, and then copying the `role` string to your local copy of the notebook.\n", "\n", "NOTE: This notebook will create/remove subdirectories in its working directory. We recommend placing this notebook in its own directory before running it." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The autoreload extension is already loaded. 
To reload it, use:\n", " %reload_ext autoreload\n" ] } ], "source": [ "# cell 01\n", "%load_ext autoreload\n", "%autoreload 2\n", "import os\n", "import json\n", "import time\n", "from datetime import datetime\n", "import pandas as pd\n", "import boto3\n", "import sagemaker\n", "from urllib.parse import urlparse\n", "\n", "sess = sagemaker.Session()\n", "BUCKET = sess.default_bucket()\n", "\n", "EXP_NAME = \"label-text/ner-annotation\"  # Any valid S3 prefix.\n", "RUN_FULL_AL_DEMO = False  # See 'Cost and runtime' in the Markdown cell above!\n", "VERIFY_USING_PRIVATE_WORKFORCE = True  # Use a private team for the labeling job." ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 02\n", "# Make sure the bucket is in the same region as this notebook.\n", "role = sagemaker.get_execution_role()\n", "region = boto3.session.Session().region_name\n", "\n", "s3 = boto3.client(\"s3\")\n", "bucket_region = s3.head_bucket(Bucket=BUCKET)[\"ResponseMetadata\"][\"HTTPHeaders\"][\n", "    \"x-amz-bucket-region\"\n", "]\n", "assert (\n", "    bucket_region == region\n", "), \"Your S3 bucket {} and this notebook need to be in the same region.\".format(BUCKET)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Run a Ground Truth labeling job\n", "**This section should take about 30 minutes to complete.**\n", "\n", "We will first run a labeling job. This involves several steps: collecting the text for NER we want labeled, specifying the possible label categories, creating instructions, and writing a labeling job specification. In addition, we highly recommend running a (free) mock job using a private workforce before you submit any job to the public workforce. This notebook will explain how to do that as an optional step. Without a private workforce, this section should take about 3 hours until your labeling job completes. 
However, this may vary depending on the availability of the public annotation workforce.\n", "\n", "## Prepare the data\n", "We will first download text and labels of a subset of the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset. Our dataset will include entities in the following categories:\n", "\n", "* Person\n", "* Organization\n", "* GeoLocation\n", "* Date\n", "\n", "\n", "If you chose `RUN_FULL_AL_DEMO = False`, then we will choose a subset of 10 unstructured texts in this dataset. This is a diverse dataset of different entities, and should be fun for the human annotators to work with. You are free to ask the annotators to annotate any text you wish. If the text contains inappropriate content, you must adjust the labeling job request this notebook produces; please check the Ground Truth documentation.\n", "\n", "We will write these texts to the corresponding *input manifest*. The input manifest is a JSON Lines file in which each line is a JSON object holding one text we want Ground Truth to annotate under the `source` key. We will upload this manifest to our S3 `BUCKET`.\n", "\n", "#### Disclosure regarding the CoNLL-2003 dataset:\n", "The CoNLL-2003 dataset was created for the CoNLL-2003 shared task on language-independent named entity recognition. The following paper describes the shared task: it gives background information on the data sets (English and German) and the evaluation method, presents a general overview of the systems that have taken part in the task, and discusses their performance.\n", "\n", "Erik F. Tjong Kim Sang, Fien De Meulder\n", "*Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition.* arXiv:cs/0306050 [cs.CL], 2003. 
([link to PDF](https://arxiv.org/abs/cs/0306050v1))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 37263 100 37263 0 0 155k 0 --:--:-- --:--:-- --:--:-- 156k\n" ] } ], "source": [ "# cell 03\n", "# Download the data if it does not already exist\n", "if not os.path.isfile('ner_dataset.json'):\n", "    !curl -X GET \\\n", "        \"https://datasets-server.huggingface.co/rows?dataset=conll2003&config=conll2003&split=train&offset=0&limit=100\" > ner_dataset.json\n", "\n", "read_file = pd.DataFrame(columns=['source'])\n", "text_list = []\n", "# Read the downloaded JSON file\n", "with open('ner_dataset.json', \"r\") as f:\n", "    data = json.load(f)\n", "\n", "    # Collect the token list of every row\n", "    for i in data['rows']:\n", "        text_list.append(i['row']['tokens'])\n", "# Join each token list into a single text string\n", "text_list = [\" \".join(sub) for sub in text_list]\n", "\n", "read_file['source'] = text_list\n", "\n", "# If running the short version of the demo, keep only the first 10 texts.\n", "if not RUN_FULL_AL_DEMO:\n", "    text_data = read_file[:10]\n", "else:\n", "    text_data = read_file\n", "\n", "# Create a manifest (JSON Lines) file\n", "text_data.to_json(r'./input.manifest', orient='records', lines=True)\n", "\n", "# Upload the manifest to S3\n", "s3 = boto3.client(\"s3\")\n", "manifest_name = \"input/input.manifest\"\n", "s3.upload_file(r'./input.manifest', BUCKET, EXP_NAME + \"/\" + manifest_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After running the cell above, you should be able to go to `s3://BUCKET/EXP_NAME/input` in the [S3 console](https://console.aws.amazon.com/s3/) and see the `input.manifest` file. We recommend you inspect the contents of the files stored there! 
You can download them all to a local machine using the AWS CLI." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Specify the categories\n", "To run the text Named Entity Recognition labeling job, you need to decide on a set of classes the annotators can choose from. \n", "In our case, this list is `[\"Organization\", \"Person\", \"GeoLocation\", \"Date\"]`. In your own job you can choose any list of up to 10 classes. We recommend making the classes as unambiguous and concrete as possible. The categories should be mutually exclusive; each text can contain multiple entities. \n", "\n", "To work with Ground Truth, this list needs to be converted to a .json file and uploaded to the S3 `BUCKET`.\n", "\n", "*Note: The ordering of the labels or classes in the template governs the class indices that you will see downstream in the output manifest (this numbering is zero-indexed). In other words, the class that appears second in the template will correspond to class \"1\" in the output. If you later train a model on the labeled data, this class ordering is instrumental to interpreting the results.*" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 04\n", "# define the classes\n", "json_body = {\n", "    \"document-version\": \"2018-11-28\",\n", "    \"labels\": [\n", "        {\n", "            \"label\": \"Organization\",\n", "            \"shortDisplayName\": \"Org\"\n", "        },\n", "        {\n", "            \"label\": \"Person\",\n", "            \"shortDisplayName\": \"Per\"\n", "        },\n", "        {\n", "            \"label\": \"GeoLocation\",\n", "            \"shortDisplayName\": \"Gpe\"\n", "        },\n", "        {\n", "            \"label\": \"Date\",\n", "            \"shortDisplayName\": \"Dat\"\n", "        }\n", "    ],\n", "    \"instructions\": {\n", "        \"shortInstruction\": \"
<p>Enter description of the labels that workers have to choose from</p><p>Examples to help workers understand the label</p>\"\n", "        \"<p>PER - People, including fictional.</p>\"\n", "        \"<p>ORG - Companies, agencies, institutions.</p>\"\n", "        \"<p>GPE - Countries, cities, states.</p>\"\n", "        \"<p>DAT - Absolute or relative dates or periods.</p>\",\n", "\n", "        \"fullInstruction\": \"<ol><li>Read the text carefully.</li><li>Highlight words, phrases, or sections of the text.</li><li>Choose the label that best matches what you have highlighted.</li><li>To change a label, choose highlighted text and select a new label.</li><li>To remove a label from highlighted text, choose the X next to the abbreviated label name on the highlighted text.</li><li>You can select all of a previously highlighted text, but not a portion of it.</li></ol>\"\n", "    }\n", "}\n", "\n", "# upload the json to s3\n", "with open(\"class_labels.json\", \"w\") as f:\n", "    json.dump(json_body, f)\n", "\n", "s3.upload_file(\"class_labels.json\", BUCKET, EXP_NAME + \"/input/class_labels.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should now see `class_labels.json` in `s3://BUCKET/EXP_NAME/input`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assign private team to test your task\n", "\n", "Set up a private workforce team before running this step. \n", "\n", "Copy the private workteam ARN from Amazon SageMaker console > Ground Truth > Labeling workforces > Private Teams, and paste it into `private_workteam_arn` in the cell below.\n", "\n", "The [SageMaker Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management-private.html) has more details on the management of private workteams. " ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WORKTEAM_ARN : arn:aws:sagemaker:us-east-1:370501389570:workteam/private-crowd/smgt-immersionday\n" ] } ], "source": [ "# cell 05\n", "# define the workteam ARN for the labeling job\n", "if VERIFY_USING_PRIVATE_WORKFORCE:\n", "    private_workteam_arn = \"<< your private workteam ARN here >>\"\n", "    WORKTEAM_ARN = private_workteam_arn\n", "else:\n", "    workteam_arn = \"arn:aws:sagemaker:{}:394669845002:workteam/public-crowd/default\".format(region)\n", "    WORKTEAM_ARN = workteam_arn\n", "\n", "print(\"WORKTEAM_ARN : {}\".format(WORKTEAM_ARN))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define pre-built lambda functions for use in the labeling job \n", "\n", "Before we submit the request, we need to define the ARNs for four key components of the labeling job: 1) the workteam, 2) the annotation consolidation Lambda function, 3) the pre-labeling task Lambda function, and 4) the human task UI that workers use to perform the task. 
These ARNs are strings that contain region names and AWS service account numbers, so we will define a mapping below that will enable you to run this notebook in any of our supported regions. \n", "\n", "See the official documentation for the available ARNs:\n", "* [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management-public.html) for a discussion of the workteam ARN definition. There is only one valid selection if you choose to use the public workforce; if you elect to use a private workteam, you should check the corresponding ARN for the workteam.\n", "* [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_HumanTaskConfig.html#SageMaker-Type-HumanTaskConfig-PreHumanTaskLambdaArn) for available pre-human ARNs for other workflows.\n", "* [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_AnnotationConsolidationConfig.html#SageMaker-Type-AnnotationConsolidationConfig-AnnotationConsolidationLambdaArn) for available annotation consolidation ARNs for other workflows." 
] }, { "cell_type": "code", "execution_count": 55, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 06\n", "# Specify ARNs for resources needed to run a text Named Entity Recognition job.\n", "ac_arn_map = {\n", "    \"us-west-2\": \"081040173940\",\n", "    \"us-east-1\": \"432418664414\",\n", "    \"us-east-2\": \"266458841044\",\n", "    \"eu-west-1\": \"568282634449\",\n", "    \"ap-northeast-1\": \"477331159723\",\n", "}\n", "\n", "# PreHumanTaskLambdaArn for text NER\n", "prehuman_arn = \"arn:aws:lambda:{}:{}:function:PRE-NamedEntityRecognition\".format(\n", "    region, ac_arn_map[region]\n", ")\n", "\n", "# AnnotationConsolidationConfig for text NER\n", "acs_arn = \"arn:aws:lambda:{}:{}:function:ACS-NamedEntityRecognition\".format(region, ac_arn_map[region])\n", "\n", "# Built-in instruction template for text NER\n", "ui_arn = \"arn:aws:sagemaker:{}:394669845002:human-task-ui/NamedEntityRecognition\".format(region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit the Ground Truth job request\n", "You start a Ground Truth job by submitting a request to the API. The request contains the \n", "full configuration of the annotation task, and allows you to modify the fine details of\n", "the job that are fixed to default values when you use the AWS Console. The parameters that make up the request are described in more detail in the [SageMaker Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateLabelingJob.html).\n", "\n", "After you submit the request, you should be able to see the job in your AWS Console, at `Amazon SageMaker > Labeling Jobs`.\n", "You can track the progress of the job there. This job will take several hours to complete. \n", "\n", "### Verify your task using a private team [OPTIONAL]\n", "If you chose to follow the steps in [Assign private team to test your task](#assign-private-team-to-test-your-task), then you can first verify that your task runs as expected. To do this:\n", "1. 
Set `VERIFY_USING_PRIVATE_WORKFORCE` to `True` in cell 01 and rerun cell 05 so that the workteam ARN is updated.\n", "2. Run the next two cells. This will define the task and submit it to the private workforce (to you).\n", "3. After a few minutes, you should be able to see your task in your private workforce interface (the worker portal URL printed by cell 08).\n", "Please verify that the task appears as you want it to appear.\n", "4. If everything is in order, change `VERIFY_USING_PRIVATE_WORKFORCE` to `False`, rerun cell 05, and rerun the cell below to start the real annotation task!" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "{'LabelingJobArn': 'arn:aws:sagemaker:us-east-1:370501389570:labeling-job/ground-truth-text-ner-1686344604',\n", " 'ResponseMetadata': {'RequestId': '89f51d94-780d-4764-bb5f-99863ca4b1dd',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'x-amzn-requestid': '89f51d94-780d-4764-bb5f-99863ca4b1dd',\n", " 'content-type': 'application/x-amz-json-1.1',\n", " 'content-length': '107',\n", " 'date': 'Fri, 09 Jun 2023 21:03:23 GMT'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# cell 07\n", "# task description\n", "task_description = 'Apply the labels provided to specific words or phrases within the larger text block.'\n", "task_keywords = [\"text\", \"ner\", \"humans\"]\n", "task_title = 'Named Entity Recognition'\n", "job_name = \"ground-truth-text-ner-\" + str(int(time.time()))\n", "\n", "human_task_config = {\n", "    \"AnnotationConsolidationConfig\": {\n", "        \"AnnotationConsolidationLambdaArn\": acs_arn,\n", "    },\n", "    \"PreHumanTaskLambdaArn\": prehuman_arn,\n", "    \"MaxConcurrentTaskCount\": 100,  # Up to 100 texts will be sent to the workteam at a time.\n", "    \"NumberOfHumanWorkersPerDataObject\": 1,  # Number of workers required to label each text.\n", "    \"TaskAvailabilityLifetimeInSeconds\": 21600,  # Your workteam has 6 hours to complete all pending tasks.\n", "    \"TaskDescription\": task_description,\n", "    \"TaskKeywords\": task_keywords,\n", "    \"TaskTimeLimitInSeconds\": 300,  # Each text must be labeled within 5 minutes.\n", "    \"TaskTitle\": task_title,\n", "    \"UiConfig\": {\n", "        'HumanTaskUiArn': ui_arn,\n", "    },\n", "}\n", "\n", "if not VERIFY_USING_PRIVATE_WORKFORCE:\n", "    human_task_config[\"PublicWorkforceTaskPrice\"] = {\n", "        \"AmountInUsd\": {\n", "            \"Dollars\": 0,\n", "            \"Cents\": 1,\n", "            \"TenthFractionsOfACent\": 2,\n", "        }\n", "    }\n", "    human_task_config[\"WorkteamArn\"] = workteam_arn\n", "else:\n", "    human_task_config[\"WorkteamArn\"] = private_workteam_arn\n", "\n", "ground_truth_request = {\n", "    \"InputConfig\": {\n", "        \"DataSource\": {\n", "            \"S3DataSource\": {\n", "                \"ManifestS3Uri\": \"s3://{}/{}/{}\".format(BUCKET, EXP_NAME, manifest_name),\n", "            }\n", "        },\n", "        \"DataAttributes\": {\n", "            \"ContentClassifiers\": [\"FreeOfPersonallyIdentifiableInformation\", \"FreeOfAdultContent\"]\n", "        },\n", "    },\n", "    \"OutputConfig\": {\n", "        \"S3OutputPath\": \"s3://{}/{}/output/\".format(BUCKET, EXP_NAME),\n", "    },\n", "    \"HumanTaskConfig\": human_task_config,\n", "    \"LabelingJobName\": job_name,\n", "    \"RoleArn\": role,\n", "    \"LabelAttributeName\": \"category\",\n", "    \"LabelCategoryConfigS3Uri\": \"s3://{}/{}/input/class_labels.json\".format(BUCKET, EXP_NAME),\n", "    \"Tags\": [\n", "        {\n", "            'Key': 'text',\n", "            'Value': 'NER'\n", "        },\n", "    ]\n", "}\n", "\n", "sagemaker_client = boto3.client(\"sagemaker\")\n", "sagemaker_client.create_labeling_job(**ground_truth_request)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "tags": [] }, "outputs": [], "source": [ "#sagemaker_client.describe_labeling_job(LabelingJobName=job_name)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!\n", "https://so3zqudq9a.labeling.us-east-1.sagemaker.aws\n" ] } ], "source": [ "# cell 08\n", "workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]\n", "print(\"Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!\")\n", "print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Monitor job progress\n", "A Ground Truth job can take a few hours to complete (if your dataset is larger than 100 texts, it can take much longer than that!). One way to monitor the job's progress is through the AWS Console. In this notebook, we will instead poll the job status through the API.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can re-evaluate the next cell repeatedly. It sends a `describe_labeling_job` request which should tell you whether the job is completed or not. If it is, then 'LabelingJobStatus' will be 'Completed'." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 09\n", "# Re-evaluate repeatedly. It sends a `describe_labeling_job` request which should tell you whether the job is completed or not. 
If it is, then 'LabelingJobStatus' will be 'Completed'.\n", "job_status = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus']\n", "while job_status == 'InProgress':\n", "    print('Labeling job : {}, status : {}'.format(job_name, job_status))\n", "    time.sleep(30)\n", "    job_status = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus']\n", "print('Labeling job : {}, status : {}'.format(job_name, job_status))" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Wait for workers to complete the task" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View Task Results " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once work is completed, Amazon SageMaker Ground Truth stores results in your S3 bucket and sends a CloudWatch event. Your results should be available in the S3 OUTPUT_PATH when all work is completed." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 10\n", "# output path\n", "S3_OUTPUT = sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['OutputConfig']['S3OutputPath'] + job_name\n", "print('S3 OUTPUT_PATH : {}'.format(S3_OUTPUT))\n", "\n", "# Download human annotation data.\n", "!aws s3 cp {S3_OUTPUT + '/manifests/output/output.manifest'} \"./output/\" #--recursive --quiet" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 11\n", "# Read the output manifest: one JSON object per line\n", "data = []\n", "with open('./output/output.manifest') as f:\n", "    for line in f:\n", "        json_data = json.loads(line)\n", "        data.append(json_data)\n", "\n", "# Collect one row per annotated entity: text, label, start and end offsets\n", "output = [[] for i in range(4)]\n", "for item in data:\n", "    text = item['source']\n", "    for line in item['category']['annotations']['entities']:\n", "        output[0].append(text)\n", "        output[1].append(line['label'])\n", "        output[2].append(line['startOffset'])\n", "        
output[3].append(line['endOffset'])\n", "\n", "pd.DataFrame(data={'source': output[0], 'entities': output[1], 'startOffset': output[2], 'endOffset': output[3]})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean Up [OPTIONAL]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's clean up: if the labeling job is still in progress, stop it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# cell 12\n", "if sagemaker_client.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus'] == 'InProgress':\n", "    sagemaker_client.stop_labeling_job(LabelingJobName=job_name)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## The End!" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": 
false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 21, "_isFastLaunch": false, "category": "Compute 
optimized", "gpuNum": 0, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 28, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 29, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 34, "_isFastLaunch": false, 
"category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 35, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 46, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 47, "_isFastLaunch": false, 
"category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 } ], "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }