{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create an Active Learning Workflow using Amazon SageMaker Ground Truth" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "### Using this Notebook\n", "\n", "Please set the kernel to *Python 3* and the image to *TensorFlow 2.10.0 Python 3.9 CPU Optimized* when running this notebook. Select `ml.t3.medium` as the instance type.\n", "\n", "In this notebook, we also use [AWS SAM](https://aws.amazon.com/serverless/sam/) to manage and deploy the AWS CloudFormation templates for the active learning workflow. The AWS CloudFormation template creates the AWS Lambda functions, AWS Step Functions state machines, and IAM roles that the active learning workflow uses.\n", "\n", "While following along with this blog post, we recommend that you leave most of the cells unmodified. However, the notebook will indicate where you can modify variables to create the resources needed for a custom labeling job.\n", "\n", "If you plan to customize the Ground Truth labeling job request configuration below, you will also need the resources required to create a labeling job. For more information, see [Use Amazon SageMaker Ground Truth for Data Labeling](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html). \n", "\n", "Run the code cells in this notebook to configure a labeling job request in JSON format. This request JSON is used as the input to the active learning workflow and determines how your labeling job task appears to human workers. \n", "\n", "To customize this notebook, modify the cells below and configure the Ground Truth labeling job request (`human_task_config`) to meet your requirements. To learn how to create a Ground Truth labeling job using the Amazon SageMaker API, see [CreateLabelingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prerequisites\n", "\n", "Please complete the initial prerequisites mentioned in the lab instructions before proceeding with the cells in this notebook. After completing the prerequisites, you should have: \n", "- The ARN of the private workforce that will perform human annotation\n", "- The labeling portal sign-in URL that your private workforce uses to log in" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install --upgrade setuptools aws-sam-cli" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set up AWS SAM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Restart the kernel so that this notebook uses the updated packages." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%html\n", "\n", "

Restart your kernel for this notebook to use the updated packages.

\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "#### Set up SageMaker environment and variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "import os, sys, json, sagemaker, pandas as pd, boto3, numpy as np, re\n", "from sagemaker import get_execution_role,image_uris, model_uris, script_uris, hyperparameters\n", "from sagemaker.tensorflow import TensorFlow\n", "\n", "sess = sagemaker.Session()\n", "\n", "role = get_execution_role()\n", "region = sess.boto_session.region_name\n", "account = sess.boto_session.client(\"sts\").get_caller_identity()[\"Account\"]\n", "smclient = boto3.Session().client('sagemaker')\n", "bucket = sess.default_bucket()\n", "stack_name= \"active-learning-stack\"\n", "key = \"sagemaker-byoal-pretrained\"\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select a pre-trained model\n", "***\n", "In this lab, we use a pretrained, uncased BERT TensorFlow model. For further details, and for other pretrained models that are available for transfer learning with the Text Classification - TensorFlow algorithm, see [TensorFlow Hub Models](https://docs.aws.amazon.com/sagemaker/latest/dg/text-classification-tensorflow-Models.html). 
This model is also available through JumpStart.\n", "\n", "A complete list of JumpStart models can also be accessed at [JumpStart Models](https://sagemaker.readthedocs.io/en/stable/doc_utils/jumpstart.html#).\n", "***" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_id = \"tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_version = \"*\"  # \"*\" selects the latest available version of the model\n", "training_instance_type = \"ml.p3.2xlarge\"\n", "inference_instance_type = \"ml.m5.xlarge\"\n", "\n", "# Slashes in each URI are escaped so that the values can be used in the sed expressions later.\n", "\n", "# Retrieve the docker image\n", "train_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None,\n", " model_id=model_id,\n", " model_version=model_version,\n", " image_scope=\"training\",\n", " instance_type=training_instance_type,\n", ").replace(\"/\", \"\\\\/\")\n", "\n", "# Retrieve the training script\n", "train_source_uri = script_uris.retrieve(\n", " model_id=model_id, model_version=model_version, script_scope=\"training\"\n", ").replace(\"/\", \"\\\\/\")\n", "\n", "\n", "# Retrieve the pre-trained model tarball to further fine-tune\n", "base_model_uri = model_uris.retrieve(\n", " model_id=model_id, model_version=model_version, model_scope=\"training\"\n", ").replace(\"/\", \"\\\\/\")\n", "\n", "# Retrieve the inference docker container uri.\n", "deploy_image_uri = image_uris.retrieve(\n", " region=None,\n", " framework=None,\n", " image_scope=\"inference\",\n", " model_id=model_id,\n", " model_version=model_version,\n", " instance_type=inference_instance_type,\n", ").replace(\"/\", \"\\\\/\")\n", "print(deploy_image_uri)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Build and deploy AWS CloudFormation template using AWS SAM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Update template.yaml. 
AWS SAM uses template.yaml to build the backend infrastructure, which consists of a state machine, helper utilities, and AWS Lambda functions.\n", "\n", "Update the configuration variables for the active learning process:\n", "\n", "- `byom`: Bring your own model. This parameter determines whether the Lambda function returns the Amazon BlazingText algorithm model or your own custom model docker image. This is set to `false` because we are using a pretrained TensorFlow BERT model.\n", "- `byomimage`: Your own custom model image. This is set to NotApplicable because we are using a pretrained BERT model.\n", "- `sagemaker_program`: The training script used to train your model. In this case, we use a transfer learning script to fine-tune the pretrained BERT model, so this is set to the script name transfer_learning.py.\n", "- `sagemaker_submit_directory`: The directory that contains the training source code and any dependencies other than the entry point file. This is set to the S3 URI where the training script is hosted.\n", "- `pretrain_model`: Determines whether the active learning process uses a pretrained model. This is set to `true` because we are using a pretrained BERT model.\n", "- `pretrain_algo_train_repo`: The training image URI of the pretrained BERT algorithm. Because we are using a pretrained BERT model, this is set to the training image URI.\n", "- `base_model_uri`: The URI of the pretrained BERT model that will be fine-tuned with custom data. Because we are using a pretrained BERT model, this is set to the URI of the pretrained BERT model.\n", "- `pretrain_algo_inference_repo`: The inference image URI of the pretrained BERT algorithm. 
Because we are using a pretrained BERT model, this is set to the inference image URI.\n", "\n", "\n", "Update the batch strategy to `SingleRecord` in the `CreateTransformJob` step of the ActiveLearning-active-learning-stack state machine.\n", "This is because the TensorFlow algorithm used here supports `SingleRecord`.\n", "- `BatchStrategy`: `SingleRecord`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%bash -s \"$train_image_uri\" \"$train_source_uri\" \"$base_model_uri\" \"$deploy_image_uri\"\n", "\n", "cd src\n", "\n", "# Update byom value to false\n", "sed -i \"0,/^\\([[:space:]]*byom: *\\).*/s//\\1\"false\"/\" template.yaml\n", "\n", "# Update byomimage value to NotApplicable\n", "sed -i \"0,/^\\([[:space:]]*byomimage: *\\).*/s//\\1\"NotApplicable\"/\" template.yaml\n", "\n", "# Update pretrained model flag to true\n", "sed -i \"0,/^\\([[:space:]]*pretrain_model: *\\).*/s//\\1\"true\"/\" template.yaml\n", "\n", "# Update training script for the pretrained model\n", "sed -i \"0,/^\\([[:space:]]*sagemaker_program: *\\).*/s//\\1\"transfer_learning.py\"/\" template.yaml\n", "\n", "# Update pretrained algorithm training repository\n", "sed -i \"0,/^\\([[:space:]]*pretrain_algo_train_repo: *\\).*/s//\\1\"${1}\"/\" template.yaml\n", "\n", "# Update SageMaker submit directory\n", "sed -i \"0,/^\\([[:space:]]*sagemaker_submit_directory: *\\).*/s//\\1\"${2}\"/\" template.yaml\n", "\n", "# Update base pretrained model URI\n", "sed -i \"0,/^\\([[:space:]]*base_model_uri: *\\).*/s//\\1\"${3}\"/\" template.yaml\n", "\n", "# Update pretrained algorithm inference repository\n", "sed -i \"0,/^\\([[:space:]]*pretrain_algo_inference_repo: *\\).*/s//\\1\"${4}\"/\" template.yaml\n", "\n", "# Update BatchStrategy to SingleRecord\n", "sed -i 's/MultiRecord/SingleRecord/g' template.yaml\n", "\n", "cat template.yaml\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"%%bash -s \"$region\" \"$stack_name\" \"$bucket\"\n", "\n", "# Change directory to src folder\n", "cd src\n", "\n", "#Delete SAM build folder to clean up any exisiting SAM templates\n", "rm -rf .aws-sam \n", "\n", "#Build AWS SAM artifacts to deploy AWS Cloudformation template\n", "sam build\n", "\n", "#Deploy AWS SAM template\n", "sam deploy --region ${1} --stack-name ${2} --s3-bucket ${3} --capabilities CAPABILITY_IAM CAPABILITY_AUTO_EXPAND\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "#### Prepare labeling input manifest file" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "We will create an input manifest file for our active learning workflow using the newsCorpora.csv file from the [UCI News Dataset](https://archive.ics.uci.edu/ml/datasets/News+Aggregator). This dataset contains a list of about 420,000 articles that fall into one of four categories: Business (b), Science & Technology (t), Entertainment (e) and Health & Medicine (m). We will randomly choose 10,000 articles from that file to create our dataset.\n", "\n", "For the active learning loop to start, 20% of the data must be labeled. To quickly test the active learning component, we will include 20% (`labeled_count`) of the original labels provided in the dataset in our input manifest. We use this partially-labeled dataset as the input to the active learning loop." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "! 
wget -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip --no-check-certificate && unzip -o NewsAggregatorDataset.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "column_names = [\"TITLE\", \"URL\", \"PUBLISHER\", \"CATEGORY\", \"STORY\", \"HOSTNAME\", \"TIMESTAMP\"]\n", "manifest_file = \"partially-labeled.manifest\"\n", "news_data_all = pd.read_csv(\"newsCorpora.csv\", names=column_names, header=None, delimiter=\"\\t\")\n", "news_data = news_data_all.sample(n=10000, random_state=42)\n", "news_data = news_data[[\"TITLE\", \"CATEGORY\"]]\n", "news_data " ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "We will clean our data set using *pandas*." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "# Assign the cleaned column back instead of relying on chained inplace replacement.\n", "news_data[\"TITLE\"] = news_data[\"TITLE\"].replace('\"', \"\", regex=True)\n", "news_data[\"TITLE\"] = news_data[\"TITLE\"].replace(r\"[^\\w\\s]\", \"\", regex=True)\n", "news_data[\"TITLE\"] = news_data[\"TITLE\"].str.split(\"\\n\").str[0]\n", "news_data[\"CATEGORY_LABELS\"]=news_data[\"CATEGORY\"]\n", "news_data[\"CATEGORY\"] = news_data[\"CATEGORY\"].astype(\"category\").cat.codes\n", "news_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "fixed = news_data[\"TITLE\"].str.lower().replace('\"', \"\")" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "The following cell will create our partially-labeled input manifest file, and push it to our S3 bucket. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "import json\n", "\n", "total = len(news_data)\n", "labeled_count = int(total / 5) # 20% of the dataset is labeled.\n", "label_map = {\n", " \"b\": \"Business\",\n", " \"e\": \"Entertainment\",\n", " \"m\": \"Health & Medicine\",\n", " \"t\": \"Science and Technology\",\n", "}\n", "labeled_series = pd.Series(\n", " data=news_data.iloc[:labeled_count].TITLE.values,\n", " index=news_data.iloc[:labeled_count].CATEGORY.values,\n", ")\n", "annotation_metadata = b\"\"\"{ \"category-metadata\" : { \"confidence\": 1.0, \"human-annotated\": \"yes\", \"type\": \"groundtruth/text-classification\"} }\"\"\"\n", "annotation_metadata_dict = json.loads(annotation_metadata)\n", "with open(manifest_file, \"w\") as outfile:\n", " for items in labeled_series.items():\n", " labeled_record = dict()\n", " labeled_record[\"source\"] = items[1]\n", " labeled_record[\"category\"] = int(items[0])\n", " labeled_record.update(annotation_metadata_dict)\n", " outfile.write(json.dumps(labeled_record) + \"\\n\")\n", "\n", "unlabeled_series = pd.Series(\n", " data=news_data.iloc[labeled_count:].TITLE.values,\n", " index=news_data.iloc[labeled_count:].CATEGORY.values,\n", ")\n", "with open(manifest_file, \"a\") as outfile:\n", " for items in unlabeled_series.items():\n", " outfile.write('{\"source\":\"' + items[1] + '\"}\\n')\n", "\n", "boto3.resource(\"s3\").Bucket(bucket).upload_file(manifest_file, key + \"/\" + manifest_file)\n", "manifest_file_uri = \"s3://{}/{}\".format(bucket, key + \"/\" + manifest_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "# Use s3 client to upload relevant json strings to s3.\n", "s3_client = boto3.client(\"s3\")" ] }, { "cell_type": "markdown", 
"metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "This cell will specify the labels that workers will use to categorize the articles. To customize your labeling job, add your own labels here. To learn more, see [LabelCategoryConfigS3Uri](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html#sagemaker-CreateLabelingJob-request-LabelCategoryConfigS3Uri)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "label_file_name = \"class_labels.json\"\n", "label_file = \"\"\"{\n", " \"document-version\": \"2018-11-28\",\n", " \"labels\": [\n", " {\n", " \"label\": \"Business\"\n", " },\n", " {\n", " \"label\": \"Entertainment\"\n", " },\n", " {\n", " \"label\": \"Health & Medicine\"\n", " },\n", " {\n", " \"label\": \"Science and Technology\"\n", " }\n", " ]\n", "}\"\"\"\n", "\n", "s3_client.put_object(Body=label_file, Bucket=bucket, Key=key + \"/\" + label_file_name)\n", "label_file_uri = \"s3://{}/{}\".format(bucket, key + \"/\" + label_file_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare custom worker task template " ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "The following cell will specify our custom worker task template. This template will configure the UI that workers will see when they open our text classification labeling job tasks. To learn how to customize this cell, see [Creating your custom labeling task template](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates-step2.html)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "template_file_name = \"instructions.template\"\n", "template_file = r\"\"\"\n", "\n", "\n", " \n", " {{ task.input.taskObject }} \n", " \n", "
  1. Read the text carefully.
  2. Read the examples to understand more about the options.
  3. Choose the appropriate label that best suits the text.
\n", "
\n", " \n", "

Example Business title:

US open: Stocks fall after Fed official hints at accelerated tapering.


\n", "

Example Entertainment title:

CBS negotiates three more seasons for The Big Bang Theory


\n", "

Example Health & Medicine title:

Blood Test Could Predict Alzheimer's. Good News?


\n", "

Example Science and Technology (t) title:

Elephants tell human friend from foe by voice.


\n", "

\n", "
\n", " \n", "
\n", "\"\"\"\n", "\n", "s3_client.put_object(Body=template_file, Bucket=bucket, Key=key + \"/\" + template_file_name)\n", "template_file_uri = \"s3://{}/{}\".format(bucket, key + \"/\" + template_file_name)" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "In this lab, we use a private workforce to label the dataset. To use a private work team to label your data objects, set `USE_PRIVATE_WORKFORCE` to `True` and set `private_workteam_arn` to the work team ARN that you copied in the *Prerequisites* section earlier. To use a private work team, you must have a private workforce in the same AWS Region as your labeling job task request. To learn more, see [Use a Private Workforce](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-private.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "USE_PRIVATE_WORKFORCE = True\n", "private_workteam_arn = \"Enter your Private Worker ARN\"" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "This cell automatically configures a public workforce ARN and the pre- and post-annotation Lambda ARNs (`prehuman_arn` and `acs_arn`, respectively). If `USE_PRIVATE_WORKFORCE` is `False`, a public workforce will be used to create your labeling job request. \n", "\n", "To customize your labeling job task type, you will need to modify `prehuman_arn` and `acs_arn`. \n", "\n", "If you are using one of the Ground Truth built-in task types, you can find the pre- and post-annotation Lambda ARNs using the following links. 
\n", "* Pre-annotation Lambda ARNs for built-in task types can be found in [HumanTaskConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HumanTaskConfig.html#API_HumanTaskConfig_Contents).\n", "* Post-annotation Lambda ARNs (Annotation Consolidation Lambda) for built-in task types can be found in [AnnotationConsolidationConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AnnotationConsolidationConfig.html#sagemaker-Type-AnnotationConsolidationConfig-AnnotationConsolidationLambdaArn).\n", "\n", "If you are creating a custom labeling job task, see [Step 3: Processing with AWS Lambda](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates-step3.html) to learn how to create custom pre- and post-annotation Lambda functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "# Specify ARNs for resources needed to run a text classification job.\n", "ac_arn_map = {\n", " \"us-west-2\": \"081040173940\",\n", " \"us-east-1\": \"432418664414\",\n", " \"us-east-2\": \"266458841044\",\n", " \"eu-west-1\": \"568282634449\",\n", " \"ap-northeast-1\": \"477331159723\",\n", "}\n", "\n", "public_workteam_arn = \"arn:aws:sagemaker:{}:394669845002:workteam/public-crowd/default\".format(\n", " region\n", ")\n", "prehuman_arn = \"arn:aws:lambda:{}:{}:function:PRE-TextMultiClass\".format(region, ac_arn_map[region])\n", "acs_arn = \"arn:aws:lambda:{}:{}:function:ACS-TextMultiClass\".format(region, ac_arn_map[region])" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "The following cell specifies our labeling job name, the description workers see, and tags that workers can use to find our labeling job task." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "job_name_prefix = \"byoal-news\"\n", "task_description = \"Classify news title to one of these 4 categories.\"\n", "task_keywords = [\"text\", \"classification\", \"humans\", \"news\"]\n", "task_title = task_description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Labeling job request" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Modify the following request to customize your labeling job request. For more information on the parameters below, see [CreateLabelingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateLabelingJob.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "human_task_config = {\n", " \"AnnotationConsolidationConfig\": {\n", " \"AnnotationConsolidationLambdaArn\": acs_arn,\n", " },\n", " \"PreHumanTaskLambdaArn\": prehuman_arn,\n", " \"MaxConcurrentTaskCount\": 50, # 50 texts will be sent at a time to the workteam.\n", " \"NumberOfHumanWorkersPerDataObject\": 1, # 1 workers will be enough to label each text.\n", " \"TaskAvailabilityLifetimeInSeconds\": 21600, # Your work team has 6 hours to complete all pending tasks.\n", " \"TaskDescription\": task_description,\n", " \"TaskKeywords\": task_keywords,\n", " \"TaskTimeLimitInSeconds\": 300, # Each text must be labeled within 5 minutes.\n", " \"TaskTitle\": task_title,\n", " \"UiConfig\": {\n", " \"UiTemplateS3Uri\": template_file_uri,\n", " },\n", "}\n", "\n", "if not USE_PRIVATE_WORKFORCE:\n", " human_task_config[\"PublicWorkforceTaskPrice\"] = {\n", " \"AmountInUsd\": {\n", " \"Dollars\": 0,\n", " \"Cents\": 1,\n", " \"TenthFractionsOfACent\": 
2,\n", " }\n", " }\n", " human_task_config[\"WorkteamArn\"] = public_workteam_arn\n", "else:\n", " human_task_config[\"WorkteamArn\"] = private_workteam_arn\n", "\n", "ground_truth_request = {\n", " \"InputConfig\": {\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"ManifestS3Uri\": manifest_file_uri,\n", " }\n", " },\n", " \"DataAttributes\": {\n", " \"ContentClassifiers\": [\"FreeOfPersonallyIdentifiableInformation\", \"FreeOfAdultContent\"]\n", " },\n", " },\n", " \"OutputConfig\": {\n", " \"S3OutputPath\": \"s3://{}/{}/output/\".format(bucket, key),\n", " },\n", " \"HumanTaskConfig\": human_task_config,\n", " \"LabelingJobNamePrefix\": job_name_prefix,\n", " \"RoleArn\": role,\n", " \"LabelAttributeName\": \"category\",\n", " \"LabelCategoryConfigS3Uri\": label_file_uri,\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "tags": [] }, "outputs": [], "source": [ "print(json.dumps(ground_truth_request, indent=2))\n", "\n", "step_func_input=json.dumps(ground_truth_request,default=str)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "#Get Stepfunction Client and start execution\n", "sfn=boto3.client('stepfunctions', region_name=region)\n", "stateMachineArn=\"arn:aws:states:{}:{}:stateMachine:ActiveLearningLoop-active-learning-stack\".format(region,account)\n", "response = sfn.start_execution(\n", " stateMachineArn=stateMachineArn,\n", " input=step_func_input\n", " \n", ")\n", "\n", "executionArn=response[\"executionArn\"] \n", "\n", "print(json.dumps(response, indent=2, default=str))" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false } }, "source": [ "Please refer to lab instructions for next steps on View active data labeling workflow and Human Annotation WorkFlow\n", "\n", "On successful completion of the active learning loop 
and the human annotation workflow, the state machine outputs the final output manifest file and the latest trained model output.\n", "\n", "Run the following cell after the active learning loop and the human annotation workflow complete successfully." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Clean up\n", "\n", "# Stop the active learning state machine execution if it is still running\n", "if sfn.describe_execution(executionArn=executionArn)[\"status\"] == \"RUNNING\":\n", "    sfn.stop_execution(executionArn=executionArn)\n", "    print('State Machine Execution Stopped : ' + executionArn)\n", "\n", "# Stop Ground Truth labeling jobs that are InProgress or Initializing\n", "response = smclient.list_labeling_jobs(NameContains='byoal-news')\n", "for job in response['LabelingJobSummaryList']:\n", "    if job['LabelingJobStatus'] in ('InProgress', 'Initializing'):\n", "        label_job_name = job['LabelingJobName']\n", "        smclient.stop_labeling_job(LabelingJobName=label_job_name)\n", "        print('Stopping Labeling Job: ' + label_job_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Delete the AWS CloudFormation stack created by AWS SAM" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$region\" \"$stack_name\"\n", "\n", "#Delete CloudFormation Stack \n", "sam delete --region ${1} --stack-name ${2} --no-prompts" ] } ], "metadata": { "availableInstances": [ { "_defaultOrder": 0, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.t3.medium", "vcpuNum": 2 }, { "_defaultOrder": 1, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.t3.large", "vcpuNum": 2 }, { "_defaultOrder": 2, "_isFastLaunch": false, "category": 
"General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.t3.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 3, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.t3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 4, "_isFastLaunch": true, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5.large", "vcpuNum": 2 }, { "_defaultOrder": 5, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.m5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 6, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 7, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 8, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 9, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 10, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 11, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 12, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.m5d.large", "vcpuNum": 2 }, { "_defaultOrder": 13, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": 
false, "memoryGiB": 16, "name": "ml.m5d.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 14, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.m5d.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 15, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.m5d.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 16, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.m5d.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 17, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.m5d.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 18, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.m5d.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 19, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.m5d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 20, "_isFastLaunch": false, "category": "General purpose", "gpuNum": 0, "hideHardwareSpecs": true, "memoryGiB": 0, "name": "ml.geospatial.interactive", "supportedImageNames": [ "sagemaker-geospatial-v1-0" ], "vcpuNum": 0 }, { "_defaultOrder": 21, "_isFastLaunch": true, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 4, "name": "ml.c5.large", "vcpuNum": 2 }, { "_defaultOrder": 22, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 8, "name": "ml.c5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 23, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.c5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 24, "_isFastLaunch": false, "category": "Compute optimized", 
"gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.c5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 25, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 72, "name": "ml.c5.9xlarge", "vcpuNum": 36 }, { "_defaultOrder": 26, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 96, "name": "ml.c5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 27, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 144, "name": "ml.c5.18xlarge", "vcpuNum": 72 }, { "_defaultOrder": 28, "_isFastLaunch": false, "category": "Compute optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.c5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 29, "_isFastLaunch": true, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g4dn.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 30, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g4dn.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 31, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g4dn.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 32, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g4dn.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 33, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g4dn.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 34, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g4dn.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 35, "_isFastLaunch": false, 
"category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 61, "name": "ml.p3.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 36, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 244, "name": "ml.p3.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 37, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 488, "name": "ml.p3.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 38, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.p3dn.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 39, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.r5.large", "vcpuNum": 2 }, { "_defaultOrder": 40, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.r5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 41, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.r5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 42, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.r5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 43, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.r5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 44, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.r5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 45, "_isFastLaunch": false, "category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 512, "name": "ml.r5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 46, "_isFastLaunch": false, 
"category": "Memory Optimized", "gpuNum": 0, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.r5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 47, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 16, "name": "ml.g5.xlarge", "vcpuNum": 4 }, { "_defaultOrder": 48, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 32, "name": "ml.g5.2xlarge", "vcpuNum": 8 }, { "_defaultOrder": 49, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 64, "name": "ml.g5.4xlarge", "vcpuNum": 16 }, { "_defaultOrder": 50, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 128, "name": "ml.g5.8xlarge", "vcpuNum": 32 }, { "_defaultOrder": 51, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 1, "hideHardwareSpecs": false, "memoryGiB": 256, "name": "ml.g5.16xlarge", "vcpuNum": 64 }, { "_defaultOrder": 52, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 192, "name": "ml.g5.12xlarge", "vcpuNum": 48 }, { "_defaultOrder": 53, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 4, "hideHardwareSpecs": false, "memoryGiB": 384, "name": "ml.g5.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 54, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 768, "name": "ml.g5.48xlarge", "vcpuNum": 192 }, { "_defaultOrder": 55, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4d.24xlarge", "vcpuNum": 96 }, { "_defaultOrder": 56, "_isFastLaunch": false, "category": "Accelerated computing", "gpuNum": 8, "hideHardwareSpecs": false, "memoryGiB": 1152, "name": "ml.p4de.24xlarge", "vcpuNum": 96 } ], 
"instance_type": "ml.c5.large", "kernelspec": { "display_name": "Python 3 (TensorFlow 2.10.0 Python 3.9 CPU Optimized)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/tensorflow-2.10.1-cpu-py39-ubuntu20.04-sagemaker-v1.2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 4 }