{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# Post-Processing Amazon Textract with Location-Aware Transformers\n", "\n", "# Part 1: Introduction and Data Collection\n", "\n", "> *This notebook works well with the `Data Science 3.0 (Python 3)` kernel on SageMaker Studio*\n", "\n", "Human-readable documents are more than just 1D sequences of text: Many have **complex layout and formatting structures** (like headers and footers, form fields, tables, paragraphs and multi-section layouts...) that are important for a wide range of information extraction tasks - and challenge the idea that pages always have one \"correct reading order\" at all.\n", "\n", "**Multi-modal, layout-aware language models** - like [**DocFormer** (2021, Appalaraju et al at AWS AI)](https://arxiv.org/abs/2106.11539), [**LayoutXLM** (2021, Xu et al)](https://arxiv.org/abs/2104.08836), and the LayoutLM series [v1](https://arxiv.org/abs/1912.13318), [v2](https://arxiv.org/abs/2012.14740), [v3](https://arxiv.org/abs/2204.08387) - augment traditional (sequence-based) text processing models to add features representing the **absolute positions of words/tokens on the page**: And other modalities too, like a pixel image representation of the source page.\n", "\n", "Architectures like these can **significantly improve accuracy** (over plain text methods or position-based templating), for information extraction tasks where *both* the visual layout and the text content are useful.\n", "\n", "> ⚠️ **Note:** You should be aware that several fully-managed document AI service offerings on AWS, are already capable of similar kinds of multi-modal document understanding under the hood: Particularly [Amazon Textract Queries](https://aws.amazon.com/about-aws/whats-new/2022/04/amazon-textract-launches-queries-feature-analyze-document-api/) and [Amazon Comprehend native document processing](https://aws.amazon.com/about-aws/whats-new/2021/09/amazon-comprehend-extract-entities-native-format/).\n", ">\n", "> While these AI services should usually be the first choice where available, custom SageMaker-based modelling as shown here may still be of interest in some cases: For example if you'd like to customize the model architeture or pre-training, or use a particular feature+language combination that's not currently available as a managed service.\n", "\n", "This multi-modal approach is useful for post-processing OCR results from services like [Amazon Textract](https://aws.amazon.com/textract/) (which returns both the detected text and the geometry of each word in the input document), and can be compatible with a broad range of downstream NLP tasks, like:\n", "\n", "- \"Self-supervised\" pre-training on Textracted but otherwise unlabelled documents\n", "- Classifying documents, pages, or individual lines/sequences\n", "- Extracting entities (e.g. 
via token/word classification)\n", "- Extractive question answering and other span extraction tasks\n", "- \"Translation\", generative question answering or other sequence generation\n", "\n", "In this sample we'll demonstrate applying open-source models in this family (including the multi-lingual **[LayoutXLM](https://huggingface.co/docs/transformers/model_doc/layoutxlm)** and English-optimized **LayoutLM [v1](https://huggingface.co/docs/transformers/model_doc/layoutlm)/[v2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2)**) to an example use case where Amazon Textract's [built-in functionality](https://aws.amazon.com/textract/features/) for extracting key-value \"Forms\" data and structured \"Tables\" data helps somewhat... But would require further post-processing for usable results.\n", "\n", "> ⚠️ **Note:** Although model implementations in the [Hugging Face Transformers library](https://huggingface.co/docs/transformers/index) used by this sample are generally published under the permissive Apache-2.0 license [as detailed on their GitHub repository](https://github.com/huggingface/transformers/blob/main/LICENSE), you should be aware that **license terms for pre-trained model weights** published on the [Hugging Face Model Hub](https://huggingface.co/models) may vary. In particular, there is [a difference](https://github.com/microsoft/unilm/issues/352) in commercial-use licensing between pre-trained LayoutLMv1 and v2/v3/XLM weights.\n", "\n", "This first notebook will focus on preparing and annotating data, before we move on to training, deploying, and integrating models in later notebooks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data prep overview (⚙️ bringing your own data)\n", "\n", "This notebook covers 3 main data preparation steps for training multi-modal document understanding models like these:\n", "\n", "1. Processing raw documents with **Amazon Textract** to extract position-linked text information.\n", "1. **Cleaning** document images ready for human annotation (and, for some models, preparing standard-sized page images).\n", "1. **Iteratively annotating** documents to build a dataset.\n", "\n", "We'll try to keep this process **modular** to make it easy to adapt for your own datasets and iterate as your corpus grows.\n", "\n", "For this purpose, we'll use [JSONLines](https://jsonlines.org/) formatted [augmented manifest files](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html#augmented-manifest-format) as a shared integration point: Starting out with a simple document-level list of raw files, and gradually building up to an annotated, page-level manifest with records something like the below:\n", "\n", "```json\n", "{\"raw-ref\": \"s3://doc-example-bucket/raw/mydoc.pdf\", \"textract-ref\": \"s3://doc-example-bucket/tex/mydoc.pdf.json\", \"page-num\": 2, \"source-ref\": \"s3://doc-example-bucket/img/mydoc-0002.png\", \"labels\": {...}}\n", "{\"raw-ref\": \"s3://doc-example-bucket/raw/mydoc.pdf\", \"textract-ref\": \"s3://doc-example-bucket/tex/mydoc.pdf.json\", \"page-num\": 3, \"source-ref\": \"s3://doc-example-bucket/img/mydoc-0003.png\", \"labels\": {...}}\n", "```\n", "\n", "If your setup deviates at any step (for example, already having Textract results for your documents) or you want to modify the data, then so long as you're able to generate the manifest file output for each step, you should be able to continue the flow. 
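\n", "\n", "Since these manifests are plain JSON Lines files, they're easy to inspect or post-process with a few lines of Python - for example (a minimal sketch; this particular file is only created later in this notebook):\n", "\n", "```python\n", "import json\n", "\n", "with open(\"data/raw-all.manifest.jsonl\") as f:\n", "    records = [json.loads(line) for line in f]\n", "\n", "print(len(records), \"documents\")\n", "print(records[0])  # e.g. {\"raw-ref\": \"s3://.../mydoc.pdf\", ...}\n", "```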
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting started: Dependencies and configuration\n", "\n", "First there are some additional libraries we need to install that aren't present in the SageMaker kernel environments by default:\n", "\n", "- [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser) is a helper for interpreting and navigating Amazon Textract result JSON.\n", "- The [SageMaker Studio Image Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli) is a tool for building customized container images and pushing to [Amazon ECR](https://aws.amazon.com/ecr/).\n", "- [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) (installed by default in SageMaker already) must be version >=2.87 for some of the Hugging Face container versions used in notebook 2." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "!pip install amazon-textract-response-parser \\\n", " sagemaker-studio-image-build \\\n", " \"sagemaker>=2.87,<3\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the requried libraries installed, we're ready to import dependencies and set up some basic configuration including which [Amazon S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html) and folder/prefix data will be uploaded to." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "# Python Built-Ins:\n", "import json\n", "from logging import getLogger\n", "import os\n", "import random\n", "import re\n", "import shutil\n", "import time\n", "from zipfile import ZipFile\n", "\n", "# External Dependencies:\n", "import boto3 # AWS SDK for Python\n", "from IPython import display # To display rich content in notebook\n", "import pandas as pd # For tabular data analysis\n", "import sagemaker # High-level SDK for SageMaker\n", "from tqdm.notebook import tqdm # Progress bars\n", "\n", "# Local Dependencies:\n", "import util\n", "\n", "logger = getLogger()\n", "\n", "bucket_name = sagemaker.Session().default_bucket()\n", "bucket_prefix = \"textract-transformers/\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For some sections below we'll need to reference resources created by the **[AWS CloudFormation](https://aws.amazon.com/cloudformation/) solution stack** you spun up earlier. If you didn't do this step yet, refer to the [README.md](../README.md) in the top level of the repository for instructions.\n", "\n", "The solution stack stores these useful variables in [AWS Systems Manager Parameter Store](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html) and below we use the `util.project` utility module ([util/project.py](util/project.py)) to fetch them. 
This is a transferable pattern you can use to connect from data science notebooks to deployed ML project resources in the cloud by project name/ID.\n", "\n", "▶️ **Check** in the [CloudFormation Console](https://console.aws.amazon.com/cloudformation/home?#/stacks) that the `ProjectId` parameter for your OCR Pipeline Stack matches the default `ocr-transformers-demo` value below: Otherwise change the code below to match.\n", "\n", "> ⚠️ If you get an **AccessDeniedException** (ClientError) below, it's likely your [SageMaker execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) doesn't have the required `ssm:GetParameters` permission to look up the OCR pipeline stack parameters.\n", ">\n", "> To fix this, you can click your execution role in the [IAM Roles Console](https://console.aws.amazon.com/iamv2/home#/roles) and use the **Attach policies** button to attach the `PipelineDataSciencePolicy` created by the stack." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "try:\n", " config = util.project.init(\"ocr-transformers-demo\")\n", " print(config)\n", "except Exception as e:\n", " try:\n", " print(f\"Your SageMaker execution role is: {sagemaker.get_execution_role()}\")\n", " except Exception:\n", " print(\"Couldn't look up your SageMaker execution role\")\n", " raise e" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Upload the raw documents\n", "\n", "For our example, we'll be exploring (a recent quarter's snapshot of) the [Credit card agreements database](https://www.consumerfinance.gov/credit-cards/agreements/) published by the United States' [Consumer Finance Protection Bureau](https://www.consumerfinance.gov/). This dataset includes specimen credit card agreement documents from providers across the US, and is interesting for our purposes because the documents are:\n", "\n", "- **Diverse** in formatting, as various providers present the required information in different ways\n", "- **Representative of commercial** documents - rather than, for example, academic papers which might have quite different tone and structure\n", "- **Complex** in structure, with common data points in theory (e.g. interest rates, fees, etc) - but a lot of nuances and differences between documents in practice.\n", "\n", "The sample dataset (approx. 900MB uncompressed) is published as an archive file (approx. 750MB) which we'll need to extract for the raw PDFs. Since it's a reasonable size, we can perform the extraction here in SageMaker Studio to also have local copies of the raw files to inspect.\n", "\n", "> ▶️ Skip down to the next cell for guidance on bringing custom datasets (or if you've run this extraction already). You may find it useful to run this sample through once with the example dataset first - to understand how it works end-to-end." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%time\n", "os.makedirs(\"data/raw\", exist_ok=True)\n", "\n", "# Fetch the example data:\n", "!wget -O data/CC_Agreements.zip https://files.consumerfinance.gov/a/assets/Credit_Card_Agreements_2020_Q4.zip\n", "\n", "# Extract the file:\n", "print(\"Extracting...\")\n", "shutil.rmtree(\"data/raw\")\n", "with ZipFile(\"data/CC_Agreements.zip\", \"r\") as fzip:\n", " fzip.extractall(\"data/raw\")\n", "\n", "# Clean up unneeded files and remap if the folder became nested:\n", "# (This is written specific to our sample data zip, but is unlikely to break most custom data)\n", "original_root_items = os.listdir(\"data/raw\")\n", "if \"__MACOSX\" in original_root_items:\n", " shutil.rmtree(\"data/raw/__MACOSX\")\n", "if len(original_root_items) < 4:\n", " try:\n", " folder = next(f for f in original_root_items if f.startswith(\"Credit_Card_Agreements\"))\n", " print(f\"De-nesting folder '{folder}'...\")\n", " for sub in os.listdir(f\"data/raw/{folder}\"):\n", " shutil.move(f\"data/raw/{folder}/{sub}\", f\"data/raw/{sub}\")\n", " time.sleep(0.1) # (Saw a FileNotFound error during renames one time in SMStudio)\n", " os.rmdir(f\"data/raw/{folder}\")\n", " except StopIteration:\n", " pass\n", "\n", "print(\"Done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⚙️ For custom datasets, your documents are likely already extracted.\n", "\n", "- They'll need to be uploaded to **Amazon S3 in the same AWS Region** you're currently working in, in a bucket your [SageMaker execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) has access to - in order to use our sample code to run them through Amazon Textract in bulk, prepare page thumbnail images, or annotate the documents.\n", "- They should be under one **shared S3 prefix** - to allow generating all page thumbnail images in a single job, and because downstream processes expect a relative path to uniquely identify a document.\n", "- You may like to copy (all or a small subset of) these docs into SageMaker under the `data/raw` folder - to preview and inspect documents through the process." 
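, "\n", "\n", "For example, to pull a small subset down from S3 to the notebook for inspection (the bucket and prefix below are placeholders for your own locations):\n", "\n", "```\n", "!aws s3 cp --recursive s3://doc-example-bucket/raw-docs/provider-a data/raw/provider-a\n", "```"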
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "os.makedirs(\"data/raw\", exist_ok=True)\n", "\n", "# Configure your root S3 location for raw documents (note no trailing slash):\n", "raw_s3uri = f\"s3://{bucket_name}/{bucket_prefix}data/raw\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# The s3 sync command can upload folders from SageMaker to S3 (or download, swapping the args).\n", "# For the example data, we extracted locally so will upload:\n", "print(f\"Uploading raw PDFs to {raw_s3uri}...\")\n", "!aws s3 sync --quiet data/raw {raw_s3uri}\n", "print(\"Done\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the documents present on S3, we'll need an inventory of **all** documents relevant to the project, and will use a [JSONLines augmented manifest format](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html#augmented-manifest-format) file to store this.\n", "\n", "The cell below builds this by crawling all objects under `raw_s3uri` that look like valid images/documents:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "s3 = boto3.resource(\"s3\")\n", "raw_bucket_name, raw_prefix = util.s3.s3uri_to_bucket_and_key(raw_s3uri)\n", "\n", "valid_file_types = {\"jpeg\", \"jpg\", \"pdf\", \"png\", \"tif\", \"tiff\"}\n", "\n", "n_files = 0\n", "with open(\"data/raw-all.manifest.jsonl\", \"w\") as f:\n", " # sorted() guarantees output order for reproducible sampling later:\n", " for obj in sorted(\n", " s3.Bucket(raw_bucket_name).objects.filter(Prefix=raw_prefix + \"/\"),\n", " key=lambda obj: obj.key,\n", " ):\n", " # Filter out any files you know shouldn't be counted:\n", " file_ext = obj.key.rpartition(\".\")[2].lower()\n", " if \"/.\" in obj.key or file_ext not in valid_file_types:\n", " print(f\"Skipping s3://{obj.bucket_name}/{obj.key}\")\n", " continue\n", "\n", " # Save\n", " item = {\"raw-ref\": f\"s3://{obj.bucket_name}/{obj.key}\"}\n", " f.write(json.dumps(item)+\"\\n\")\n", " n_files += 1\n", "\n", "print(f\"\\nFound {n_files} valid files for OCR\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ▶️ In [data/raw-all.manifest.jsonl](data/raw-all.manifest.jsonl) you should now see an alphabetized list of your source documents as stored on S3: With one record per line like `{\"raw-ref\": \"s3://...\"}`\n", "\n", "**IF** you have (some or all of) the source documents downloaded to SageMaker in the `data/raw` folder, you can explore these through the file browser or even pull them through to display inline here in the notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Read from docs manifest:\n", "with open(\"data/raw-all.manifest.jsonl\") as f:\n", " for ix, line in enumerate(f):\n", " # Select by index (could change this):\n", " if ix == 0:\n", " item = json.loads(line)\n", " # Map S3 URI to local data/raw file path:\n", " filepath = os.path.join(\"data/raw\", item[\"raw-ref\"].replace(raw_s3uri+\"/\", \"\"))\n", " break\n", "\n", "print(f\"Displaying: {filepath}\")\n", "display.IFrame(\n", " filepath,\n", " height=\"600\",\n", " width=\"100%\",\n", ")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Defining the challenge\n", "\n", "We have our source documents, so what will we try to extract about them?\n", "\n", "There are many 
ways position-aware NLP models might be applied to OCR outputs: For example to generate structured summaries, provide translations, answer questions, or just classify the documents.\n", "\n", "A common requirement in document analytics and process automation though, is to extract particular **'fields' of interest**: Known attributes expected to be present in all/most of the documents, which would be interesting to compare between them.\n", "\n", "In this example we'll tackle this as an **entity detection** task **via word classification**:\n", "\n", "- Defining a list of field/entity types of interest\n", "- Classifying each `WORD` in the document to these types, using a Hugging Face [LayoutLMv2ForTokenClassification](https://huggingface.co/docs/transformers/model_doc/layoutlmv2#transformers.LayoutLMv2ForTokenClassification) model or similar\n", "- ...And finally grouping individual words together (via simple rule-based post-processing) to detect the entities/fields\n", "\n", "Some **benefits** of this approach are:\n", "\n", "- Results are traceable all the way back to the detected word blocks from the OCR engine; rather than with a text generation method where the output of the model may not correspond 1:1 with raw text.\n", "- Annotation effort is relatively minimal; since we only need to highlight the documents, rather than typing out custom corrections, answers, etc.\n", "\n", "Some **drawbacks** are:\n", "\n", "- Since it only tags detected words, this model will not be able to *intelligently correct OCR errors* or *standardize form* (e.g. of dates) like a text generation method could learn to.\n", "- Since the ML component only extends to word classification, we're still relying on (usually good, helped by Amazon Textract) rule-based heuristics to group same-type words together to detect multi-word entities.\n", "\n", "Below, we'll define the set of fields/entities to be detected and their configuration aspects. You can run `help(FieldConfiguration)` or refer to [util/postproc/config.py](util/postproc/config.py) for more details of the API.\n", "\n", "> ⚠️ **Warning:** Although you may **edit** the configuration below (to build a custom dataset), you'll no longer be able to use the pre-annotated data sample we provide in `data/annotations` to accelerate model training (unless the classes are still defined in the same order, and labelled in a consistent way with the previous guidelines)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from util.postproc.config import FieldConfiguration\n", "\n", "# For config API details, you can see the docs in the source file or run:\n", "# help(FieldConfiguration)\n", "\n", "fields = [\n", " # (To prevent human error, enter class_id=0 each time and update programmatically below)\n", " FieldConfiguration(0, \"Agreement Effective Date\", optional=True, select=\"first\",\n", " annotation_guidance=(\n", " \"

<p>Avoid labeling extraneous dates which are not necessarily the effective date of \"\n", " \"the document: E.g. copyright dates/years, or other dates mentioned in text.</p>\"\n", " \"<p>Do not include unnecessary qualifiers e.g. 'from 2020/01/01'.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"APR - Introductory\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Use this class (instead of the others) for ANY case where the rate is \"\n", " \"offered for a fixed introductory period - regardless of interest rate subtype e.g. \"\n", " \"balance transfers, purchases, etc.</p>\"\n", " \"<p>Include the term of the introductory period in cases where it's directly listed \"\n", " \"(e.g. '20.00% for the first 6 months'). Try to minimize/exclude extraneous \"\n", " \"information about the offer (e.g. '20.00% for the first 6 months after account \"\n", " \"opening').</p>\"\n", " \"<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"APR - Balance Transfers\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Use for interest rates which are specific to balance transfers.</p>\"\n", " \"<p>Avoid including extraneous information about the terms of balance transfers, or \"\n", " \"using for fixed-term introductory rates.</p>\"\n", " \"<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"APR - Cash Advances\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Use for interest rates which are specific to cash advances.</p>\"\n", " \"<p>Avoid including extraneous information about the terms of cash advances, or using \"\n", " \"for fixed-term introductory rates.</p>\"\n", " \"<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"APR - Purchases\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Use for interest rates which are specific to purchases.</p>\"\n", " \"<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"APR - Penalty\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Use for penalty interest rates applied under certain conditions.</p>\"\n", " \"<p>Exclude information about the conditions under which the penalty \"\n", " \"rate comes into effect: Only include the interest rate which will be applied.</p>\"\n", " \"<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"APR - General\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Use for interest rates which are general and not specifically tied to a \"\n", " \"particular transaction type e.g. purchases / balance transfers.</p>\"\n", " \"<p>Avoid using for fixed-term introductory rates.</p>\"\n", " \"<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"APR - Other\", optional=True, select=\"confidence\",\n", " # TODO: Remove this class\n", " annotation_guidance=(\n", " \"
<p>Use only for interest rates which don't fall into any other category (including \"\n", " \"general or introductory rates). You may not see any examples in the data.</p>\"\n", " \"<p>Avoid using for fixed-term introductory rates.</p>\"\n", " \"<p>'Prime rate + X%' mentions are acceptable and should be labeled.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Fee - Annual\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Include cases where the document explicitly indicates no fee e.g. 'None'</p>\"\n", " \"<p>Avoid any introductory terms e.g. '$0 for the first 6 months' or extraneous \"\n", " \"words: Label only the standard fee.</p>\"\n", " \"<p>Label only the annual amount of the fee, in cases where other breakdowns are \"\n", " \"specified: E.g. '$120', not '$10 per month ($120 per year)'.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Fee - Balance Transfer\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " # TODO: Review\n", " \"
<p>Try to be concise and exclude extra terms where not necessary</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Fee - Late Payment\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Label only the fee, not the circumstances in which it is payable.</p>\"\n", " \"<p>Limits e.g. 'Up to $25' are acceptable (don't just label '$25').</p>\"\n", " \"<p>Do NOT include non-specific mentions of pass-through costs (e.g. 'legal \"\n", " \"costs', 'reasonable expenses', etc.) incurred in the general collections process.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Fee - Returned Payment\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Label only the fee, not the circumstances in which it is payable.</p>\"\n", " \"<p>Limits e.g. 'Up to $25' are acceptable (don't just label '$25').</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Fee - Foreign Transaction\", optional=True, select=\"shortest\",\n", " annotation_guidance=(\n", " \"
<p>Do NOT include explanations of how exchange rates are calculated or \"\n", " \"non-specific indications of margins between rates. DO include specific \"\n", " \"charges/margins with brief clarifying info where listed e.g. '3% of the US \"\n", " \"dollar amount'.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Fee - Other\", ignore=True,\n", " annotation_guidance=(\n", " \"
<p>Common examples include: Minimum interest charge, cash advance fees, and \"\n", " \"overlimit fees.</p>\"\n", " \"<p>Do NOT include fixed-term introductory rates for fees (e.g. '$0 during \"\n", " \"the first year. After the first year...') - only the standard fees</p>\"\n", " \"<p>DO include qualifying information on the amount and limits of the fee, \"\n", " \"e.g. '$5 or 5% of the amount of each transaction, whichever is the greater'.</p>\"\n", " \"<p>Do NOT include general information on the nature of the fee and \"\n", " \"circumstances under which it is applied: E.g. 'Cash advance fee' or 'If the amount \"\n", " \"of interest payable is...'</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Card Name\",\n", " annotation_guidance=(\n", " \"
<p>Label instances of the brand name of specific card(s) offered by the provider \"\n", " \"under the agreement, e.g. 'Rewards Platinum Card'</p>\"\n", " \"<p>Include the ' Card' suffix where available, but also annotate instances without \"\n", " \"it, such as 'Rewards Platinum'</p>\"\n", " \"<p>Avoid including the Provider Name (use the separate class for this) e.g. \"\n", " \"'AnyCompany Rewards Card' unless it's been substantially modified/abbreviated for \"\n", " \"the card name (e.g. 'AnyCo Rewards Card') or the company name is different from the \"\n", " \"Credit card provider (e.g. AnyBank offering a store credit card for AnyCompany)</p>\"\n", " \"<p>Do NOT include fixed-term introductory rates for fees (e.g. '$0 during \"\n", " \"the first year. After the first year...') - only the standard fees</p>\"\n", " \"<p>Avoid labeling generic payment provider names e.g. 'VISA card' or \"\n", " \"'Mastercard', except in contexts where the provider clearly uses them as the brand \"\n", " \"name for the offered card (e.g. 'VISA Card' from 'AnyCompany VISA Card').</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Provider Address\", optional=True, select=\"confidence\",\n", " annotation_guidance=(\n", " \"
<p>Include department or 'attn:' lines where present (but not Provider Name where \"\n", " \"used at the start of an address e.g. 'AnyCompany; 100 Main Street...').</p>\"\n", " \"<p>Include zip/postcode where present.</p>\"\n", " \"<p>Avoid labeling addresses for non-provider entities, such as watchdogs, \"\n", " \"market regulators, or independent agencies.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Provider Name\", select=\"longest\",\n", " annotation_guidance=(\n", " \"
<p>Label the name of the card provider: Including abbreviated mentions.</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Min Payment Calculation\", ignore=True,\n", " annotation_guidance=(\n", " \"
<p>Label clauses describing how the minimum payment is calculated.</p>\"\n", " \"<p>Exclude lead-in e.g. 'The minimum payment is calculated as...' and label directly \"\n", " \"from e.g. 'the minimum of...'.</p>\"\n", " \"<p>Do NOT include clauses from related subjects e.g. how account balance is \"\n", " \"calculated</p>\"\n", " ),\n", " ),\n", " FieldConfiguration(0, \"Local Terms\", ignore=True,\n", " annotation_guidance=(\n", " \"
<p>Label full terms specific to residents of certain states/countries, or applying \"\n", " \"only in particular jurisdictions.</p>\"\n", " \"<p>Include the scope of where the terms apply e.g. 'Residents of GA and \"\n", " \"VA...'</p>\"\n", " \"<p>Include locally-applicable interest rates, instead of annotating these \"\n", " \"with the 'APR - ' classes</p>
\"\n", " ),\n", " )\n", "]\n", "for ix, cfg in enumerate(fields):\n", " cfg.class_id = ix\n", "\n", "# Save the configuration to file:\n", "with open(\"data/field-config.json\", \"w\") as f:\n", " f.write(json.dumps(\n", " [cfg.to_dict() for cfg in fields],\n", " indent=2,\n", " ))\n", "\n", "# And print out a simple list:\n", "entity_classes = [f.name for f in fields]\n", "print(\"\\n\".join(entity_classes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ▶️ Your field configuration should now be stored in [data/field-config.json](data/field-config.json)\n", "\n", "## Data collection\n", "\n", "To efficiently annotate training data for entity extraction on documents, we'll want to work **visually**: Highlighting matches, and perhaps also collecting manual transcription reviews - in case we'd like to extend the model later to support correcting text.\n", "\n", "[Amazon SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) provides an out-of-the-box annotation UI [for bounding boxes](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-bounding-box.html) which will be useful for this: And can also be incorporated within [customized annotation UIs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates.html) via the [crowd-bounding-box](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-crowd-bounding-box.html) element.\n", "\n", "However, at the time of writing, the bounding box annotation tool supports images but not PDFs. Therefore to prepare for data annotation we'll need to:\n", "\n", "- Run our documents through Amazon Textract\n", "- Extract individual page images from the PDFs to use through the annotation UI\n", "- Collate the page images and Textract results together, ready for annotation\n", "\n", "For a significantly sized corpus like this, we'd also benefit from filtering down the data a little - to save time and cost by Textracting and converting only the amount of data we'll need." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Filter a sample of the document set\n", "\n", "To limit down our input corpus, we'll:\n", "\n", "1. Take a random (but reproducible) split of the first N documents\n", "\n", "2. ...Ensuring we *include* all documents from the previously annotated sets (otherwise, we won't be able to use these annotations in model training).\n", "\n", "> ⚙️ If you're using a custom dataset:\n", ">\n", "> - Remember to **delete the `data/annotations/augmentation-*` folders** before running this to avoid errors. 
This is because the `list_preannotated_textract_uris()` function below automatically tries to include any already-annotated documents in the sample.\n", "> - The exclusion rules in `include_filename()` may not be relevant to you, so consider reviewing them.\n", "\n", "This last point (2) is not as straightforward as you might expect, because in addition to crawling multiple jobs' annotation files:\n", "\n", "- The annotation files may contain only the Textract result URI (not the raw document URI - we'll need to map across to that), and\n", "- If some annotation was done on other AWS accounts (like for our sample dataset), the actual bucket name and prefix may not match what you're currently working with.\n", "\n", "So we'll first construct that list of which documents were annotated already, which you can inspect and edit if the logic isn't working for your custom datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Crawl source annotated Textract URIs from the job manifests:\n", "annotated_textract_s3uris = util.ocr.list_preannotated_textract_uris(\n", " ann_jobs_folder=\"data/annotations\",\n", " exclude_job_names=[\"LICENSE\"],\n", ")\n", "\n", "# List raw document URIs in S3\n", "with open(\"data/raw-all.manifest.jsonl\") as f:\n", " raw_doc_s3uris = [json.loads(l)[\"raw-ref\"] for l in f]\n", "\n", "# Define how to check for matches:\n", "def textract_uri_matches_doc_uri(tex_uri, doc_uri) -> bool:\n", " \"\"\"Customize this function if needed for your use case's data layout\"\"\"\n", " # With our sample, Textract URIs will look like:\n", " # some/prefix/data/textracted/subfolders/file.pdf/consolidated.json\n", " # (Due to this sample's Textract batch processing logic - will likely be different if you're\n", " # trying to use previously-Textracted documents).\n", " tex_s3key = tex_uri[len(\"s3://\"):].partition(\"/\")[2]\n", " # With our sample, Raw URIs will look like:\n", " # some/prefix/data/raw/subfolders/file.pdf\n", " doc_s3key = doc_uri[len(\"s3://\"):].partition(\"/\")[2]\n", "\n", " # Given the expectations above:\n", " tex_rel_filepath = tex_s3key.partition(\"data/textracted/\")[2].rpartition(\"/\")[0]\n", " doc_rel_filepath = doc_s3key.partition(\"data/raw/\")[2]\n", " return doc_rel_filepath == tex_rel_filepath\n", "\n", "# Build the list of docs for which some annotations exist (prioritising debug over speed here):\n", "annotated_doc_s3uris = set()\n", "for uri in annotated_textract_s3uris:\n", " matching_doc_s3uris = [\n", " doc_s3uri\n", " for doc_s3uri in raw_doc_s3uris\n", " if textract_uri_matches_doc_uri(uri, doc_s3uri)\n", " ]\n", " n_matches = len(matching_doc_s3uris)\n", " if n_matches == 0:\n", " raise ValueError(\n", " \"Couldn't find matching document in dataset for annotated Textract URI: %s\"\n", " % (uri,)\n", " )\n", " if n_matches > 1:\n", " logger.warning(\n", " \"Textract URI matched %s document URIs: Matching criterion may be too loose.\\n%s\\n%s\",\n", " n_matches,\n", " uri,\n", " matching_doc_s3uris,\n", " )\n", " annotated_doc_s3uris.update(matching_doc_s3uris)\n", "\n", "# This sorted list of required document S3 URIs is the main result you need to get to here:\n", "annotated_doc_s3uris = sorted(annotated_doc_s3uris)\n", "print(f\"Found {len(annotated_doc_s3uris)} docs with pre-existing annotations\")\n", "print(\"For example:\")\n", "print(\"\\n\".join(annotated_doc_s3uris[:5] + [\"...\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the list of required 
(already annotated) documents prepared, we're ready to randomly sample a corpus that includes them.\n", "\n", "The example Credit Card Agreements dataset includes some **non-English documents**. By default we exclude these (in case you can't read Spanish well enough to annotate them!) ...But you could choose to include them: As the multi-lingual [LayoutXLM model](https://huggingface.co/docs/transformers/model_doc/layoutxlm) has been tested with this sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "N_DOCS_KEPT = 120\n", "SKIP_SPANISH_DOCS = True\n", "\n", "\n", "def include_filename(name: str) -> bool:\n", " \"\"\"Filter out likely Spanish/non-English docs (if SKIP_SPANISH_DOCS enabled)\"\"\"\n", " if not name:\n", " return False\n", " if not SKIP_SPANISH_DOCS:\n", " return True\n", " name_l = name.lower()\n", " if \"spanish\" in name_l:\n", " return False\n", " if re.search(r\"espa[nñ]ol\", name_l):\n", " return False\n", " if \"tarjeta\" in name_l or re.search(r\"cr[eé]dito\", name_l):\n", " return False\n", " if re.search(r\"[\\[\\(]esp?[\\]\\)]\", name_l):\n", " return False\n", " return True\n", "\n", "\n", "if N_DOCS_KEPT < len(annotated_doc_s3uris):\n", " raise ValueError(\n", " \"Existing annotations cannot be used for model training unless the target documents are \"\n", " \"Textracted. To proceed with fewer docs than have already been annotated, you'll need to \"\n", " \"`exclude_job_names` per the 'data/annotations' folder (e.g. ['augmentation-1']) AND \"\n", " \"remember to not include them in notebook 2 (model training). Alternatively, increase \"\n", " f\"your N_DOCS_KEPT. (Got {N_DOCS_KEPT} vs {len(annotated_doc_s3uris)} prev annotations).\"\n", " )\n", "\n", "with open(\"data/raw-all.manifest.jsonl\") as f:\n", " # First apply filtering rules:\n", " sampled_docs = [\n", " doc for doc in (json.loads(line) for line in f)\n", " if include_filename(doc[\"raw-ref\"])\n", " ]\n", "\n", "# Forcibly including the pre-annotated docs *after* the shuffling ensures that the order of\n", "# sampling new docs is independent of what/how many have been pre-annotated:\n", "required_docs = [d for d in sampled_docs if d[\"raw-ref\"] in annotated_doc_s3uris]\n", "random.Random(1337).shuffle(sampled_docs)\n", "new_docs = [d for d in sampled_docs if d[\"raw-ref\"] not in annotated_doc_s3uris]\n", "sampled_docs = sorted(\n", " required_docs + new_docs[:N_DOCS_KEPT - len(required_docs)],\n", " key=lambda doc: doc[\"raw-ref\"],\n", ")\n", "\n", "# Write the selected set to file:\n", "with open(\"data/raw-sample.manifest.jsonl\", \"w\") as f:\n", " for d in sampled_docs:\n", " f.write(json.dumps(d) + \"\\n\")\n", "\n", "print(f\"Extracted random sample of {len(sampled_docs)} docs\")\n", "sampled_docs[:5] + [\"...\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ▶️ In [data/raw-sample.manifest.jsonl](data/raw-sample.manifest.jsonl) you should now have an alphabetized list of the `N_DOCS_KEPT` randomly selected documents, which should include any documents referenced in existing annotations under `data/annotations`." 
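, "\n", "\n", "As an optional sanity check (a minimal sketch re-using variables defined above), you can confirm the sampled manifest really does cover every pre-annotated document:\n", "\n", "```python\n", "with open(\"data/raw-sample.manifest.jsonl\") as f:\n", "    sample_uris = {json.loads(line)[\"raw-ref\"] for line in f}\n", "\n", "print(f\"{len(sample_uris)} sampled documents (target {N_DOCS_KEPT})\")\n", "missing = [uri for uri in annotated_doc_s3uris if uri not in sample_uris]\n", "assert not missing, f\"{len(missing)} pre-annotated docs missing from the sample\"\n", "```"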
] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### OCR the input documents\n", "\n", "> ⚠️ **Note:** Refer to the [Amazon Textract Pricing Page](https://aws.amazon.com/textract/pricing/) for up-to-date guidance before running large extraction jobs.\n", ">\n", "> At the time of writing, the projected cost (in `us-east-1`, ignoring free tier allowances) of analyzing 100 documents with 10 pages on average was approximately \\\\$67 with `TABLES` and `FORMS` enabled, or \\\\$2 without. Across the full corpus, we measured the average number of pages per document at approximately 6.7.\n", "\n", "With (a subset of) the raw documents selected, the next ingredient is to link them with Amazon Textract-compatible OCR results in a new manifest - with entries something like:\n", "\n", "```json\n", "{\"raw-ref\": \"s3://doc-example-bucket/folder/mydoc.pdf\", \"textract-ref\": \"s3://doc-example-bucket/folder/mydoc-textracted.json\"}\n", "```" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "---\n", "⚙️ If you're experimenting with a custom dataset, you might **already have Amazon Textract results** for your documents:\n", "\n", "- These must be uploaded to **Amazon S3 in the same AWS Region** you're working in, in a location this SageMaker notebook's execution role has access to.\n", "- They'll need to be under **one shared S3 prefix** for model training later (and training may be more performant/cheaper if that prefix doesn't have too many *other* files under it)\n", "- They must be in an **Amazon Textract output-compatible format**, as the model training script later will assume this.\n", "- Ideally, the results should be **consolidated to a single JSON per raw document** - which is done automatically by the pipeline in this sample, but may not be the case if you used the the Amazon Textract [StartDocumentAnalysis API](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentAnalysis.html#Textract-StartDocumentAnalysis-request-OutputConfig) with a direct `OutputConfig`.\n", " - The reason for this is that we'll construct a *page-level* manifest later, and need to be confident we're linking each page image to the correct corresponding page result from Textract JSON.\n", " - If your Textract JSON is split into multiple objects per input document and you'd like to avoid re-processing the corpus, you could explore consolidating the JSONs to document level via simple Python. 
Check that when loaded with [`trp`](https://github.com/aws-samples/amazon-textract-response-parser/blob/master/src-python/README.md), the \"pages\" in your final file correspond correctly to the source document pages.\n", "\n", "The exact correspondence from input document URI to existing OCR result JSON URI will depend on your environment, but here's some example code to show how you might crawl some pre-existing results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "#### EXAMPLE mapping for pre-Textracted documents with existence check:\n", "\n", "# Given that raw docs live under some S3 prefix:\n", "raw_s3uri_prefix = raw_s3uri\n", "\n", "# ...And Amazon Textract results live under another:\n", "textract_s3uri = f\"s3://{bucket_name}/{bucket_prefix}data/textracted\"\n", "\n", "\n", "# ...And you can define a mapping from one to the other:\n", "def doc_uri_to_textract_uri(doc_uri: str) -> str:\n", " if not doc_uri.startswith(raw_s3uri_prefix):\n", " raise ValueError(\n", " \"Document S3 URI '%s' did not start with expected prefix: '%s'\"\n", " % (doc_uri, raw_s3uri_prefix)\n", " )\n", " # Replace raw prefix with Textract prefix, and add \"/consolidated.json\" to filename:\n", " return textract_s3uri + doc_uri[len(raw_s3uri_prefix):] + \"/consolidated.json\"\n", "\n", "# Then build up the combined manifest, checking existence for each result:\n", "\n", "# print(\"Building manifest...\")\n", "# with open(\"data/raw-sample.manifest.jsonl\") as fin:\n", "# with open(\"data/textracted-all.manifest.jsonl\", \"w\") as fout:\n", "# for doc in (json.loads(line) for line in fin):\n", "# textract_uri = doc_uri_to_textract_uri(doc[\"raw-ref\"])\n", "# if not util.s3.s3_object_exists(textract_uri):\n", "# raise ValueError(\n", "# \"Mapped Textract result URI does not exist in S3.\\nFor: %s\\nGot: %s\"\n", "# % (doc[\"raw-ref\"], textract_uri)\n", "# )\n", "# doc[\"textract-ref\"] = textract_uri\n", "# fout.write(json.dumps(doc) + \"\\n\")\n", "# print(\"Done!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "**For documents not yet run through OCR**:\n", "\n", "> ⚙️ The code below uses Amazon Textract to extract document text, which is the most accurate and feature-rich choice for languages the Amazon Textract service supports. If you're working with documents in **unsupported languages**, check out the *\"Experimenting with alternative OCR engines\"* instructions in [Optional Extras.ipynb](Optional%20Extras.ipynb) for an open source-based option. \n", "\n", "We need to be mindful of the service [quotas](https://docs.aws.amazon.com/general/latest/gr/textract.html#limits_textract) when processing large batches of documents with Amazon Textract, to avoid excessive rate limiting and retries. Since an OCR pipeline solution stack is already set up for this sample, you can use just the *Amazon Textract portion of the pipeline* to process the documents in bulk.\n", "\n", "> ⏰ This process took about 5 minutes to run against the 120-document sample set in our tests." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "textract_results = util.ocr.call_textract(\n", " textract_sfn_arn=config.plain_textract_sfn_arn,\n", " # Can instead use raw-all.manifest.jsonl to process whole dataset (see cost note above):\n", " input_manifest=\"data/raw-sample.manifest.jsonl\",\n", " manifest_raw_field=\"raw-ref\",\n", " manifest_out_field=\"textract-ref\",\n", " # Map subpaths of {input_base} to subpaths of {output_base}:\n", " output_base_s3uri=textract_s3uri,\n", " input_base_s3uri=raw_s3uri,\n", " # You can turn on e.g. \"TABLES\" and \"FORMS\" features here, but note this could have a\n", " # significant impact on Textract API costs:\n", " features=[],\n", " skip_existing=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ⚠️ **If you see errors in the output above:**\n", ">\n", "> - Try re-running the cell - Rate limiting can sometimes cause intermittent failures, and the function will skip successfully processed files in repeat runs.\n", "> - Persistent errors (on custom datasets) could be due to malformed files (remove them from the manifest) or very large files (see the [/CUSTOMIZATION_GUIDE.md](../CUSTOMIZATION_GUIDE.md) for tips on re-configuring your pipeline to handle very large documents).\n", "\n", "Once the extraction is done, write (only successful items) to a manifest file:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "n_success = 0\n", "n_fail = 0\n", "with open(\"data/textracted-all.manifest.jsonl\", \"w\") as fout:\n", " for ix, item in enumerate(textract_results):\n", " if isinstance(item[\"textract-ref\"], str):\n", " fout.write(json.dumps(item) + \"\\n\")\n", " n_success += 1\n", " else:\n", " if n_fail == 0:\n", " logger.error(\"First failure at index %s:\\n%s\", ix, item[\"textract-ref\"])\n", " n_fail += 1\n", "\n", "print(f\"{n_success} of {n_success + n_fail} docs processed successfully\")\n", "if n_fail > 0:\n", " raise ValueError(\n", " \"Are you sure you want to continue? Consider re-trying to process the failed docs\"\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "> ▶️ Regardless of whether you processed documents here, or indexed previously-Textracted results, you should now have a [data/textracted-all.manifest.jsonl](data/textracted-all.manifest.jsonl) JSON-Lines manifest file mapping source documents `raw-ref` to Amazon Textract result JSONs `textract-ref`: Both as `s3://...` URIs." 
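, "\n", "\n", "If you'd like to spot-check one of the results, the Amazon Textract Response Parser installed earlier can navigate the JSON for you. Below is a minimal sketch, assuming the consolidated files follow the standard Textract response format (as this sample's pipeline produces):\n", "\n", "```python\n", "from trp import Document\n", "\n", "with open(\"data/textracted-all.manifest.jsonl\") as f:\n", "    first = json.loads(next(f))\n", "\n", "tex_bucket, tex_key = util.s3.s3uri_to_bucket_and_key(first[\"textract-ref\"])\n", "doc = Document(json.loads(s3.Object(tex_bucket, tex_key).get()[\"Body\"].read()))\n", "\n", "print(f\"{len(doc.pages)} pages in {first['textract-ref']}\")\n", "# (Assumes page 1 contains at least one text LINE)\n", "print(\"First line of page 1:\", doc.pages[0].lines[0].text)\n", "```"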
] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Extract clean input images (batch)\n", "\n", "To annotate our documents with SageMaker Ground Truth image task UIs, we need **individual page images**, stripped of EXIF rotation metadata (because, at the time of writing, SMGT ignores this rotation for annotation consistency) and converted to compatible formats (since some formats like TIFF are not supported by most browsers).\n", "\n", "For large corpora, this process of splitting PDFs and rotating and converting images may require significant resources - but is easy to parallelize.\n", "\n", "Therefore instead of pre-processing the raw documents here in the notebook, this is a good use case for a scalable [SageMaker Processing Job](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html).\n", "\n", "The job will require a **custom container image**, since the PDF reading tools we use aren't installed by default in pre-built SageMaker containers and aren't `pip install`able. However, the image has already been built and deployed to [Amazon Elastic Container Registry (ECR)](https://aws.amazon.com/ecr/) by the CDK stack (see `preproc_image` in [/pipeline/\\_\\_init\\_\\_.py](../pipeline/__init__.py)). All we need to do here is look it up from the stack parameters:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.processing import FrameworkProcessor\n", "\n", "ecr_image_uri = config.preproc_image_uri\n", "print(f\"Using pre-built custom container image:\\n{ecr_image_uri}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to **define the inputs and output locations** for the processing job." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", "# Output S3 locations:\n", "imgs_s3uri = f\"s3://{bucket_name}/{bucket_prefix}data/imgs-clean\"\n", "thumbs_s3uri = f\"s3://{bucket_name}/{bucket_prefix}data/thumbnails\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Two options are shown below:\n", "\n", "- **Option 1:** (Commented-out) To process the whole `data/raw` corpus on S3, you could simply pass the entire prefix as input to the processing job. For large corpora, you'll probably want to scale up the job's compute resources below to complete the work quickly.\n", "- **Option 2:** (Sample default) To process just the sampled subset of files, we'll create an **input manifest file** listing the documents we want." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "#### OPTION 1: For processing the whole raw_s3uri prefix:\n", "\n", "# preproc_inputs = [\n", "# ProcessingInput(\n", "# destination=\"/opt/ml/processing/input/raw\", # Expected input location, per our script\n", "# input_name=\"raw\",\n", "# s3_data_distribution_type=\"ShardedByS3Key\", # Distribute between instances, if multiple\n", "# source=raw_s3uri, # S3 prefix for full raw document collection\n", "# ),\n", "# ]\n", "# print(\"Selected whole corpus\")\n", "#### END OPTION 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Note:** The 'Non-augmented' manifest files used below for job data loading are still JSON-based, but a different format from the JSON-**Lines** manifests we use in most other places of this sample. 
You can find guidance on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for manifests as used here, and separate information in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html) on the \"augmented\" JSON-Lines manifests used elsewhere." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "#### OPTION 2: For processing the sampled subset of raw docs only:\n", "\n", "# Note your source documents must still be under one common S3 prefix, and the folder structure\n", "# under this will be mapped through to the outputs:\n", "raw_s3uri_prefix = raw_s3uri # (*No* trailing slash)\n", "\n", "# Load the list of docs from file and add final filters:\n", "with open(\"data/raw-sample.manifest.jsonl\") as fin:\n", " doc_relpaths = []\n", " for item in (json.loads(line) for line in fin):\n", " if item[\"raw-ref\"].startswith(raw_s3uri_prefix + \"/\"):\n", " doc_relpaths.append(item[\"raw-ref\"][len(raw_s3uri_prefix) + 1:])\n", " else:\n", " raise ValueError(\n", " \"All document S3 URIs for processing job must start with shared S3 prefix %s:\\n\"\n", " \"%s\" % (raw_s3uri_prefix, item[\"raw-ref\"])\n", " )\n", "\n", "# Prepare a true JSON (*NON-JSONLINES*) manifest file for SageMaker Processing:\n", "os.makedirs(\"data/manifests\", exist_ok=True)\n", "preproc_input_manifest_path = \"data/manifests/raw-dataclean-input.manifest.json\"\n", "with open(preproc_input_manifest_path, \"w\") as fout:\n", " fout.write(json.dumps(\n", " [{ \"prefix\": raw_s3uri_prefix + \"/\" }]\n", " + doc_relpaths\n", " ))\n", "\n", "# Upload the manifest to S3:\n", "preproc_input_manifest_s3uri = f\"s3://{bucket_name}/{bucket_prefix}{preproc_input_manifest_path}\"\n", "!aws s3 cp {preproc_input_manifest_path} {preproc_input_manifest_s3uri}\n", "\n", "# Set the processing job inputs to reference the manifest:\n", "preproc_inputs = [\n", " ProcessingInput(\n", " destination=\"/opt/ml/processing/input/raw\", # Expected input location, per our script\n", " input_name=\"raw\",\n", " s3_data_distribution_type=\"ShardedByS3Key\", # Distribute between instances, if multiple\n", " s3_data_type=\"ManifestFile\",\n", " source=preproc_input_manifest_s3uri, # Manifest of sample raw documents\n", " ),\n", "]\n", "print(\"Selected sample subset of documents\")\n", "#### END OPTION 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The script we'll be using to process the documents is in the same folder as the Dockerfile used earlier to build the container image: [preproc/inference.py](preproc/inference.py).\n", "\n", "The code parallelizes processing across available CPUs, and the `ShardedByS3Key` setting used on our [ProcessingInput](https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.ProcessingInput) above distributes documents between instances if multiple are provided. This means you should be able to adjust both `instance_type` and `instance_count` below, and still take advantage of the resources configured. The process is typically CPU-bound, so the `ml.c*` families are likely a good fit.\n", "\n", "The cell below will **run the processing job** and show logs from the job as it progresses. 
You can also check up on the status and history of jobs in the [Processing page of the Amazon SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/processing-jobs).\n", "\n", "> ⏰ **Note:** In our tests, it took (including job start-up overheads):\n", ">\n", "> - About 10 minutes to process the 120-document sample with 2x `ml.c5.xlarge` instances\n", "> - About 17 minutes to process the full 2,541-document credit cards corpus with 5x `ml.c5.4xlarge` instances." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "%%time\n", "\n", "processor = FrameworkProcessor(\n", " # Because the custom image already exists, we can just use generic Framework class rather than\n", " # configuring PyTorchProcessor and worrying about whether PyTorch/Python versions match:\n", " estimator_cls=util.preproc.DummyFramework,\n", " image_uri=ecr_image_uri,\n", " framework_version=\"\",\n", "\n", " base_job_name=\"ocr-img-dataclean\",\n", " role=sagemaker.get_execution_role(),\n", " instance_count=2,\n", " instance_type=\"ml.c5.xlarge\",\n", " volume_size_in_gb=15,\n", " max_runtime_in_seconds=60*60,\n", ")\n", "\n", "processor.run(\n", " code=\"preproc.py\", # PDF splitting / image conversion script\n", " source_dir=\"preproc\",\n", " inputs=preproc_inputs[:], # Either whole corpus or sample, as above\n", " outputs=[\n", " ProcessingOutput(\n", " destination=imgs_s3uri,\n", " output_name=\"imgs-clean\",\n", " s3_upload_mode=\"Continuous\",\n", " source=\"/opt/ml/processing/output/imgs-clean\", # Output folder, per our script\n", " ),\n", " ProcessingOutput(\n", " destination=thumbs_s3uri,\n", " output_name=\"thumbnails\",\n", " s3_upload_mode=\"Continuous\",\n", " source=\"/opt/ml/processing/output/thumbnails\", # Resized output for visual models\n", " ),\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the images have been extracted, we'll also **optionally** download them locally to the notebook for use in visualizations later:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "print(f\"Downloading cleaned images from {imgs_s3uri}...\")\n", "!aws s3 sync --quiet {imgs_s3uri} data/imgs-clean\n", "print(\"Done\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll see that this job also generates uniformly resized \"thumbnail\" images per page when the second (optional) `thumbnails` output is specified. These aren't important for the human annotation process, but will be needed later assuming you train a model like [LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2#transformers.LayoutLMv2Model) which accepts page image inputs.\n", "\n", "The older [LayoutLM (v1)](https://huggingface.co/docs/transformers/model_doc/layoutlm#transformers.LayoutLMModel) model doesn't use these." 
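, "\n", "\n", "If you'd like to eyeball one of the cleaned page images you just downloaded, you could open it with Pillow - a minimal sketch (assuming `PIL` is available in your kernel; the exact file names depend on your documents):\n", "\n", "```python\n", "from PIL import Image\n", "\n", "img_paths = sorted(\n", "    os.path.join(root, name)\n", "    for root, _dirs, files in os.walk(\"data/imgs-clean\")\n", "    for name in files\n", "    if name.lower().endswith(\".png\")\n", ")\n", "print(f\"{len(img_paths)} page images downloaded\")\n", "Image.open(img_paths[0])\n", "```"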
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# You could also download the thumbnail images to see what they look like - but we won't need them:\n", "#!aws s3 sync --quiet {thumbs_s3uri} data/imgs-thumb" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collate OCR and image data for annotation\n", "\n", "Now we have a filtered corpus of documents with Amazon Textract results, plus cleaned and standardized images for each page - all available on Amazon S3.\n", "\n", "To prepare for data annotation and later model training, we'll need to collate these together with a **page-level manifest** in JSON-lines format, with records something like:\n", "\n", "```json\n", "{\"source-ref\": \"s3://doc-example-bucket/img-prefix/folder/filename-0001-01.png\", \"textract-ref\": \"s3://doc-example-bucket/tex-prefix/folder/filename.pdf/consolidated.json\", \"page-num\": 1}\n", "```\n", "\n", "Key features of the format are:\n", "- The `source-ref` is the path to a full-resolution cleaned page image (**not** a thumbnail), **but** model training in the next notebook will assume the equivalent thumbnail path is identical, except for some different s3://... bucket & prefix.\n", "- The `page-num` is one-based (always >= 1), and for model training must match the image to the appropriate page number **in the linked Textract JSON file**.\n", " - For example if you have thumbnail `filename-0001-15.png` for page 15 of some long document, but for some reason your `textract-ref` JSON file contains *only* detections from page 15 of the document, you would set `\"page-num\": 1`.\n", "- Mapping through the `raw-ref` here is nice to have, but optional, as the model training won't refer to the original document.\n", "\n", "> ⚙️ **For custom data setups:** The sample code in this section should work fine if you've used this sample to Textract and clean images for your documents, but if you have a customized setup, you might need to adjust the logic that maps between document, Textract JSON, and page image S3 URIs, to suit your environment.\n", "\n", "The key goal is to create a page-level catalogue that we're confident is correct, and for that reason the example function below will actually **validate that the artifacts are present on S3** in the expected locations.\n", "\n", "> ⏰ Because of these validation checks, the cell below may a minute or two to run against our 120-document sample set." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "warnings = util.preproc.collate_data_manifest(\n", " # Output file:\n", " \"data/pages-all-sample.manifest.jsonl\",\n", " # Input manifest:\n", " input_manifest=\"data/textracted-all.manifest.jsonl\",\n", " # s3://... base URI used to try and map 'textract-ref's to cleaned images:\n", " textract_s3_prefix=textract_s3uri,\n", " # The s3://... base URI under which page images are stored:\n", " imgs_s3_prefix=imgs_s3uri,\n", " # Optional s3://... base URI also used to try and map 'raw-ref's to images if present:\n", " raw_s3_prefix=raw_s3uri,\n", " # Other output manifest settings:\n", " by=\"page\",\n", " no_content=\"omit\",\n", ")\n", "\n", "if len(warnings):\n", " raise ValueError(\n", " \"Manifest usable but incomplete - %s docs failed. 
Please see `warnings` for details\"\n", " % len(warnings)\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ▶️ Whether you were able to use the above code as-is or had to modify it, you should now have a page-level catalogue linking `source-ref`, `textract-ref`, `page-num` in [data/pages-all-sample.manifest.jsonl](data/pages-all-sample.manifest.jsonl)\n", "\n", "Let's briefly explore the catalogue we've created.\n", "\n", "Each line of the file is a JSON record identifying a particular page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "with open(\"data/pages-all-sample.manifest.jsonl\", \"r\") as f:\n", " for ix, line in enumerate(f):\n", " print(line, end=\"\")\n", " if ix >= 2:\n", " print(\"...\")\n", " break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The credit cards corpus has a very skewed distribution of number of pages per document, with a few outliers dragging up the average significantly. In our tests on corpus-wide statistics:\n", "\n", "- The overall average was **~6.7 pages per document**\n", "- The 25th percentile was 3 pages; the 50th percentile was 6 pages; and the 75th percentile was 11 pages\n", "- The longest document was 402 pages\n", "\n", "Your results for sub-sampled sets will likely vary a little - but can be analyzed as below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "with open(\"data/pages-all-sample.manifest.jsonl\", \"r\") as f:\n", " manifest_df = pd.DataFrame([json.loads(line) for line in f])\n", "page_counts_by_doc = manifest_df.groupby(\"textract-ref\")[\"textract-ref\"].count()\n", "\n", "print(\"Document page count statistics\")\n", "page_counts_by_doc.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Annotation infrastructure\n", "\n", "To create a labelling job in Amazon SageMaker Ground Truth, we'll need to specify\n", "\n", "- **Who's** doing the labelling - which could be your own internal teams, the public crowd via Amazon Mechanical Turk, or skilled workers supplied by vendors through the AWS Marketplace\n", "- **What** the task will look like - which could be using the [built-in task UIs](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html) or [custom workflows](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates.html).\n", "- **Where** the input data sourced from and the results will be saved to (locations on Amazon S3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a private workteam\n", "\n", "For this demo, you'll set up a private work \"team\" for just yourself to test out the annotation process.\n", "\n", "▶️ **Open** the [Amazon SageMaker Ground Truth console, *Labeling Workforces* page](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-workforces)\n", "\n", "> ⚠️ **Check** SM Ground Truth opens in the same **AWS Region** where this notebook and your CloudFormation stack are deployed: You may find it defaults to `N. Virginia`. Use the drop-down in the top right of the screen to switch regions.\n", "\n", "▶️ **Select** the *Private* tab and click **Create private team**\n", "\n", "- Choose an appropriate **name** for your team e.g. 
`just-me`\n", "- (If you get the option) select to **Invite new workers via email** and enter your email address (you'll need access to this address to log in and annotate the data)\n", "- And leave the other (Cognito, SNS, etc) parameters as default.\n", "\n", "▶️ **If you didn't get the option** to add workers during team creation (typically because your account is already set up for SageMaker Ground Truth), then after the team is created you can:\n", "\n", "- Click **Invite new workers** to add your email address to the workforce, and then\n", "- Click on your **team name** to open the team details, then navigate to the *Workers tab* to add yourself to the team\n", "\n", "▶️ **Copy** the *name* of your workteam and paste it into the cell below, to store it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "workteam_name = \"just-me\" # TODO: Update this to match yours, if different\n", "\n", "workteam_arn = util.smgt.workteam_arn_from_name(workteam_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally:\n", "\n", "▶️ **Check your email** for an invitation and log in to the labelling portal. You'll be asked to configure a password on first login.\n", "\n", "\n", "Your completed setup should look something like this in the AWS Console:\n", "\n", "![](img/smgt-private-workforce.png \"Screenshot of SageMaker Ground Truth private workforces configuration\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set up the custom task template\n", "\n", "This sample provides 2 options for data annotation:\n", "\n", "1. Use the **built-in [Bounding Box tool](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-bounding-box.html)**\n", "2. Use the provided **custom task template** which collects transcription reviews as well as bounding boxes\n", "\n", "We recommend **experimenting with the custom template** here in the notebook, to get a better understanding of how the model will \"see\" and use your annotations (and how you might extend this sample for your own use cases).\n", "\n", "However, you'll probably want to **use the built-in boxes tool** for the bulk of your annotating work because:\n", "\n", "- The ML model we present (in the next notebook) only supports tagging and cannot be directly trained on the text corrections you collect in the custom template\n", "- ...And reviewing the text transcription takes extra time & effort\n", "\n", "As detailed [in the developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-custom-templates.html), custom Ground Truth UIs are HTML [Liquid templates](https://shopify.github.io/liquid/basics/introduction/). You can use the [Crowd HTML Elements](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-ui-template-reference.html) to embed standard components, but also include custom HTML/CSS/JS as needed. 
A set of examples is provided in the [amazon-sagemaker-ground-truth-task-uis repository on GitHub](https://github.com/aws-samples/amazon-sagemaker-ground-truth-task-uis).\n", "\n", "Since spinning up a labelling job each time to test and debug a custom template would slow down development, SageMaker provides a [RenderUiTemplate API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_RenderUiTemplate.html) for previewing the worker experience.\n", "\n", "First, we'll populate the master template `*.liquid.tpl.html` with the entity/field types we configured earlier (and some other automated content) to produce the final SageMaker Ground Truth template `*.liquid.html`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "\n", "with open(\"annotation/ocr-bbox-and-validation.liquid.tpl.html\", \"r\") as ftpl:\n", " with open(\"annotation/ocr-bbox-and-validation.liquid.html\", \"w\") as fout:\n", " template = BeautifulSoup(ftpl.read())\n", "\n", " annotator_el = template.find(id=\"annotator\")\n", " annotator_el[\"header\"] = \"Highlight entities and review their OCR results.\"\n", " annotator_el[\"labels\"] = json.dumps(entity_classes)\n", "\n", " if any(f.annotation_guidance for f in fields):\n", " full_instructions_el = template.find(\"full-instructions\")\n", " full_instructions_el.append(\n", " BeautifulSoup(\n", " \"\\n\".join(\n", " [\"
<h3>Per-Field Guidance</h3>\"]\n", " + [\n", " f\"<h4>{f.name}</h4>
\\n{f.annotation_guidance}\"\n", " for f in fields if f.annotation_guidance\n", " ]\n", " )\n", " )\n", " )\n", "\n", " fout.write(template.prettify())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be able to serve our example images through the UI, SageMaker Ground Truth requires the target S3 bucket to be set up [with CORS permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-cors-update.html) (which is not the same as making the bucket or its contents public).\n", "\n", "The cell below will ensure these permissions are set on the bucket configured earlier by `bucket_name`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "util.smgt.ensure_bucket_cors(bucket_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we're ready to render the preview of what this task would look like with an actual record from the data manifest:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "role = sagemaker.get_execution_role()\n", "smclient = boto3.client(\"sagemaker\")\n", "\n", "# Fetch an example record from the manifest:\n", "ix_example = 0\n", "with open(\"data/pages-all-sample.manifest.jsonl\", \"r\") as fmanifest:\n", " sample_task_str = None\n", " for ix, line in enumerate(fmanifest):\n", " if ix == ix_example:\n", " sample_task_str = line\n", " break\n", "\n", "# Render the template with the example record:\n", "ui_render_file = \"annotation/render.tmp.html\"\n", "with open(\"annotation/ocr-bbox-and-validation.liquid.html\", \"r\") as fui:\n", " with open(ui_render_file, \"w\") as frender:\n", " ui_render_resp = smclient.render_ui_template(\n", " UiTemplate={\"Content\": fui.read()},\n", " Task={\"Input\": sample_task_str},\n", " RoleArn=role,\n", " )\n", " frender.write(ui_render_resp[\"RenderedContent\"])\n", "\n", "print(f\"▶️ Open {ui_render_file} and click 'Trust HTML' to see the UI in action!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Opening [annotation/render.tmp.html](annotation/render.tmp.html) and clicking **Trust HTML** in the toolbar, you should see a view something similar to the below:\n", "\n", "> ℹ️ **Note:** In this task template, you need to click the \"Instructions\" button to expand the transcription review pane on the left!\n", "\n", "![](img/smgt-custom-template-demo.png \"Screenshot of custom annotation UI\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that:\n", "\n", "- When you draw a bounding box on the page image, a new OCR result is populated in the left sidebar prompting you to review (and if necessary, correct) Textract's transcription of the text in that region.\n", "- Overlapping bounding boxes of the same type are combined, allowing you to highlight non-square regions of text.\n", "- Transcription review fields are mandatory: The template should not let you submit the result until all transcriptions have been reviewed in the sidebar.\n", "\n", "You should aim to follow the same overlapping conventions when annotating the sample data, even with the built-in task type. Under the hood, the ML model code applies similar logic to map your bounding box annotations to the Textract detected `WORD`s and `LINE`s.\n", "\n", "Since reviewing transcriptions makes labelling take longer, you'll probably want to stick to the built-in task template instead of this custom one, unless you have a use case for collecting the text data. 
The [Exploring sequence-to-sequence models section](Optional%20Extras.ipynb#Exploring-sequence-to-sequence-models) of the **Optional Extras** notebook discusses training generative models to actually normalize and correct OCR transcriptions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Annotate data\n", "\n", "> ⏰ **If you're short on time**: You can skip the remaining steps in this notebook altogether.\n", ">\n", "> We've provided pre-prepared annotations for 100 pages in the `data/annotations` folder, to augment your work and help train an effective model faster. If you need, you can skip along to the next notebook and select **only** the `augmentation-*` datasets instead of labeling your own too. If you choose to do this, your model will likely be less accurate.\n", "\n", "We're now ready to start annotating data, and will typically **iterate over multiple jobs** in this step to start small and then boost model accuracy.\n", "\n", "To make incrementally adding to the dataset easy, we'll need to pay particular attention to:\n", "\n", "- How we sample data for jobs, with good randomness but no repetition of previously-annotated pages\n", "- How we collect our results to a single consolidated dataset\n", "\n", "So let's follow through the steps:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collect a dataset\n", "\n", "Here we will:\n", "\n", "- **Shuffle** our data (in a *reproducible*/deterministic way), to ensure we annotate documents/pages from a range of providers - not just concentrating on the first provider/doc(s)\n", "- **Exclude** any examples for which the page image has **already been labeled** in the `data/annotations` output folder\n", "- **Stratify** the sample, to obtain a specific (boosted) proportion of first-page samples, since we observed the first pages of documents to often be most useful for the fields of interest in the sample credit cards use case. (Many documents use the first page for a fact-sheet/summary, followed by subsequent pages of dense legal terms).\n", "\n", "> ⚙️ If you're using a custom dataset:\n", ">\n", "> `select_examples()` below implements a strategy of randomly selecting pages with a fixed, boosted proportion of \"page 1s\". This might not be appropriate for your use case: For example if you already have a specific planned set of target pages for an extremely small-scale PoC, or if first-page samples are no more relevant than other pages for your documents. Consider revising `select_examples()` to adjust how you select pages for annotation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# This cell just defines the necessary functions & constants:\n", "\n", "# Keep this the same across the jobs:\n", "annotations_base_s3uri = f\"s3://{bucket_name}/{bucket_prefix}data/annotations\"\n", "\n", "\n", "def rel_path_from_s3uri(uri, key_base=\"data/imgs-clean/\") -> str:\n", " \"\"\"Extract e.g. 
'subfolders/file' from 's3://bucket/.../{key_base}subfolders/file'\"\"\"\n", " return uri[len(\"s3://\"):].partition(\"/\")[2].partition(key_base)[2]\n", "\n", "\n", "def get_preannotated_imgs(exclude_job_names=[]) -> set:\n", " \"\"\"Find the set of relative image paths that have already been annotated\"\"\"\n", " filepaths = set() # Protect against introducing duplicates\n", " for job_folder in os.listdir(\"data/annotations\"):\n", " if job_folder in exclude_job_names:\n", " logger.info(f\"Skipping excluded job {job_folder}\")\n", " continue\n", " manifest_file = os.path.join(\n", " \"data\",\n", " \"annotations\",\n", " job_folder,\n", " \"manifests\",\n", " \"output\",\n", " \"output.manifest\",\n", " )\n", " if not os.path.isfile(manifest_file):\n", " if os.path.isdir(os.path.join(\"data\", \"annotations\", job_folder)):\n", " logger.warning(f\"Skipping job {job_folder}: No output manifest at {manifest_file}\")\n", " continue\n", " with open(manifest_file, \"r\") as f:\n", " filepaths.update([\n", " rel_path_from_s3uri(json.loads(line)[\"source-ref\"])\n", " for line in f\n", " ])\n", " return filepaths\n", "\n", "\n", "def select_examples(\n", " job_page_count,\n", " exclude_img_paths=set(),\n", " job_first_page_pct=0.4,\n", "):\n", " with open(\"data/pages-all-sample.manifest.jsonl\", \"r\") as fmanifest:\n", " examples_all = [json.loads(line) for line in fmanifest]\n", "\n", " # Separate and shuffle the first vs non-first pages:\n", " examples_all_arefirsts = [item[\"page-num\"] == 1 for item in examples_all]\n", "\n", " examples_firsts = [e for ix, e in enumerate(examples_all) if examples_all_arefirsts[ix]]\n", " examples_nonfirsts = [e for ix, e in enumerate(examples_all) if not examples_all_arefirsts[ix]]\n", " random.Random(1337).shuffle(examples_firsts)\n", " random.Random(1337).shuffle(examples_nonfirsts)\n", "\n", " # Exclude already-annotated images:\n", " filtered_firsts = [\n", " e for e in examples_firsts\n", " if rel_path_from_s3uri(e[\"source-ref\"]) not in exclude_img_paths\n", " ]\n", " filtered_nonfirsts = [\n", " e for e in examples_nonfirsts\n", " if rel_path_from_s3uri(e[\"source-ref\"]) not in exclude_img_paths\n", " ]\n", " print(\n", " \"Excluded %s first and %s non-first pages\"\n", " % (\n", " len(examples_firsts) - len(filtered_firsts),\n", " len(examples_nonfirsts) - len(filtered_nonfirsts),\n", " )\n", " )\n", "\n", " # Draw from the filtered shuffled lists:\n", " n_first_pages = round(job_first_page_pct * job_page_count)\n", " n_nonfirst_pages = job_page_count - n_first_pages\n", " if n_first_pages > len(filtered_firsts):\n", " raise ValueError(\n", " \"Unable to find enough first-page records to build manifest: Wanted \"\n", " \"%s, but only %s available from list after exclusions (%s before)\"\n", " % (n_first_pages, len(filtered_firsts), len(examples_firsts))\n", " )\n", " if n_nonfirst_pages > len(filtered_nonfirsts):\n", " raise ValueError(\n", " \"Unable to find enough non-first-page records to build manifest: Wanted \"\n", " \"%s, but only %s available from list after exclusions (%s before)\"\n", " % (n_nonfirst_pages, len(filtered_nonfirsts), len(examples_nonfirsts))\n", " )\n", " print(f\"Taking {n_first_pages} first pages and {n_nonfirst_pages} non-first pages.\")\n", " selected = filtered_firsts[:n_first_pages] + filtered_nonfirsts[:n_nonfirst_pages]\n", " random.Random(1337).shuffle(selected) # Shuffle again to avoid putting all 1stP at front\n", " return selected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this 
incremental sampling procedure set up, to actually generate a new job input manifest, you just need to specify:\n", "\n", "- A unique name for the job\n", "- The number of examples (pages) you'll annotate\n", "- The ratio of first-pages to non-first pages (e.g. 0.4 -> 40% of examples will be the first page of a document)\n", "\n", "Run the cells below to create a new labelling job input manifest and then upload it to Amazon S3.\n", "\n", "> ⚠️ **Warning:** If you've just completed an annotation job below, make sure you've `s3 sync`ed results back to the `data/annotations` folder - otherwise you'll set up a new job for the same pages again!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "annotation_job_name = \"cfpb-boxes-1\" # What will this job be called?\n", "job_page_count = 20 # How many pages will we annotate?\n", "job_first_page_pct = .4 # What proportion of pages should be first pages of a doc?\n", "\n", "preannotated_img_paths = get_preannotated_imgs()\n", "input_manifest_file = f\"data/manifests/{annotation_job_name}.jsonl\"\n", "os.makedirs(\"data/manifests\", exist_ok=True)\n", "print(f\"'{annotation_job_name}' saving to: {input_manifest_file}\")\n", "with open(input_manifest_file, \"w\") as f:\n", " for ix, example in enumerate(select_examples(\n", " job_page_count,\n", " exclude_img_paths=preannotated_img_paths,\n", " job_first_page_pct=job_first_page_pct,\n", " )):\n", " if ix < 3:\n", " print(example)\n", " elif ix == 3:\n", " print(\"...\")\n", " f.write(json.dumps(example) + \"\\n\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "input_manifest_s3uri = f\"s3://{bucket_name}/{bucket_prefix}{input_manifest_file}\"\n", "!aws s3 cp $input_manifest_file $input_manifest_s3uri" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> ▶️ Above we created another manifest file matching the `source-ref`, `textract-ref`, `page-num` structure from before... But **only** listing the subset of pages to be annotated in this particular labelling job.\n", "\n", "### Create the labelling job\n", "\n", "To minimize the risk of errors and get started quickly, you're recommended to create your labeling job by running the utility function provided below.\n", "\n", "This will set up a job with the **default pre-built bounding box** template by default, for faster labelling. If you want to collect OCR transcription reviews as well (for example, to measure OCR accuracy on your fields of interest or train text normalizing models), you can un-comment and fill in the extra optional arguments to use the custom template we saw earlier." 
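, "\n", "Once the job has been created by the next cell, you can also track its progress programmatically (as well as through the console). For example, a minimal sketch using the standard boto3 `DescribeLabelingJob` API:\n", "\n", "```python\n", "import boto3\n", "\n", "# Sketch: check on the labeling job after the next cell has created it\n", "smclient = boto3.client(\"sagemaker\")\n", "desc = smclient.describe_labeling_job(LabelingJobName=annotation_job_name)\n", "print(desc[\"LabelingJobStatus\"])  # e.g. 'InProgress' or 'Completed'\n", "print(desc[\"LabelCounters\"])  # How many dataset objects have been labeled so far\n", "```"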
] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "print(f\"Starting labeling job {annotation_job_name}\\non data {input_manifest_s3uri}\\n\")\n", "create_labeling_job_resp = util.smgt.create_bbox_labeling_job(\n", " annotation_job_name,\n", " bucket_name=bucket_name,\n", " execution_role_arn=role,\n", " fields=fields,\n", " input_manifest_s3uri=input_manifest_s3uri,\n", " output_s3uri=annotations_base_s3uri,\n", " workteam_arn=workteam_arn,\n", " s3_inputs_prefix=f\"{bucket_prefix}data/manifests\",\n", "\n", " # To create a review/adjustment job from a manifest with existing labels in:\n", " # reviewing_attribute_name=\"label\",\n", "\n", " # To use the custom task template (adding transcription review):\n", " # task_template=\"annotation/ocr-bbox-and-validation.liquid.html\",\n", " # pre_lambda_arn=\"arn:aws...{AnnotationInfra->PreLabelingFunction Lambda from your CFn stack}\",\n", " # post_lambda_arn=\"arn:aws...{AnnotationInfra->PostLabelingFunction Lambda from your CFn stack}\",\n", ")\n", "print(f\"\\nLABELLING JOB STARTED:\\n{create_labeling_job_resp['LabelingJobArn']}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, you can also explore creating the job through the [AWS Console for SageMaker Ground Truth](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-jobs) (check your AWS Region!) by clicking on *Create labeling job*:\n", "\n", "- Leave (as default) the *label attribute name* the same as the *job name*\n", "- Select **Manual data setup** and use:\n", " - The `input_manifest_s3uri` (`s3://[...].jsonl`) from above for the input location\n", " - The `annotations_base_s3uri` (`s3://[...]/data/annotations`) with **no trailing slash** for the output location\n", "- Select or create any **SageMaker IAM execution role** that has access to the `bucket_name` we're using.\n", "- For **task type**, select *Image > Bounding Box* (for the default bounding-box UI) or *Custom* (for the custom UI with OCR transcription reviews)\n", "- On the second screen, be sure to use **worker type** *Private* and select the workteam we made earlier from the dropdown.\n", "- For the built-in task type, you'll need to enter the **labels** manually exactly in the order that we defined them in this notebook.\n", "- For the custom task type:\n", " - Copy the contents of the `*.liquid.html` file above into the *Template body* section (**NOT** the `*.tpl.liquid.html`, which has placeholders e.g. 
for the list of classes)\n", " - Select the `PreLabelingFunction` and `PostLabelingFunction` Lambda functions created by the Pipeline CDK stack (you can find these via the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home?#/stacks))\n", "\n", "The cell below prints out some of these values to help:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "print(input_manifest_s3uri)\n", "print(annotations_base_s3uri)\n", "print(role)\n", "print(\"\\n\".join([\"\\nLabels:\", \"-------\"] + entity_classes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Label the data!\n", "\n", "Now that the labeling job has been created, you'll see a new task for your user in the labeling portal (If you lost the portal link from your email, you can access it from the *Private* tab of the [SageMaker Ground Truth Workforces console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-workforces)).\n", "\n", "> ⏰ SageMaker Ground Truth processes the job data in batches, so it might take a minute or two for the job to appear in your list.\n", ">\n", "> If it's taking a long time, you can:\n", ">\n", "> - Double-check the job in the [Labeling jobs page of the Console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-jobs) to see if it's failed to start due to some error\n", "> - Check the job is set up for a workteam that you're a member of\n", "> - Check your user is showing as *Verified* and *Enabled* (i.e. that you completed the email verification successfully) in the *Private* tab of the [Workforces console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-workforces)\n", "\n", "▶️ Click **Start working** and annotate the examples until the all are finished and you're returned to the portal homepage.\n", "\n", "▶️ **Try to be as consistent as possible** in how you annotate the classes, because inconsistent annotations can significantly degrade final model accuracy. 
Refer to the guidance (in this notebook and the 'Full Instructions') that we applied when annotating the example set.\n", "\n", "![](img/smgt-task-pending.png \"Screenshot of SMGT labeling portal with pending task\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sync the results locally (and iterate?)\n", "\n", "Once you've finished annotating and the job shows as \"Complete\" in the [SMGT Console](https://console.aws.amazon.com/sagemaker/groundtruth?#/labeling-jobs) (which **might take an extra minute or two**, while your annotations are consolidated), you can download the results here to the notebook via the cell below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 sync --quiet $annotations_base_s3uri ./data/annotations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see a subfolder created with the name of your annotation job, under which the **`manifests/output/output.manifest`** file contains the consolidated results of your labelling - again in the open JSON-Lines format.\n", "\n", "▶️ **Check** your results appear as expected, and explore the file format.\n", "\n", "> Because label outputs are in JSON-Lines, it's easy to consolidate, transform, and manipulate these results as required using open source tools!\n", "\n", "If you like, you can expand your dataset with **additional labelling jobs** by repeating these steps from [Collect a dataset](#Collect-a-dataset) down to here.\n", "\n", "> ⚠️ Take care to set a different `annotation_job_name` each time, as these must be unique." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps\n", "\n", "In this notebook we set up the modelling objective, collected the project dataset, and annotated (perhaps multiple) sets of training data.\n", "\n", "In the next, we'll consolidate these output manifests (together with some pre-prepared example data) and actually train/deploy our ML model.\n", "\n", "So you can now open up **notebook [2. Model Training.ipynb](2.%20Model%20Training.ipynb)**, and follow along!" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }