{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Finetuning a custom question answering model\n", "The [BERT](https://arxiv.org/abs/1810.04805) family of models are a powerful set of natural language understanding models based on the transformer architecture originating from the paper Attention Is All You Need, which you can find [here](https://arxiv.org/abs/1706.03762). \n", "\n", "These models work by running unsupervised pre-training on massive sets of text data. This process requires an enormous amount of time and compute. Luckily for us, BERT models are built for transfer learning. BERT models are able to be finetuned to perform many different NLU tasks like question answering, sentiment analysis, document summarization, and more.\n", "\n", "For this tutorial, we are going to download the [Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/) and walk through the steps necessary to augment it with our own questions using [SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/), use [Huggingface](https://huggingface.co/) to finetune a BERT variant for question answering, deploy our finetuned model to a Sagemaker endpoint, and then visualize some results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pip install transformers\n", "!pip install datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import collections\n", "import math\n", "import random\n", "import torch\n", "import os, tarfile, json\n", "import time, datetime\n", "from io import StringIO\n", "import numpy as np\n", "import boto3\n", "import sagemaker\n", "from tqdm import tqdm\n", "from IPython.display import Markdown as md\n", "from sagemaker.pytorch import estimator, PyTorchModel, PyTorchPredictor, PyTorch\n", "from sagemaker.utils import name_from_base\n", "from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline, TrainingArguments, Trainer, default_data_collator\n", "from sagemaker.huggingface import HuggingFace, HuggingFaceModel\n", "import datasets\n", "from datasets import load_dataset, load_metric, Dataset, Features\n", "\n", "# Configuration file with IAM role ARN, bucket name, and pre/post processing lambda ARN,\n", "# generated by notebook lifecycle script.\n", "CONFIG_FILE_PATH = \"/home/ec2-user/SageMaker/hf-gt-custom-qa.json\"\n", "\n", "sagemaker_cl = boto3.client(\"sagemaker\")\n", "sagemaker_session = sagemaker.Session()\n", "s3 = boto3.resource(\"s3\")\n", "region = sagemaker_session.boto_region_name\n", "role = sagemaker.get_execution_role()\n", "prefix = 'hf_squad'\n", "runtime_client = boto3.client('runtime.sagemaker')\n", "\n", "with open(CONFIG_FILE_PATH) as f:\n", " config = json.load(f)\n", "\n", "role_arn = config[\"SageMakerRoleArn\"]\n", "bucket = config['LabelingJobInputBucket']\n", "pre_annot_lambda_arn = config[\"PreLabelTaskLambdaArn\"]\n", "post_annot_lambda_arn = config[\"PostLabelTaskLambdaArn\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "You will create some of the resources you need to launch a Ground Truth labeling job in this notebook. You must create the following resources before executing this notebook:\n", "\n", "* A work team. A work team is a group of workers that complete labeling tasks. 
If you want to preview the worker UI and execute the labeling task you will need to create a private work team, add yourself as a worker to this team, and provide the work team ARN below. This [GIF](images/create-workteam-loop.gif) demonstrates how to quickly create a private work team on the Amazon SageMaker console. If you do not want to use a private or vendor work team ARN, set `private_work_team` to `False` to use the Amazon Mechanical Turk workforce. To learn more about private, vendor, and Amazon Mechanical Turk workforces, see [Create and Manage Workforces\n", "](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-workforce-management.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "WORKTEAM_ARN = '<>'\n", "\n", "print(f'This notebook will use the work team ARN: {WORKTEAM_ARN}')\n", "\n", "# Make sure workteam arn is populated if private work team is chosen\n", "assert (WORKTEAM_ARN != '<>')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download and inspect the data\n", "\n", "Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.\n", "\n", "To get an idea of what SQuAD contains, let's download it locally and take a look.\n", "\n", "SQuAD was created by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. You can find the original paper [here](https://arxiv.org/abs/1606.05250) and the dataset [here](https://rajpurkar.github.io/SQuAD-explorer/). SQuAD has been licensed by the authors under the [Creative Commons Attribution-ShareAlike 4.0 International Public License](https://creativecommons.org/licenses/by-sa/4.0/legalcode)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%sh \n", "\n", "mkdir data\n", " \n", "v2=\"data/v2.0\"\n", "mkdir $v2\n", "wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O $v2/train-v2.0.json\n", "wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O $v2/dev-v2.0.json" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load the v2.0 dev set\n", "\n", "with open('data/v2.0/dev-v2.0.json', 'r') as f:\n", " squad_dev = json.load(f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've loaded some of the data, you can use the following block to look at a random context, question, and answer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# view a random Q&A pair\n", "ind = random.randint(0,34)\n", "sq = squad_dev['data'][ind]\n", "print('Paragraph title: ',sq['title'], '\\n')\n", "print(sq['paragraphs'][0]['context'],'\\n')\n", "print('Question:', sq['paragraphs'][0]['qas'][0]['question'])\n", "print('Answer:', sq['paragraphs'][0]['qas'][0]['answers'][0]['text'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Model\n", "\n", "Now that we've viewed some of the different question and answer pairs in SQuAD, let's download a model that we can finetune for question answering. Huggingface allows us to easily download a base model that has undergone large scale pre-training and re-initialize it for a different downstream task. 
In this case we are taking the distilbert-base-uncased model and repurposing it for question answering using the AutoModelForQuestionAnswering class from Huggingface. We also utilize the AutoTokenizer class to retrieve the model's pre-trained tokenizer. Since the models are PyTorch based, we can inspect the individual modules in the model. After downloading the model and tokenizer, the following block prints out the question answering output layer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# base model to load\n", "model_name = \"distilbert-base-uncased\"\n", "\n", "# Load model & tokenizer\n", "model = AutoModelForQuestionAnswering.from_pretrained(model_name)\n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "# set model to evaluation mode\n", "evl = model.eval()\n", "print(' \\nOutput layer:',list(model.named_modules())[-2])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# view model layers\n", "model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## View BERT Input\n", "\n", "BERT needs us to transform our text data into a numeric representation known as tokens. There are a variety of tokenizers available; here we use the tokenizer that was pre-trained alongside our BERT variant, which we loaded above with the AutoTokenizer class. Let's take a look at the transformed question and context we will be supplying to BERT for inference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "QA_input = {\n", " 'question': sq['paragraphs'][0]['qas'][0]['question'],\n", " 'context': sq['paragraphs'][0]['context']\n", "}\n", "inputs = tokenizer(QA_input['question'], QA_input['context'], return_tensors='pt')\n", "input_ids = inputs[\"input_ids\"].tolist()[0]\n", "start_positions = torch.tensor([1])\n", "end_positions = torch.tensor([3])\n", "inputs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Inference\n", "\n", "Let's try getting some predictions from our question answering model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)\n", "\n", "answer_start_scores = outputs.start_logits\n", "answer_end_scores = outputs.end_logits\n", "answer_start = torch.argmax(\n", " answer_start_scores\n", ") # Get the most likely beginning of answer with the argmax of the score\n", "answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score\n", "# convert answer tokens back to English\n", "answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))\n", "print(f\"Question: {QA_input['question']}\")\n", "print(f\"Answer: {answer}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Augment SQuAD\n", "\n", "It looks like we didn't get any output from the model for our previous prediction. This is because the weights for the question answering head have not been tuned. In order to tune them we need to train the model on our question answering dataset, but what if we want the model to perform particularly well on our domain-specific questions? While it might be able to generalize by simply training on SQuAD, we can augment our training or validation set with our own questions. 
\n", "\n", "We can use the following code to augment SQuAD by launching a labeling job using a custom question answering template through SageMaker Ground Truth." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Uploading annotation UI to {bucket}\")\n", "annotation_ui_s3_uri = f\"s3://{bucket}/qa_annotator.liquid.html\"\n", "!aws s3 cp qa_annotator.liquid.html {annotation_ui_s3_uri}\n", "\n", "squad_s3_uri = f\"s3://{bucket}/data/sample_squad.json\"\n", "!aws s3 cp sample_squad.json {squad_s3_uri}\n", "\n", "# create our manifest\n", "sample_manifest_lines = [{\"source\": squad_s3_uri}]\n", "sample_manifest = \"\\n\".join([json.dumps(data_object) for data_object in sample_manifest_lines])\n", "\n", "# send manifest to s3\n", "manifest_name = \"input.manifest\"\n", "s3.Object(bucket, manifest_name).put(Body=sample_manifest)\n", "input_manifest_uri = f\"s3://{bucket}/{manifest_name}\"\n", "\n", "# launch our labeling job\n", "time_seconds = int(time.time())\n", "labeling_job_name = f\"squad-{time_seconds}\"\n", "labeling_job_request = {\n", " \"LabelingJobName\": labeling_job_name,\n", " \"HumanTaskConfig\": {\n", " \"AnnotationConsolidationConfig\": {\n", " \"AnnotationConsolidationLambdaArn\": post_annot_lambda_arn,\n", " },\n", " \"MaxConcurrentTaskCount\": 1000,\n", " \"NumberOfHumanWorkersPerDataObject\": 1,\n", " \"PreHumanTaskLambdaArn\": pre_annot_lambda_arn,\n", " \"TaskAvailabilityLifetimeInSeconds\": 864000,\n", " \"TaskDescription\": labeling_job_name,\n", " \"TaskTimeLimitInSeconds\": 28800,\n", " \"TaskTitle\": labeling_job_name,\n", " \"UiConfig\": {\n", " \"UiTemplateS3Uri\": annotation_ui_s3_uri,\n", " },\n", " \"WorkteamArn\": WORKTEAM_ARN,\n", " },\n", " \"InputConfig\": {\n", " \"DataAttributes\": {\n", " \"ContentClassifiers\": [\n", " \"FreeOfPersonallyIdentifiableInformation\",\n", " \"FreeOfAdultContent\",\n", " ]\n", " },\n", " \"DataSource\": {\"S3DataSource\": {\"ManifestS3Uri\": input_manifest_uri}},\n", " },\n", " \"LabelAttributeName\": labeling_job_name,\n", " \"OutputConfig\": {\n", " \"S3OutputPath\": f\"s3://{bucket}/smgt-output\",\n", " },\n", " \"StoppingConditions\": {\"MaxPercentageOfInputDatasetLabeled\": 100},\n", " \"RoleArn\": role_arn,\n", "}\n", "response = sagemaker_cl.create_labeling_job(\n", " **labeling_job_request\n", ")\n", "\n", "response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Access Labeling Portal\n", "\n", "To access the labeling portal run the following cell and click on the link:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domain = sagemaker_cl.list_workteams()['Workteams'][0]['SubDomain']\n", "md(f\"[labeling worker portal](https://{domain})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check Labeling Job Status\n", "\n", "Once our job has been completed, it may take a few minutes to process and for Ground Truth to put the output in S3. This block will check for the job completion status." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labeling_job_description = sagemaker_cl.describe_labeling_job(LabelingJobName=labeling_job_name)\n", "status = labeling_job_description[\"LabelingJobStatus\"]\n", "print(f\"Job Status: {status}\")\n", "assert status == \"Completed\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Labeled Data\n", "\n", "Once our job is completed, the output data is in S3, so we can download it and load it into our notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "output_manifest = labeling_job_description[\"LabelingJobOutput\"][\"OutputDatasetS3Uri\"]\n", "output_manifest_lines = s3.Object(*output_manifest[len(\"s3://\"):].split(\"/\", 1)).get()[\"Body\"].read().decode(\"utf-8\").splitlines()\n", "output_manifest_lines = [json.loads(line) for line in output_manifest_lines]\n", "\n", "custom_squad_labels_s3_uri = output_manifest_lines[0][labeling_job_name][\"s3Uri\"]\n", "sample_squad_labels = json.loads(sagemaker.s3.S3Downloader.read_file(custom_squad_labels_s3_uri))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "sample_squad_labels['data'][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load SQuAD Train Set\n", "\n", "Let's load SQuAD and add our own labeled examples. SQuAD formats the data as a list of topics. Each topic has its own set of context statements, each of which has a variety of question and answer pairs. Our sample labels are formatted as their own topic, so we can simply append them to the list of topics already in SQuAD." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "with open('data/v2.0/train-v2.0.json', 'r') as f:\n", " actual_squad = json.load(f)\n", " \n", "print('Original length', len(actual_squad['data']))\n", "\n", "# add our dataset to squad\n", "actual_squad['data'].extend(sample_squad_labels['data'])\n", "actual_squad['data'][-1]\n", "print('New length', len(actual_squad['data']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can save our augmented version of SQuAD." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "with open('data/augmented_squad.json', 'w') as f:\n", " json.dump(actual_squad, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create a HF Dataset Object\n", "\n", "Now that we've combined our data, we can transform it into a Huggingface dataset object. There are several ways to do this. We can use the [load_dataset](https://huggingface.co/docs/datasets/loading_datasets.html) option, in which case we supply a CSV, JSON, or text file that is then loaded as a dataset object. You can supply load_dataset a processing script to convert your file into the desired format. For demonstration purposes we are instead going to use the Dataset.from_dict() method, which allows us to create a dataset object from an in-memory dictionary. \n", "\n", "Run the blocks that follow to create our dataset dictionaries."
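 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick illustration, Dataset.from_dict turns a plain in-memory dictionary of columns into a dataset object. The toy rows below are made up purely for demonstration; the real conversion of our augmented SQuAD data follows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy illustration of Dataset.from_dict (hypothetical rows, not taken from SQuAD)\n", "toy_dataset = Dataset.from_dict({\n", "    \"question\": [\"What does SQuAD stand for?\"],\n", "    \"context\": [\"The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset.\"],\n", "})\n", "print(toy_dataset)\n", "print(toy_dataset[0])"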
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_squad_dict(actual_squad):\n", " titles = []\n", " contexts = []\n", " ids = []\n", " questions = []\n", " answers = []\n", " for example in tqdm(actual_squad[\"data\"]):\n", " title = example.get(\"title\", \"\").strip()\n", " for paragraph in example[\"paragraphs\"]:\n", " context = paragraph[\"context\"].strip()\n", " for qa in paragraph[\"qas\"]:\n", " question = qa[\"question\"].strip()\n", " id_ = qa[\"id\"]\n", "\n", " answer_starts = [answer[\"answer_start\"] for answer in qa[\"answers\"]]\n", " answer_list = [answer[\"text\"].strip() for answer in qa[\"answers\"]]\n", " titles.append(title)\n", " contexts.append(context)\n", " questions.append(question)\n", " ids.append(id_)\n", " answers.append({\n", " \"answer_start\": answer_starts,\n", " \"text\": answer_list,\n", " })\n", "\n", " dataset_dict = {\n", " \"answers\":answers,\n", " \"context\":contexts,\n", " \"id\":ids,\n", " \"question\":questions,\n", " \"title\":titles,\n", " }\n", " return dataset_dict\n", "\n", "dataset_dict = create_squad_dict(actual_squad)\n", "test_dataset_dict = create_squad_dict(squad_dev)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need to define our dataset features. In our case our features are:\n", "\n", "* id - the unique ID of the question and answer pair\n", "* title - the associated title for the topic\n", "* context - the context statement the model must search to find an answer\n", "* question - the question the model is being asked\n", "* answers - the accepted answer text and its location in the context statement\n", "\n", "HF datasets allow us to define this schema using the Features argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "squad_dataset = Dataset.from_dict(dataset_dict,\n", " features=datasets.Features(\n", " {\n", " \"id\": datasets.Value(\"string\"),\n", " \"title\": datasets.Value(\"string\"),\n", " \"context\": datasets.Value(\"string\"),\n", " \"question\": datasets.Value(\"string\"),\n", " \"answers\": datasets.features.Sequence(\n", " {\n", " \"text\": datasets.Value(\"string\"),\n", " \"answer_start\": datasets.Value(\"int32\"),\n", " }\n", " ),\n", " # These are the features of your dataset like images, labels ...\n", " }\n", " ))\n", "squad_test = Dataset.from_dict(test_dataset_dict,\n", " features=datasets.Features(\n", " {\n", " \"id\": datasets.Value(\"string\"),\n", " \"title\": datasets.Value(\"string\"),\n", " \"context\": datasets.Value(\"string\"),\n", " \"question\": datasets.Value(\"string\"),\n", " \"answers\": datasets.features.Sequence(\n", " {\n", " \"text\": datasets.Value(\"string\"),\n", " \"answer_start\": datasets.Value(\"int32\"),\n", " }\n", " ),\n", " }\n", " ))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have our Dataset object created, we then have to tokenize the text. Since models can’t accept raw text as an input, we need to convert our text into a numeric input that it can understand." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def prepare_train_features(examples, tokenizer, max_length, doc_stride):\n", " # Tokenize our examples with truncation and padding, but keep the overflows using a stride. 
This results\n", " # in one example possible giving several features when a context is long, each of those features having a\n", " # context that overlaps a bit the context of the previous feature.\n", " pad_on_right = tokenizer.padding_side == \"right\"\n", "\n", " tokenized_examples = tokenizer(\n", " examples[\"question\" if pad_on_right else \"context\"],\n", " examples[\"context\" if pad_on_right else \"question\"],\n", " truncation=\"only_second\" if pad_on_right else \"only_first\",\n", " max_length=max_length,\n", " stride=doc_stride,\n", " return_overflowing_tokens=True,\n", " return_offsets_mapping=True,\n", " padding=\"max_length\",\n", " )\n", "\n", " # Since one example might give us several features if it has a long context, we need a map from a feature to\n", " # its corresponding example. This key gives us just that.\n", " sample_mapping = tokenized_examples.pop(\"overflow_to_sample_mapping\")\n", " # The offset mappings will give us a map from token to character position in the original context. This will\n", " # help us compute the start_positions and end_positions.\n", " offset_mapping = tokenized_examples.pop(\"offset_mapping\")\n", "\n", " # Let's label those examples!\n", " tokenized_examples[\"start_positions\"] = []\n", " tokenized_examples[\"end_positions\"] = []\n", "\n", " for i, offsets in enumerate(offset_mapping):\n", " # We will label impossible answers with the index of the CLS token.\n", " input_ids = tokenized_examples[\"input_ids\"][i]\n", " cls_index = input_ids.index(tokenizer.cls_token_id)\n", "\n", " # Grab the sequence corresponding to that example (to know what is the context and what is the question).\n", " sequence_ids = tokenized_examples.sequence_ids(i)\n", "\n", " # One example can give several spans, this is the index of the example containing this span of text.\n", " sample_index = sample_mapping[i]\n", " answers = examples[\"answers\"][sample_index]\n", " # If no answers are given, set the cls_index as answer.\n", " if len(answers[\"answer_start\"]) == 0:\n", " tokenized_examples[\"start_positions\"].append(cls_index)\n", " tokenized_examples[\"end_positions\"].append(cls_index)\n", " else:\n", " # Start/end character index of the answer in the text.\n", " start_char = answers[\"answer_start\"][0]\n", " end_char = start_char + len(answers[\"text\"][0])\n", "\n", " # Start token index of the current span in the text.\n", " token_start_index = 0\n", " while sequence_ids[token_start_index] != (1 if pad_on_right else 0):\n", " token_start_index += 1\n", "\n", " # End token index of the current span in the text.\n", " token_end_index = len(input_ids) - 1\n", " while sequence_ids[token_end_index] != (1 if pad_on_right else 0):\n", " token_end_index -= 1\n", "\n", " # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).\n", " if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):\n", " tokenized_examples[\"start_positions\"].append(cls_index)\n", " tokenized_examples[\"end_positions\"].append(cls_index)\n", " else:\n", " # Otherwise move the token_start_index and token_end_index to the two ends of the answer.\n", " # Note: we could go after the last offset if the answer is the last word (edge case).\n", " while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:\n", " token_start_index += 1\n", " tokenized_examples[\"start_positions\"].append(token_start_index - 1)\n", " while offsets[token_end_index][1] >= end_char:\n", " 
token_end_index -= 1\n", " tokenized_examples[\"end_positions\"].append(token_end_index + 1)\n", "\n", " return tokenized_examples\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local Training\n", "\n", "Let’s start by understanding how training a model in Huggingface works locally, then go over the adjustments we make to run it in SageMaker. \n", "Huggingface makes training easy through the use of their trainer class. The trainer class allows us to pass in our model, our train and validation datasets, our hyperparameters, and even our tokenizer. Since we already have our model as well as our training and validation sets, we only need to define our hyperparameters. This can be done through the TrainingArguments class. This allows us to specify things like the learning rate, batch size, number of epochs, and more in depth parameters like weight decay or a learning rate scheduling strategy. Once we’ve defined our TrainingArguments, we can pass in our model, our training set, validation set and arguments to instantiate our trainer class. Once this is all ready we can simply call trainer.train() to start training our model. \n", "\n", "To run training locally, you can uncomment out the below block, but this will take a very long time if you are not running on a GPU instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for running local training, this will likely take a very long time if not running on a GPU instance.\n", "\n", "# doc_stride=128\n", "# max_length=512\n", "# tokenized_train = squad_dataset.map(prepare_train_features, batched=True, remove_columns=squad_dataset.column_names, fn_kwargs = {'tokenizer':tokenizer, 'max_length':max_length, 'doc_stride':doc_stride})\n", "# tokenized_test = squad_test.map(prepare_train_features, batched=True, remove_columns=squad_test.column_names, fn_kwargs = {'tokenizer':tokenizer, 'max_length':max_length, 'doc_stride':doc_stride})\n", "\n", "# hf_args = TrainingArguments(\n", "# 'test_local',\n", "# evaluation_strategy = \"epoch\",\n", "# learning_rate=5e-5,\n", "# per_device_train_batch_size=16,\n", "# per_device_eval_batch_size=16,\n", "# num_train_epochs=1,\n", "# weight_decay=0.0001,\n", "# )\n", "\n", "# trainer = Trainer(\n", "# model,\n", "# hf_args,\n", "# train_dataset=tokenized_train,\n", "# eval_dataset=tokenized_test,\n", "# data_collator=default_data_collator,\n", "# tokenizer=tokenizer,\n", "# )\n", "\n", "# trainer.train()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Send Data to S3\n", "\n", "To run the same training job in SageMaker is straightforward. The first step is putting our data in S3 so that our model can access it. SageMaker training allows you to specify a data source, you can use sources like S3, EFS, or FSx for Lustre for high performance data ingestion. In our case, our augmented SQuAD dataset isn’t particularly large, so S3 is a good choice. We upload our training data to a folder in S3 and when SageMaker spins up our training instance, it will download the data from our specified S3 location. 
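An equivalent upload can also be done from Python with the SageMaker SDK. The commented-out cell below is only a sketch of that alternative (it assumes the local data folder created earlier and uses the Session.upload_data helper); the CLI-based upload we actually rely on comes right after it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch: upload the local 'data' folder with the SageMaker SDK instead of the AWS CLI.\n", "# s3_train_path = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)\n", "# print(s3_train_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next two cells copy the data to S3 with the AWS CLI and define the training input channel.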
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 cp --recursive data s3://{bucket}/{prefix}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3train = f's3://{bucket}/{prefix}/'\n", "\n", "# create a pointer to our data in S3\n", "train = sagemaker.session.s3_input(s3train, distribution='FullyReplicated', \n", " content_type=None, s3_data_type='S3Prefix')\n", "\n", "data_channels = {'train': train}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instantiate the model\n", "\n", "Now we are going to instantiate our model, here we are going to specify our hyperparameters for training as well as the number of GPUs we are going to use. For a longer running training job, ml.p4d.24xlarge instances contain 8 A100 GPUs, making them ideal for heavy duty deep learning training. We don't need quite as much horsepower since we are only training our model for a couple epochs, so we can instead use a smaller GPU instance. For our specific training job we are going to use a p3.8xlarge instance consisting of 4 V100 GPUs. \n", "\n", "Once we have set our hyperparameters, we will instantiate a Sagemaker Estimator that we will use to run our training job. We specify the Docker image we just pushed to ECR as well as an entrypoint giving instructions for what operations our container should perform when it starts up. Our Docker container has two commands, train and serve. When we instantiate a training job, behind the scenes Sagemaker is running our Docker container and telling it to run the train command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "account=!aws sts get-caller-identity --query Account --output text\n", "\n", "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n", "region=!aws configure get region\n", "\n", "# metric definition to extract the results\n", "metric_definitions=[\n", " {\"Name\": \"train_runtime\", \"Regex\": \"train_runtime.*=\\D*(.*?)$\"},\n", " {'Name': 'train_samples_per_second', 'Regex': \"train_samples_per_second.*=\\D*(.*?)$\"},\n", " {'Name': 'epoch', 'Regex': \"epoch.*=\\D*(.*?)$\"},\n", " {'Name': 'f1', 'Regex': \"f1.*=\\D*(.*?)$\"},\n", " {'Name': 'exact_match', 'Regex': \"exact_match.*=\\D*(.*?)$\"}]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hyperparameters, which are passed into the training job\n", "hyperparameters={\n", " 'model_name': model_name,\n", " 'dataset_name':'squad',\n", " 'do_train': True,\n", " 'do_eval': True,\n", " 'fp16': True,\n", " 'train_batch_size': 32, \n", " 'eval_batch_size': 32,\n", " 'weight_decay':0.01,\n", " 'warmup_steps':500,\n", " 'learning_rate':5e-5,\n", " 'epochs': 2,\n", " 'max_length': 384,\n", " 'max_steps': 100,\n", " 'pad_to_max_length': True,\n", " 'doc_stride': 128,\n", " 'output_dir': '/opt/ml/model'\n", "}\n", "\n", "# estimator\n", "huggingface_estimator = HuggingFace(entry_point='run_qa.py',\n", " source_dir='container_training',\n", " metric_definitions=metric_definitions,\n", " instance_type='ml.p3.8xlarge',\n", " instance_count=1,\n", " volume_size=100,\n", " role=role,\n", " transformers_version='4.6.1',\n", " pytorch_version='1.7.1',\n", " py_version='py36',\n", " hyperparameters = hyperparameters)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fine-tune the Model\n", "\n", "If you use an instance with 4 GPUs and a batch size of 
32, this process will take ~30 minutes to complete for this particular finetuning task with 2 epochs. Each additional epoch will add another 8 or so minutes. It's recommended to use, at minimum, a training instance with 4 GPUs, although you will likely get better performance with one of the ml.p3.16xlarge, ml.p3dn.24xlarge, or ml.p4d.24xlarge instances.\n", "\n", "We'll be running this training in a SageMaker Training job, which is triggered by the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "huggingface_estimator.fit(data_channels, wait=False, job_name=f'hf-distilbert-squad-{int(time.time())}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Model Locally\n", "\n", "Once our training job is complete, we can download our trained model. SageMaker stores our model output in S3 as a tarball, so we will need to retrieve it from S3 and then extract the contents locally. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Once training has completed, copy the model weights to our instance.\n", "! aws s3 cp {huggingface_estimator.output_path}{huggingface_estimator.jobs[0].job_name}/output/model.tar.gz model/model.tar.gz\n", "\n", "import tarfile\n", "\n", "with tarfile.open('model/model.tar.gz', 'r:gz') as f:\n", " f.extractall('model')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Model with Trained Weights\n", "\n", "Now that we've extracted the weights, we can reinitialize our model with them by pointing to the folder containing our weights." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = AutoModelForQuestionAnswering.from_pretrained('model')\n", "tokenizer = AutoTokenizer.from_pretrained('model') # model_name\n", "evl = model.eval()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Time to test out our model! Let's try a few different questions from our dev set and see how our model performs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "i = np.random.randint(0,25)\n", "QA_input = {\n", " 'question': sq['paragraphs'][i]['qas'][0]['question'],\n", " 'context': sq['paragraphs'][i]['context']\n", "}\n", "inputs = tokenizer(QA_input['question'], QA_input['context'], return_tensors='pt')\n", "input_ids = inputs[\"input_ids\"].tolist()[0]\n", "start_positions = torch.tensor([1])\n", "end_positions = torch.tensor([3])\n", "\n", "outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)\n", "\n", "answer_start_scores = outputs.start_logits\n", "answer_end_scores = outputs.end_logits\n", "answer_start = torch.argmax(\n", " answer_start_scores\n", ") # Get the most likely beginning of answer with the argmax of the score\n", "answer_end = torch.argmax(answer_end_scores) + 1 # Get the most likely end of answer with the argmax of the score\n", "answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))\n", "print(f\"Context: {QA_input['context']} \\n\")\n", "print(f\"Question: {QA_input['question']} \\n\")\n", "print(f\"Answer: {answer}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy Trained Model\n", "\n", "Now that we've finetuned our model, what now? Let's deploy our trained model to an endpoint and ask it some questions!"
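 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we launched training with wait=False, it is worth confirming that the training job has finished (so its model artifact exists in S3) before deploying. The optional cell below is a small status-check sketch that uses the DescribeTrainingJob API with the job name recorded on our estimator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: confirm the training job completed before deploying (status-check sketch).\n", "training_job_name = huggingface_estimator.jobs[0].job_name\n", "status = sagemaker_cl.describe_training_job(TrainingJobName=training_job_name)[\"TrainingJobStatus\"]\n", "print(f\"Training job {training_job_name}: {status}\")\n", "assert status == \"Completed\""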
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.pytorch import PyTorch, PyTorchModel\n", "from sagemaker.predictor import RealTimePredictor\n", "\n", "# this class defines the content type used for our endpoint, in this case plain text\n", "class StringPredictor(RealTimePredictor):\n", " def __init__(self, endpoint_name, sagemaker_session):\n", " super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "endpoint_name = 'hf-distilbert-QA-string-endpoint'\n", "\n", "# if deploying from a model you trained in the same session \n", "# bert_end = torch_model.deploy(instance_type='ml.g4dn.4xlarge', initial_instance_count=1, \n", "# endpoint_name=endpoint_name)\n", "\n", "model_data = f\"{huggingface_estimator.output_path}{huggingface_estimator.jobs[0].job_name}/output/model.tar.gz\"\n", "\n", "# We are going to use a SageMaker serving container\n", "torch_model = PyTorchModel(model_data=model_data,\n", " source_dir = 'container_serving',\n", " role=role,\n", " entry_point='transform_script.py',\n", " framework_version='1.8.1',\n", " py_version='py3',\n", " predictor_cls = StringPredictor)\n", "bert_end = torch_model.deploy(instance_type='ml.m5.2xlarge', initial_instance_count=1, #'ml.g4dn.xlarge'\n", " endpoint_name=endpoint_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are a few questions we can ask our model about SageMaker:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "amzn_context = \"\"\"\n", "Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models.\n", "Traditional ML development is a complex, expensive, iterative process made even harder because there are no integrated \n", "tools for the entire machine learning workflow. You need to stitch together tools and workflows, which is time-consuming\n", "and error-prone. 
SageMaker solves this challenge by providing all of the components used for machine learning in a single \n", "toolset so models get to production faster with much less effort and at lower cost.\n", "\"\"\"\n", "\n", "amzn_questions = [\n", " \"How does SageMaker solve the challenge of traditional ML Development?\",\n", " \"What is Traditional ML development?\",\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create a predictor object by using the RealTimePredictor class and specifying our endpoint name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# test hf classifier\n", "predictor = RealTimePredictor(endpoint_name=endpoint_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plain Text Inference Results\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "print('Context:',amzn_context)\n", "ind = 0\n", "\n", "predictions = []\n", "for amzn_question in amzn_questions:\n", " print('-------------------------------------------------------------------------------------')\n", " print('Question:',amzn_question)\n", " test_text = \"|\".join((amzn_question, amzn_context))\n", " pred = predictor.predict(test_text).decode().strip('\"')\n", " pred = \"\".join(eval(pred))\n", " print('-------------------------------------------------------------------------------------')\n", " print('Prediction:',pred)\n", " print('-------------------------------------------------------------------------------------')\n", " predictions.append(pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualize model results\n", "\n", "Now that we've deployed our model and run some inference through it, we can actually visualize the results for multiple questions at once\n", "in the original annotation UI by manually creating some SQuAD format and performing the liquid injection ourselves." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is the SquAD input format required by the annotation UI.\n", "squad_txt = json.dumps(\n", " {\n", " \"version\": \"v2.0\",\n", " \"data\": [\n", " {\n", " \"title\": \"Ground Truth Marketing\",\n", " \"paragraphs\": [\n", " {\n", " \"context\": amzn_context.replace(\"\\n\", \"\"),\n", " \"qas\": [\n", " {\n", " \"question\": question,\n", " \"id\": i,\n", " \"answers\": [\n", " {\n", " \"answer_id\": i,\n", " \"text\": prediction,\n", " \"answer_start\": amzn_context.replace(\"\\n\", \"\").find(prediction),\n", " }\n", " ],\n", " }\n", " for i, (question, prediction) in enumerate(\n", " zip(amzn_questions, predictions)\n", " )\n", " ],\n", " },\n", " ],\n", " }\n", " ],\n", " }\n", ")\n", "\n", "with open(\"qa_annotator.liquid.html\") as f:\n", " # Manually inject our generated squad into the liquid tag.\n", " tmp_annotator_txt = f.read().replace(\"{{ task.input.source }}\", squad_txt)\n", " # We'll remove crowd html since we aren't loading within a Ground Truth context.\n", " tmp_annotator_txt = tmp_annotator_txt.replace('', \"\")\n", "\n", "tmp_annotator_path = \"tmp_annotator.html\"\n", "with open(tmp_annotator_path, \"w\") as f:\n", " f.write(tmp_annotator_txt)\n", "\n", "from IPython.display import IFrame\n", "IFrame(tmp_annotator_path, 1280, 800)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this notebook, you learned how to create your own question answering dataset using SageMaker Ground Truth and combine it with SQuAD to train your own question answering model using SageMaker training.\n", "\n", "Try augmenting SQuAD or even creating an entire dataset with your own questions using SageMaker Ground Truth (https://aws.amazon.com/sagemaker/groundtruth/). Also try out finetuning different BERT variants using Huggingface (https://huggingface.co/) for question answering. Happy building!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm bert_base.pt\n", "!rm s3_bucket.txt\n", "bert_end.delete_endpoint()" ] } ], "metadata": { "instance_type": "ml.g4dn.xlarge", "kernelspec": { "display_name": "conda_pytorch_latest_p36", "language": "python", "name": "conda_pytorch_latest_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }