{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Text Explainability for SageMaker BlazingText\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "*This notebook will not work on SageMaker Studio notebook. Please run this notebook on SageMaker Notebook instances and set the kernel to conda_python3.*\n", "\n", "Amazon SageMaker Clarify helps improve your machine learning models by detecting potential bias and helping explain how these models make predictions. The fairness and explainability functionality provided by SageMaker Clarify takes a step towards enabling AWS customers to build trustworthy and understandable machine learning models. The product comes with the tools to help you with the following tasks.\n", "\n", "Measure biases that can occur during each stage of the ML lifecycle (data collection, model training and tuning, and monitoring of ML models deployed for inference).\n", "Generate model governance reports targeting risk and compliance teams and external regulators.\n", "Provide explanations of the data, models, and monitoring used to assess predictions for input containing data of various modalities like numerical data, categorical data, text, and images.\n", "\n", "Learn more about SageMaker Clarify [here](https://aws.amazon.com/sagemaker/clarify/). This sample notebook walks you through:\n", "\n", "1. Key terms and concepts needed to understand SageMaker Clarify\n", "2. The incremental updates required to explain text features, along with other tabular features.\n", "3. Explaining the importance of the various new input features on the model's decision\n", "\n", "SageMaker Clarify currently accepts only [CSV, and JSON Lines formats as input](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-configure-processing-jobs.html#clarify-processing-job-configure-prerequisites) and the [SageMaker BlazingText Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html) container only accepts JSON as the inference input format and thus can't be directly used. In order to use Clarify, we have 2 options as below -\n", "\n", "\n", "\n", "### Option A\n", "\n", "![Architecture](SageMaker_Clarify_BlazingText.png)\n", "\n", "The notebook follows this approach which will first train a Text Classification Model using SageMaker BlazingText using the SageMaker Estimator in the SageMaker Python SDK. We will take this Model and host the model using fasttext BYOC SageMaker container for Clarify to use and run Explainability job. 
"The BYOC container accepts inference requests in CSV format, allowing SageMaker Clarify to explain the inference results of the SageMaker BlazingText model.\n", "\n", "\n", "### Option B\n", "\n", "![Architecture](pipeline-based-xai.png)\n", "\n", "An alternative approach, which scales to other model types, is to use a SageMaker Inference Pipeline with a pre-processing container that converts from a format SageMaker Clarify accepts (i.e., JSON Lines or CSV) to the format the underlying model accepts (i.e., JSON for the SageMaker BlazingText algorithm). Although this approach is not demonstrated in this notebook, it can be applied to other model types.\n", "\n" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Training Text Classification Model using SageMaker BlazingText Algorithm\n", "\n", "Text Classification can be used to solve various use-cases like sentiment analysis, spam detection, hashtag prediction, etc. This notebook demonstrates the use of SageMaker BlazingText to perform supervised binary/multi-class, single- or multi-label text classification. BlazingText can train the model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Install required libraries" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install sagemaker botocore boto3 --upgrade\n", "! pip install awscli --force-reinstall --upgrade" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region. \n", "- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from sagemaker import get_execution_role\n", "import json\n", "import boto3\n", "\n", "sess = sagemaker.Session()\n", "\n", "role = get_execution_role()\n", "print(\n", "    role\n", ")  # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf\n", "\n", "bucket = sess.default_bucket()  # Replace with your own bucket name if needed\n", "print(bucket)\n", "prefix = \"blazingtext/supervised\"  # Replace with the prefix under which you want to store the data if needed" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Data Preparation\n", "\n", "Now we'll download a dataset from the web on which we want to train the text classification model. ",
"BlazingText expects a single preprocessed text file with space-separated tokens, where each line of the file contains a single sentence and the corresponding label(s) prefixed by `__label__`.\n", "\n", "In this example, let us train the text classification model on the [DBPedia Ontology Dataset](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2) as done by [Zhang et al](https://arxiv.org/pdf/1509.01626.pdf). The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. It has 560,000 training samples and 70,000 testing samples. The fields used for this dataset contain the title and abstract of each Wikipedia article. \n", "\n", "The dataset below is attributed to [this](https://github.com/saurabh3949/Text-Classification-Datasets/blob/master/dbpedia_csv.tar.gz) original source and is shared under [this](https://creativecommons.org/licenses/by-sa/3.0/) license. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.s3 import S3Downloader as downloader\n", "\n", "downloader.download(\n", "    s3_uri=f\"s3://sagemaker-example-files-prod-{sess.boto_region_name}/datasets/text/dbpedia/dbpedia.tar.gz\",\n", "    local_path=\".\",\n", "    sagemaker_session=sess,\n", ")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!tar -xzvf dbpedia.tar.gz" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head dbpedia_csv/train.csv -n 3" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat dbpedia_csv/classes.txt" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "index_to_label = {}\n", "with open(\"dbpedia_csv/classes.txt\") as f:\n", "    for i, label in enumerate(f.readlines()):\n", "        index_to_label[str(i + 1)] = label.strip()\n", "print(index_to_label)" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Data Preprocessing\n", "We need to preprocess the training data into the **space-separated tokenized text** format that the `BlazingText` algorithm can consume. Also, as mentioned previously, the class label(s) should be prefixed with `__label__` and placed on the same line as the original sentence. We'll use the `nltk` library to tokenize the input sentences from the DBPedia dataset. " ] },
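{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To make the target format concrete, the next cell prints a sketch of what a single transformed line should look like. The row shown is a hypothetical example; the actual lines are produced by the preprocessing cells that follow." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Hypothetical example of one line in the BlazingText supervised format:\n", "# \"__label__<class>\" followed by lowercased, space-separated tokens.\n", "example_line = \"__label__Artist pablo ruiz picasso was a spanish painter and sculptor .\"\n", "print(example_line)\n", "print(example_line.split()[0])  # the label token: __label__Artist" ] },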
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from random import shuffle\n", "import multiprocessing\n", "from multiprocessing import Pool\n", "import csv\n", "import nltk\n", "\n", "nltk.download(\"punkt\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def transform_instance(row):\n", " cur_row = []\n", " label = \"__label__\" + index_to_label[row[0]] # Prefix the index-ed label with __label__\n", " cur_row.append(label)\n", " cur_row.extend(nltk.word_tokenize(row[1].lower()))\n", " cur_row.extend(nltk.word_tokenize(row[2].lower()))\n", " return cur_row" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def preprocess(input_file, output_file, keep=1):\n", " all_rows = []\n", " with open(input_file, \"r\") as csvinfile:\n", " csv_reader = csv.reader(csvinfile, delimiter=\",\")\n", " for row in csv_reader:\n", " all_rows.append(row)\n", " shuffle(all_rows)\n", " all_rows = all_rows[: int(keep * len(all_rows))]\n", " pool = Pool(processes=multiprocessing.cpu_count())\n", " transformed_rows = pool.map(transform_instance, all_rows)\n", " pool.close()\n", " pool.join()\n", "\n", " with open(output_file, \"w\") as csvoutfile:\n", " csv_writer = csv.writer(csvoutfile, delimiter=\" \", lineterminator=\"\\n\")\n", " csv_writer.writerows(transformed_rows)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "# Preparing the training dataset\n", "\n", "# Since preprocessing the whole dataset might take a couple of minutes,\n", "# we keep 20% of the training dataset for this demo.\n", "# Set keep to 1 if you want to use the complete dataset\n", "preprocess(\"dbpedia_csv/train.csv\", \"dbpedia.train\", keep=0.2)\n", "\n", "# Preparing the validation dataset\n", "preprocess(\"dbpedia_csv/test.csv\", \"dbpedia.validation\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "train_channel = prefix + \"/train\"\n", "validation_channel = prefix + \"/validation\"\n", "\n", "sess.upload_data(path=\"dbpedia.train\", bucket=bucket, key_prefix=train_channel)\n", "sess.upload_data(path=\"dbpedia.validation\", bucket=bucket, key_prefix=validation_channel)\n", "\n", "s3_train_data = \"s3://{}/{}\".format(bucket, train_channel)\n", "s3_validation_data = \"s3://{}/{}\".format(bucket, validation_channel)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_output_location = \"s3://{}/{}/output\".format(bucket, prefix)\n", "print(s3_output_location)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "Now that we are done with all the setup that is needed, we are ready to train our object detector. To begin, let us create a ``sageMaker.estimator.Estimator`` object. This estimator will launch the training job." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "region_name = boto3.Session().region_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, \"blazingtext\", \"latest\")\n", "print(\"Using SageMaker BlazingText container: {} ({})\".format(container, region_name))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Training the BlazingText model for supervised text classification" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using Negative Sampling, on CPUs and additionally on GPU[s]. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To summarize, the following modes are supported by BlazingText on different types instances:\n", "\n", "| Modes \t| cbow (supports subwords training) \t| skipgram (supports subwords training) \t| batch_skipgram \t| supervised |\n", "|:----------------------:\t|:----:\t|:--------:\t|:--------------:\t| :--------------:\t|\n", "| Single CPU instance \t| ✔ \t| ✔ \t| ✔ \t| ✔ |\n", "| Single GPU instance \t| ✔ \t| ✔ \t| \t| ✔ (Instance with 1 GPU only) |\n", "| Multiple CPU instances \t| \t| \t| ✔ \t| | |\n", "\n", "Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train Text Classification on *DBPedia* dataset, using \"supervised\" mode on a `c4.4xlarge` instance.\n", "\n", "Refer to [BlazingText Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) in the Amazon SageMaker documentation for the complete list of hyperparameters." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bt_model = sagemaker.estimator.Estimator(\n", " container,\n", " role,\n", " instance_count=1,\n", " instance_type=\"ml.c4.4xlarge\",\n", " volume_size=30,\n", " max_run=360000,\n", " input_mode=\"File\",\n", " output_path=s3_output_location,\n", " hyperparameters={\n", " \"mode\": \"supervised\",\n", " \"epochs\": 1,\n", " \"min_count\": 2,\n", " \"learning_rate\": 0.05,\n", " \"vector_dim\": 10,\n", " \"early_stopping\": True,\n", " \"patience\": 4,\n", " \"min_epochs\": 5,\n", " \"word_ngrams\": 2,\n", " },\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data = sagemaker.inputs.TrainingInput(\n", " s3_train_data,\n", " distribution=\"FullyReplicated\",\n", " content_type=\"text/plain\",\n", " s3_data_type=\"S3Prefix\",\n", ")\n", "validation_data = sagemaker.inputs.TrainingInput(\n", " s3_validation_data,\n", " distribution=\"FullyReplicated\",\n", " content_type=\"text/plain\",\n", " s3_data_type=\"S3Prefix\",\n", ")\n", "data_channels = {\"train\": train_data, \"validation\": validation_data}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We have our `Estimator` object, we have set the hyper-parameters for this object and we have our data channels linked with the algorithm. The only remaining thing to do is to train the algorithm. The following command will train the algorithm. The estimated total training time based on the training job configuration is approximately 10 mins. Training the algorithm involves a few steps. Firstly, the instance that we requested while creating the `Estimator` classes is provisioned and is setup with the appropriate libraries. Then, the data from our channels are downloaded into the instance. Once this is done, the training job begins. The provisioning and data downloading will take some time, depending on the size of the data. Therefore it might be a few minutes before we start getting training logs for our training jobs. The data logs will also print out Accuracy on the validation data for every epoch after training job has executed `min_epochs`. This metric is a proxy for the quality of the algorithm. \n", "\n", "Once the job has finished a \"Job complete\" message will be printed. The trained model can be found in the S3 bucket that was setup as `output_path` in the estimator." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bt_model.fit(inputs=data_channels, logs=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Packaging and Uploading your own container for Inference with Amazon SageMaker\n", "\n", "### An overview of Docker\n", "\n", "If you're familiar with Docker already, you can skip ahead to the next section.\n", "\n", "For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. \n", "\n", "Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. 
"Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.\n", "\n", "Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variables, etc.\n", "\n", "In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.\n", "\n", "Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.\n", "\n", "Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. \n", "\n", "Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.\n", "\n", "In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.\n", "\n", "Some helpful links:\n", "\n", "* [Docker home page](http://www.docker.com)\n", "* [Getting started with Docker](https://docs.docker.com/get-started/)\n", "* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)\n", "* [`docker run` reference](https://docs.docker.com/engine/reference/run/)\n", "* [Amazon ECS](https://aws.amazon.com/ecs/)\n", "\n", "### How Amazon SageMaker runs your Docker container\n", "\n", "Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:\n", "\n", "* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.\n", "* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. \n", "\n", "#### Running your container during hosting\n", "\n", "Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests:\n", "\n", "![Request serving stack](stack.png)\n", "\n", "This stack is implemented in the sample code here and you can mostly just leave it alone. \n", "\n", "Amazon SageMaker uses two URLs in the container:\n", "\n", "* `/ping` will receive `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.\n", "* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. The sketch below illustrates these two routes.\n" ] },
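{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To make these two URLs concrete, here is a minimal, hypothetical Flask sketch of a serving app. It is *not* the `predictor.py` shipped in the `container` directory of this sample; the model loading and prediction logic here are placeholders.\n", "\n", "```python\n", "# Minimal sketch of a SageMaker-compatible serving app (hypothetical).\n", "import json\n", "\n", "import flask\n", "\n", "app = flask.Flask(__name__)\n", "\n", "\n", "def my_model_predict(csv_text):\n", "    # Placeholder: a real predictor.py would load the fastText model from\n", "    # /opt/ml/model and run inference on each CSV row.\n", "    rows = [r for r in csv_text.splitlines() if r]\n", "    return \"\\n\".join(json.dumps({\"label\": [\"__label__placeholder\"], \"prob\": [1.0]}) for _ in rows)\n", "\n", "\n", "@app.route(\"/ping\", methods=[\"GET\"])\n", "def ping():\n", "    # Return 200 when the container is healthy (e.g., the model loaded correctly).\n", "    return flask.Response(response=\"\\n\", status=200, mimetype=\"application/json\")\n", "\n", "\n", "@app.route(\"/invocations\", methods=[\"POST\"])\n", "def invocations():\n", "    # Parse the body according to its ContentType (text/csv in this sample),\n", "    # run the model, and return the predictions as JSON Lines.\n", "    predictions = my_model_predict(flask.request.data.decode(\"utf-8\"))\n", "    return flask.Response(response=predictions, status=200, mimetype=\"application/jsonlines\")\n", "```\n", "\n", "In the real container, gunicorn runs the Flask app behind nginx, as described in the sections below." ] },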
\n", "\n", "The container will have the model files in the same place they were written during training:\n", "\n", " /opt/ml\n", " `-- model\n", " `-- \n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### The parts of the sample container\n", "\n", "In the `container` directory are all the components you need to package the sample algorithm for Amazon SageMager:\n", "\n", " .\n", " |-- Dockerfile\n", " |-- build_and_push.sh\n", " `-- blazing_text\n", " |-- nginx.conf\n", " |-- predictor.py\n", " |-- serve\n", " `-- wsgi.py\n", "\n", "Let's discuss each of these in turn:\n", "\n", "* __`Dockerfile`__ describes how to build your Docker container image. More details below.\n", "* __`build_and_push.sh`__ is a script that uses the Dockerfile to build your container images and then pushes it to ECR. We'll invoke the commands directly later in this notebook, but you can just copy and run the script for your own algorithms.\n", "* __`blazing_text`__ is the directory which contains the files that will be installed in the container.\n", "* __`local_test`__ is a directory that shows how to test your new container on any computer that can run Docker, including an Amazon SageMaker notebook instance. Using this method, you can quickly iterate using small datasets to eliminate any structural bugs before you use the container with Amazon SageMaker. We'll walk through local testing later in this notebook. Please copy the model.bin file (output of the training job) to the container/local_test/test_dir/model directory.\n", "\n", "In this simple application, we only install five files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These five show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.\n", "\n", "The files that we'll put in the container are:\n", "\n", "* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.\n", "* __`predictor.py`__ is the program that actually implements the Flask web server and the decision tree predictions for this app. You'll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.\n", "* __`serve`__ is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in `predictor.py`. You should be able to take this file as-is.\n", "* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.\n", "\n", "In summary, there is one file you will probably want to change for your application - `predictor.py`." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### The Dockerfile\n", "\n", "The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A Docker container running is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations. 
\n", "\n", "For the Python science stack, we will start from a standard Ubuntu installation and run the normal tools to install the things needed by scikit-learn. Finally, we add the code that implements our specific algorithm to the container and set up the right environment to run under.\n", "\n", "Along the way, we clean up extra space. This makes the container smaller and faster to start.\n", "\n", "Let's look at the Dockerfile for the example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!cat container/Dockerfile" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Building and registering the container\n", "\n", "The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. This code is also available as the shell script `container/build-and-push.sh`, which you can run as `build-and-push.sh sagemaker-blazing-text` to build the image `sagemaker-blazing-text`. \n", "\n", "This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn't exist, the script will create it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "algorithm_name = \"sagemaker-blazing-text\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "byoc_image_uri = \"{}.dkr.ecr.{}.amazonaws.com/{}:latest\".format(\n", " sess.account_id(), region_name, algorithm_name\n", ")\n", "print(byoc_image_uri)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "# The name of our algorithm\n", "algorithm_name=sagemaker-blazing-text\n", "\n", "cd container\n", "\n", "chmod +x blazing_text/serve\n", "\n", "account=$(aws sts get-caller-identity --query Account --output text)\n", "\n", "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n", "region=$(aws configure get region)\n", "region=${region:-us-west-2}\n", "\n", "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest\"\n", "\n", "# If the repository doesn't exist in ECR, create it.\n", "aws ecr describe-repositories --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n", "\n", "if [ $? -ne 0 ]\n", "then\n", " aws ecr create-repository --repository-name \"${algorithm_name}\" > /dev/null\n", "fi\n", "\n", "# Get the login command from ECR and execute it directly\n", "docker login -u AWS -p $(aws ecr get-login-password --region ${region}) ${fullname}\n", "\n", "# Build the docker image locally with the image name and then push it to ECR\n", "# with the full name.\n", "\n", "docker build -t ${algorithm_name} .\n", "docker tag ${algorithm_name} ${fullname}\n", "\n", "docker push ${fullname}" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Testing your algorithm on your local machine or on an Amazon SageMaker notebook instance\n", "\n", "While you're first packaging an algorithm use with Amazon SageMaker, you probably want to test it yourself to make sure it's working right. In the directory `container/local_test`, there is a framework for doing this. 
"It includes two shell scripts for running and using the container and a directory structure that mimics the one outlined above.\n", "\n", "The scripts are:\n", "\n", "* `serve_local.sh`: Run this with the name of the image once you've trained the model and it should serve the model. For example, you can run `$ ./serve_local.sh sagemaker-blazing-text`. It will run and wait for requests. Simply use the keyboard interrupt to stop it.\n", "* `predict.sh`: Run this with the name of a payload file and (optionally) the HTTP content type you want. The content type will default to `text/csv`. For example, you can run `$ ./predict.sh payload.csv text/csv`.\n", "\n", "The directories as shipped are set up to test the BlazingText sample algorithm presented here." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Deploy your trained SageMaker BlazingText Model using your own container in Amazon SageMaker\n", "\n", "Once you have your container packaged, you can use it for hosting the model for real-time inference. Let's do that with the algorithm we made above.\n", "\n", "### Set up the environment\n", "\n", "Here we specify a bucket to use and the role that will be used for working with SageMaker." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# S3 prefix\n", "prefix = \"byom-blazingtext\"\n", "\n", "# Define IAM role\n", "import boto3\n", "import re\n", "\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Hosting your model\n", "You can use a trained model to get real-time predictions via an HTTPS endpoint. The following steps walk you through the process." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Create the model\n", "The first step is to create a model object from the SageMaker Python SDK `Model` class that can be deployed to an\n", "HTTPS endpoint.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "from datetime import datetime\n", "from sagemaker.model import Model\n", "from sagemaker.predictor import Predictor\n", "\n", "model_name = \"blazing-text-byo-model-clarify-demo-{}\".format(\n", "    datetime.now().strftime(\"%d-%m-%Y-%H-%M-%S\")\n", ")\n", "print(model_name)\n", "sagemaker_session = sagemaker.Session()\n", "sagemaker_model = Model(\n", "    model_data=bt_model.model_data,\n", "    role=role,\n", "    image_uri=byoc_image_uri,\n", "    sagemaker_session=sagemaker_session,\n", "    predictor_cls=Predictor,\n", "    name=model_name,\n", ")" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Deploying the model is optional; follow the steps below if you want to create a persistent endpoint. Otherwise, skip to **Configure SageMaker Clarify**." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### [Optional] Deploy the model\n", "\n", "Deploying the model to SageMaker hosting just requires a `deploy` call on the SageMaker model object. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint." ] },
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.deserializers import JSONLinesDeserializer\n", "\n", "predictor = sagemaker_model.deploy(\n", " 1, \"ml.m4.xlarge\", serializer=CSVSerializer(), deserializer=JSONLinesDeserializer()\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shape = pd.read_csv(\"container/local_test/payload.csv\", header=None)\n", "shape.sample(2)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Prediction is as easy as calling predict with the predictor we got back from deploy and the data we want to do predictions with. The serializers take care of doing the data conversions for us." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Configure SageMaker Clarify" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "SageMaker Clarify needs to be configured to generate the explainability reports. This can easily be done using the Amazon SageMaker Python SDK.\n", "\n", "We first define the `SageMakerClarifyProcessor` object by providing the `instance_type`, `instance_count`, `role` and the session. This information is used by SageMaker Clarify to run a processing job with the appropriate configurations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import clarify\n", "\n", "clarify_processor = clarify.SageMakerClarifyProcessor(\n", " role=role,\n", " instance_count=1,\n", " instance_type=\"ml.m5.xlarge\",\n", " sagemaker_session=sagemaker_session,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "`ModelConfig` is used to specify how to use the trained model. There are 2 options -\n", "1. When a `model_name` is specified along with an `instance_type` and `instance_count`, SageMaker Clarify spins up an ephemeral endpoint for the duration that the SageMaker Clarify processing job is active.\n", "2. If there is an existing endpoint, you can specific the `endpoint_name` and SageMaker Clarify uses that endpoint. This reduces the job runtime since it avoids the creation of an endpoint. For example - if you created an endpoint above and want Clarify to use that, then you should use this option.\n", "\n", "In this notebook, we use the first option." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_config = clarify.ModelConfig(\n", " model_name=model_name,\n", " instance_type=\"ml.m5.xlarge\",\n", " instance_count=1,\n", " accept_type=\"application/jsonlines\",\n", " content_type=\"text/csv\",\n", " endpoint_name_prefix=None,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "A `DataConfig` object specifies information about I/O to SageMaker Clarify. We specify where to find the input, where to store the output, headers and the dataset type." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "file_path = \"container/local_test/payload.csv\"\n", "\n", "df_test_clarify = pd.read_csv(file_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "explainability_output_path = \"s3://{}/{}/clarify-text-explainability\".format(\n", " sagemaker_session.default_bucket(), \"explainability\"\n", ")\n", "explainability_data_config = clarify.DataConfig(\n", " s3_data_input_path=file_path,\n", " s3_output_path=explainability_output_path,\n", " headers=[\"Review Text\"],\n", " dataset_type=\"text/csv\",\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Clarify uses `ShapConfig` to obtain feature attributions. `text_config` is used to specify the language and granularity. The number of samples that are to be used in the Kernel SHAP algorithm is determined by the value of the `num_samples` argument. `baseline` is used by Clarify to replace subsets of the input text to obtain predictions for computing SHAP values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shap_config = clarify.SHAPConfig(\n", " baseline=[[\"\"]],\n", " num_samples=1000,\n", " agg_method=\"mean_abs\",\n", " save_local_shap_values=True,\n", " text_config=clarify.TextConfig(granularity=\"token\", language=\"english\"),\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Next, we configure `ModelPredictedLabelConfig` that specifies how to extract a predicted label from the model output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.clarify import ModelPredictedLabelConfig\n", "\n", "modellabel_config = ModelPredictedLabelConfig(probability=\"prob\", label=\"label\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Running the Clarify explainability job involves spinning up a processing job and a model endpoint which may take a few minutes.\n", "# After this you will see a progress bar for the SHAP computation.\n", "# The size of the dataset (num_examples) and the num_samples for shap will effect the running time.\n", "clarify_processor.run_explainability(\n", " data_config=explainability_data_config,\n", " model_config=model_config,\n", " explainability_config=shap_config,\n", " model_scores=modellabel_config,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "\n", "local_feature_attributions_file = \"out.jsonl\"\n", "analysis_results = []\n", "analysis_result = sagemaker.s3.S3Downloader.download(\n", " explainability_output_path + \"/explanations_shap/\" + local_feature_attributions_file,\n", " local_path=\"./\",\n", ")\n", "\n", "shap_out = []\n", "file = sagemaker.s3.S3Downloader.read_file(\n", " explainability_output_path + \"/explanations_shap/\" + local_feature_attributions_file\n", ")\n", "for line in file.split(\"\\n\"):\n", " if line:\n", " shap_out.append(json.loads(line))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(json.dumps(shap_out[0], indent=2))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "At the highest level of this JSON Line, there are two keys: explanations, join_source_value (Not present here as we have not included a joinsource column 
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, you learned how to use SageMaker Clarify to run a text explainability job for a SageMaker BlazingText model and get local and global explanations. This helps you understand how much individual tokens, sentences, or phrases contribute to the prediction outcome." ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## [Optional] Cleanup" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Finally, if you created an endpoint and do not want to retain it, please remember to delete the Amazon SageMaker endpoint to avoid charges:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] },
{ "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-clarify|text_explainability_sagemaker_algorithm|blazingtext_clarify_explainability.ipynb)\n" ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }