{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e676831e",
   "metadata": {},
   "source": [
    "# SageMaker Batch Transform Inference job with Hugging Face Transformers\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c8fbfa1",
   "metadata": {},
   "source": [
    "\n",
    "## Introduction\n",
    "---\n",
    "\n",
    "SageMaker 리얼타임 엔드포인트(SageMaker real-time endpoint)는 실시간으로 추론 결괏값을 빠른 응답속도 내에 전송받을 수 있지만, **호스팅 서버가 최소 1대 이상 구동**되어야 하므로 비용적인 측면에서 부담이 됩니다. 이런 경우 아래의 유즈케이스들에 해당하면 SageMaker 배치 변환(batch transform) 기능을 사용해 훈련 인스턴스처럼 배치 변환을 수행하는 때에만 컴퓨팅 인스턴스를 사용하여 비용을 절감할 수 있습니다.\n",
    "\n",
    "- 일/주/월 단위 정기적인 마케팅 캠페인이나 실시간 추천이 필요 없는 경우 전체 데이터셋에 대한 추론 결괏값 계산\n",
    "- 일부 추론 결괏값을 데이터베이스나 스토리지에 저장\n",
    "- SageMaker 호스팅 엔드포인트가 제공하는 1초 미만의 대기 시간이 필요하지 않은 경우\n",
    "\n",
    "자세한 내용은 아래 웹페이지를 참조해 주세요.\n",
    "- Amazon SageMaker Batch Transform: (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html) \n",
    "- API docs: https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-batch-transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22d33988",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import json\n",
    "import sys\n",
    "import numpy as np\n",
    "import logging\n",
    "import sagemaker\n",
    "from sagemaker.s3 import S3Uploader, s3_path_join\n",
    "from sagemaker.huggingface import HuggingFaceModel\n",
    "\n",
    "logging.basicConfig(\n",
    "    level=logging.INFO, \n",
    "    format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s',\n",
    "    handlers=[\n",
    "        logging.FileHandler(filename='tmp.log'),\n",
    "        logging.StreamHandler(sys.stdout)\n",
    "    ]\n",
    ")\n",
    "logger = logging.getLogger(__name__)\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "role = sagemaker.get_execution_role()\n",
    "\n",
    "# sagemaker session bucket -> used for uploading data, models and logs\n",
    "# sagemaker will automatically create this bucket if it not exists\n",
    "sagemaker_session_bucket=None\n",
    "if sagemaker_session_bucket is None and sess is not None:\n",
    "    # set to default bucket if a bucket name is not given\n",
    "    sagemaker_session_bucket = sess.default_bucket()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f0b456f",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 1. Run Batch Transform after training a model \n",
    "---\n",
    "\n",
    "추론을 수행할 데이터셋에 대한 S3 uri를 지정하고 transformer job을 실행하면, SageMaker는 배치 변환을 위한 컴퓨팅 인스턴스를 프로비저닝 후 S3에 저장된 데이터셋을 다운로드하고 추론을 수행한 후, 결과를 S3에 업로드합니다.\n",
    "\n",
    "\n",
    "```python\n",
    "batch_job = huggingface_estimator.transformer(\n",
    "    instance_count=1,\n",
    "    instance_type='ml.c5.2xlarge',\n",
    "    strategy='SingleRecord')\n",
    "\n",
    "batch_job.transform(\n",
    "    data='s3://s3-uri-to-batch-data',\n",
    "    content_type='application/json',    \n",
    "    split_type='Line')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9914b6bf",
   "metadata": {},
   "source": [
    "### Data Pre-Processing\n",
    "\n",
    "배치 변환을 위해 https://github.com/e9t/nsmc/ 에 공개된 네이버 영화 리뷰 테스트 데이터셋을 다운로드합니다. 테스트 데이터셋은 총 5만건입니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa4262f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8f42599",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_lines = sum(1 for line in open('ratings_test.txt')) - 1\n",
    "y_true = np.zeros((num_lines))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d076a6b",
   "metadata": {},
   "source": [
    "Hugging Face 추론 컨테이너는 `{'input' : '입력 데이터'}` 포맷으로 된 요청을 인식하기에 테스트 데이터셋을 아래와 같은 포맷으로 변환해야 합니다.\n",
    "\n",
    "```json\n",
    "{'inputs': '뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아'}\n",
    "{'inputs': '지루하지는 않은데 완전 막장임... 돈주고 보기에는....'}\n",
    "{'inputs': '3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??'}\n",
    "{'inputs': '음악이 주가 된, 최고의 음악영화'}\n",
    "...\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2140b277",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_csv_file = 'ratings_test.txt'\n",
    "dataset_jsonl_file=\"./ratings_test.jsonl\"\n",
    "with open(dataset_csv_file, \"r+\") as infile, open(dataset_jsonl_file, \"w+\") as outfile:\n",
    "    reader = csv.DictReader(infile, delimiter=\"\\t\")\n",
    "    for idx, row in enumerate(reader):\n",
    "        row_dict = {'inputs': row['document']}\n",
    "\n",
    "        y_true[idx] = row['label']\n",
    "        json.dump(row_dict, outfile)\n",
    "        outfile.write('\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "872b504e",
   "metadata": {},
   "source": [
    "### Upload dataset to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db56b395",
   "metadata": {},
   "outputs": [],
   "source": [
    "# uploads a given file to S3.\n",
    "input_s3_path = s3_path_join(\"s3://\", sagemaker_session_bucket, \"batch_transform/input\")\n",
    "output_s3_path = s3_path_join(\"s3://\", sagemaker_session_bucket, \"batch_transform/output\")\n",
    "s3_file_uri = S3Uploader.upload(dataset_jsonl_file, input_s3_path)\n",
    "\n",
    "print(f\"{dataset_jsonl_file} uploaded to {s3_file_uri}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55aa859a",
   "metadata": {},
   "source": [
    "### Create Inference Transformer to run the batch job"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4f93b70",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hub Model configuration. <https://huggingface.co/models>\n",
    "hub = {\n",
    "    'HF_MODEL_ID':'daekeun-ml/koelectra-small-v3-nsmc', # model_id from hf.co/models\n",
    "    'HF_TASK':'text-classification' # NLP task you want to use for predictions\n",
    "}\n",
    "\n",
    "# create Hugging Face Model Class\n",
    "huggingface_model = HuggingFaceModel(\n",
    "    env=hub,                        # configuration for loading model from Hub\n",
    "    role=role,                      # iam role with permissions to create an Endpoint\n",
    "    transformers_version=\"4.12.3\",  # transformers version used\n",
    "    pytorch_version=\"1.9.1\",        # pytorch version used\n",
    "    py_version='py38',              # python version used\n",
    ")\n",
    "\n",
    "# create Transformer to run our batch job\n",
    "batch_job = huggingface_model.transformer(\n",
    "    instance_count=1,              # number of instances used for running the batch job\n",
    "    instance_type='ml.g4dn.xlarge',# instance type for the batch job\n",
    "    output_path=output_s3_path,    # we are using the same s3 path to save the output with the input\n",
    "    strategy='SingleRecord')       # How we are sending the \"requests\" to the endpoint\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85f6ac25",
   "metadata": {},
   "outputs": [],
   "source": [
    "# starts batch transform job and uses s3 data as input\n",
    "batch_job.transform(\n",
    "    data=s3_file_uri,               # preprocessed file location on s3 \n",
    "    content_type='application/json',# mime-type of the file    \n",
    "    split_type='Line', # how the datapoints are split, here lines since it is `.jsonl`\n",
    "    wait=False\n",
    ")             "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2591cee6",
   "metadata": {},
   "source": [
    "### Wait for the batch transform jobs to complete\n",
    "\n",
    "배치 추론 작업이 완료될 때까지 기다립니다. 약 15분의 시간이 소요됩니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "657de359",
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_job.wait(logs=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eeb04a4",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 2. Access Prediction file\n",
    "---\n",
    "\n",
    "\n",
    "배치 변환 작업이 성공적으로 완료되면 `.out` 확장자의 출력 파일이 S3에 저장됩니다. `Transformer`에서 `join_source` 매개변수를 사용해서 입력 파일과 출력 파일을 병합하는 것도 가능하며, 자세한 내용은 (https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html 를 참조해 주세요."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bac37ed7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "from sagemaker.s3 import S3Downloader\n",
    "from ast import literal_eval\n",
    "# creating s3 uri for result file -> input file + .out\n",
    "batch_transform_dir = './batch'\n",
    "!rm -rf {batch_transform_dir}\n",
    "os.makedirs(batch_transform_dir, exist_ok=True)\n",
    "\n",
    "output_file = f\"{dataset_jsonl_file}.out\"\n",
    "local_output_path = f\"{batch_transform_dir}/{output_file}\"\n",
    "output_s3_filepath = s3_path_join(output_s3_path, output_file)\n",
    "\n",
    "logger.info(output_s3_filepath)\n",
    "\n",
    "# download file\n",
    "S3Downloader.download(output_s3_filepath, batch_transform_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "824a40d8",
   "metadata": {},
   "source": [
    "### Processing data\n",
    "\n",
    "예측 레이블 및 확률값을 받아옵니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8e861d9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read in the file\n",
    "with open(local_output_path, 'r') as file :\n",
    "    filedata = file.read()\n",
    "\n",
    "# Replace the target string\n",
    "filedata = filedata.replace('][', '\\n').replace('[', '').replace(']', '')\n",
    "\n",
    "# Write the file out again\n",
    "with open(f'{batch_transform_dir}/file.txt', 'w') as file:\n",
    "    file.write(filedata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf75ae7c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "y_pred = np.zeros((num_lines), dtype='int')\n",
    "y_score = np.zeros((num_lines), dtype='float32')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb4c6fb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "batch_transform_result = []\n",
    "with open(f'{batch_transform_dir}/file.txt') as f:\n",
    "    for idx, line in enumerate(f):\n",
    "        result = literal_eval(line)\n",
    "        y_pred[idx] = result['label']\n",
    "        y_score[idx] = result['score']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d28af58f",
   "metadata": {},
   "source": [
    "### Confusion Matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f8ebe27",
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_confusion_matrix(cm, target_names=None, cmap=None, normalize=True, labels=True, title='Confusion matrix'):\n",
    "    import itertools\n",
    "    import matplotlib.pyplot as plt\n",
    "    accuracy = np.trace(cm) / float(np.sum(cm))\n",
    "    misclass = 1 - accuracy\n",
    "\n",
    "    if cmap is None:\n",
    "        cmap = plt.get_cmap('Blues')\n",
    "\n",
    "    if normalize:\n",
    "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
    "        \n",
    "    plt.figure(figsize=(8, 6))\n",
    "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
    "    plt.title(title)\n",
    "    plt.colorbar()\n",
    "\n",
    "    thresh = cm.max() / 1.5 if normalize else cm.max() / 2\n",
    "    \n",
    "    if target_names is not None:\n",
    "        tick_marks = np.arange(len(target_names))\n",
    "        plt.xticks(tick_marks, target_names)\n",
    "        plt.yticks(tick_marks, target_names)\n",
    "    \n",
    "    if labels:\n",
    "        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
    "            if normalize:\n",
    "                plt.text(j, i, \"{:0.4f}\".format(cm[i, j]),\n",
    "                         horizontalalignment=\"center\",\n",
    "                         color=\"white\" if cm[i, j] > thresh else \"black\")\n",
    "            else:\n",
    "                plt.text(j, i, \"{:,}\".format(cm[i, j]),\n",
    "                         horizontalalignment=\"center\",\n",
    "                         color=\"white\" if cm[i, j] > thresh else \"black\")\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.ylabel('True label')\n",
    "    plt.xlabel('Predicted label\\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "722f51da",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix\n",
    "\n",
    "cf = confusion_matrix(y_true, y_pred)\n",
    "plot_confusion_matrix(cf, normalize=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a017c7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report\n",
    "print(classification_report(y_true, y_pred, target_names=['0','1']))"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9"
  },
  "kernelspec": {
   "display_name": "conda_pytorch_latest_p37",
   "language": "python",
   "name": "conda_pytorch_latest_p37"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}