{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3c130782-9478-4cb4-b6ab-852cfbc07d4c",
   "metadata": {},
   "source": [
    "# Lab 1: Korean NLI (Natural Language Inference) Training on AWS\n",
    "\n",
    "## Introduction\n",
    "---\n",
    "\n",
    "본 모듈에서는 허깅페이스 트랜스포머(Hugging Face transformers) 라이브러리를 사용하여 한국어 자연어 추론 (Korean NLI; Natural Language Inference) 쌍을 훈련합니다. 자연어 추론은 전제(premise)와 가설(hypothesis)이 포함된 두 문장 사이에서 전제가 참이라고 가정할 때, 가설의 연결이 참인지(entailment), 모순이 있는지(contradiction), 알 수 없는지(neutral)로 구별하는 다운스트림 태스크입니다.\n",
    "\n",
    "***[Note] SageMaker Studio Lab, SageMaker Studio, SageMaker 노트북 인스턴스, 또는 여러분의 로컬 머신에서 이 데모를 실행할 수 있습니다. SageMaker Studio Lab을 사용하는 경우 GPU를 활성화하세요.***\n",
    "\n",
    "### References\n",
    "- Hugging Face Tutorial: https://huggingface.co/docs/transformers/tasks/question_answering\n",
    "- Fine-tuning with custom datasets: https://huggingface.co/transformers/v4.11.3/custom_datasets.html#question-answering-with-squad-2-0\n",
    "- KorNLI datasets: https://github.com/kakaobrain/KorNLUDatasets/tree/master/KorNLI\n",
    "- KLUE: https://github.com/KLUE-benchmark/KLUE"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44cf4937-b073-4611-9982-f4be3b58efa2",
   "metadata": {},
   "source": [
    "\n",
    "## 1. Setup Environments\n",
    "---\n",
    "\n",
    "### Import modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fbb97a1c-4057-4e14-a1af-48a773879fd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import json\n",
    "import logging\n",
    "import argparse\n",
    "import torch\n",
    "from torch import nn\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "from transformers import (\n",
    "    AutoTokenizer, AutoModelForSequenceClassification,\n",
    "    Trainer, TrainingArguments, set_seed\n",
    ")\n",
    "from transformers.trainer_utils import get_last_checkpoint\n",
    "from datasets import load_dataset, load_metric, ClassLabel, Sequence\n",
    "\n",
    "logging.basicConfig(\n",
    "    level=logging.INFO, \n",
    "    format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s',\n",
    "    handlers=[\n",
    "        logging.StreamHandler(sys.stdout)\n",
    "    ]\n",
    ")\n",
    "logger = logging.getLogger(__name__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ba7a122-0df6-4022-a921-b07605d77973",
   "metadata": {},
   "source": [
    "### Argument parser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "36090222-56c2-4a51-b915-4801d12c80a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "def parser_args(train_notebook=False):\n",
    "    parser = argparse.ArgumentParser()\n",
    "\n",
    "    # Default Setting\n",
    "    parser.add_argument(\"--epochs\", type=int, default=3)\n",
    "    parser.add_argument(\"--seed\", type=int, default=42)\n",
    "    parser.add_argument(\"--train_batch_size\", type=int, default=32)\n",
    "    parser.add_argument(\"--eval_batch_size\", type=int, default=64)\n",
    "    parser.add_argument(\"--max_length\", type=int, default=384)\n",
    "    parser.add_argument(\"--stride\", type=int, default=64)\n",
    "    parser.add_argument(\"--warmup_steps\", type=int, default=100)\n",
    "    parser.add_argument(\"--logging_steps\", type=int, default=100)\n",
    "    parser.add_argument(\"--learning_rate\", type=str, default=3e-5)\n",
    "    parser.add_argument(\"--disable_tqdm\", type=bool, default=False)\n",
    "    parser.add_argument(\"--fp16\", type=bool, default=True)\n",
    "    parser.add_argument(\"--tokenizer_id\", type=str, default='klue/roberta-base')\n",
    "    parser.add_argument(\"--model_id\", type=str, default='klue/roberta-base')\n",
    "    \n",
    "    # SageMaker Container environment\n",
    "    parser.add_argument(\"--output_data_dir\", type=str, default=os.environ[\"SM_OUTPUT_DATA_DIR\"])\n",
    "    parser.add_argument(\"--model_dir\", type=str, default=os.environ[\"SM_MODEL_DIR\"])\n",
    "    parser.add_argument(\"--n_gpus\", type=str, default=os.environ[\"SM_NUM_GPUS\"])\n",
    "    parser.add_argument(\"--train_dir\", type=str, default=os.environ[\"SM_CHANNEL_TRAIN\"])\n",
    "    parser.add_argument(\"--valid_dir\", type=str, default=os.environ[\"SM_CHANNEL_VALID\"])\n",
    "    parser.add_argument('--chkpt_dir', type=str, default='/opt/ml/checkpoints')     \n",
    "\n",
    "    if train_notebook:\n",
    "        args = parser.parse_args([])\n",
    "    else:\n",
    "        args = parser.parse_args()\n",
    "    return args"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "acd57291-8742-46e9-9ff5-c7984eb4ca56",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_dir = 'nli_train'\n",
    "valid_dir = 'nli_valid'\n",
    "!rm -rf {train_dir} {valid_dir}\n",
    "os.makedirs(train_dir, exist_ok=True)\n",
    "os.makedirs(valid_dir, exist_ok=True) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2b42b7b-6486-44ec-b35f-94c718cf589e",
   "metadata": {},
   "source": [
    "### Load Arguments\n",
    "\n",
    "주피터 노트북에서 곧바로 실행할 수 있도록 설정값들을 로드합니다. 물론 노트북 환경이 아닌 커맨드라인에서도 `cd scripts & python3 train.py` 커맨드로 훈련 스크립트를 실행할 수 있습니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5659bafa-5ff5-44b0-8901-f31ceea25bfd",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{204499775.py:21} INFO - ***** Arguments *****\n",
      "[{204499775.py:22} INFO - epochs=3\n",
      "seed=42\n",
      "train_batch_size=32\n",
      "eval_batch_size=64\n",
      "max_length=384\n",
      "stride=64\n",
      "warmup_steps=100\n",
      "logging_steps=100\n",
      "learning_rate=3e-05\n",
      "disable_tqdm=False\n",
      "fp16=True\n",
      "tokenizer_id=klue/roberta-base\n",
      "model_id=klue/roberta-base\n",
      "output_data_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/nli/data\n",
      "model_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/nli/model\n",
      "n_gpus=4\n",
      "train_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/nli/nli_train\n",
      "valid_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/nli/nli_valid\n",
      "chkpt_dir=chkpt\n",
      "\n"
     ]
    }
   ],
   "source": [
    "chkpt_dir = 'chkpt'\n",
    "model_dir = 'model'\n",
    "output_data_dir = 'data'\n",
    "num_gpus = torch.cuda.device_count()\n",
    "\n",
    "!rm -rf {chkpt_dir} {model_dir} {output_data_dir} \n",
    "\n",
    "if os.environ.get('SM_CURRENT_HOST') is None:\n",
    "    is_sm_container = False\n",
    "\n",
    "    #src_dir = '/'.join(os.getcwd().split('/')[:-1])\n",
    "    src_dir = os.getcwd()\n",
    "    os.environ['SM_MODEL_DIR'] = f'{src_dir}/{model_dir}'\n",
    "    os.environ['SM_OUTPUT_DATA_DIR'] = f'{src_dir}/{output_data_dir}'\n",
    "    os.environ['SM_NUM_GPUS'] = str(num_gpus)\n",
    "    os.environ['SM_CHANNEL_TRAIN'] = f'{src_dir}/{train_dir}'\n",
    "    os.environ['SM_CHANNEL_VALID'] = f'{src_dir}/{valid_dir}'\n",
    "\n",
    "args = parser_args(train_notebook=True) \n",
    "args.chkpt_dir = chkpt_dir\n",
    "logger.info(\"***** Arguments *****\")\n",
    "logger.info(''.join(f'{k}={v}\\n' for k, v in vars(args).items()))\n",
    "\n",
    "os.makedirs(args.chkpt_dir, exist_ok=True) \n",
    "os.makedirs(args.model_dir, exist_ok=True)\n",
    "os.makedirs(args.output_data_dir, exist_ok=True) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1c4109c9-ae21-4f42-9c75-d8e13d8b6e51",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 2. Preparation & Custructing Feature set\n",
    "---\n",
    "\n",
    "### Dataset\n",
    "\n",
    "본 핸즈온에서 사용할 데이터셋은 KLUE-NLI로 허깅페이스의 dataset 라이브러리로 곧바로 로드할 수 있습니다.\n",
    "- KLUE: https://github.com/KLUE-benchmark/KLUE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4d2065a2-58b3-448a-9af3-e93dc92d146b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import ClassLabel, Sequence\n",
    "import random\n",
    "import pandas as pd\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "def show_random_elements(dataset, num_examples=10):\n",
    "    assert num_examples <= len(\n",
    "        dataset\n",
    "    ), \"Can't pick more elements than there are in the dataset.\"\n",
    "    picks = []\n",
    "    for _ in range(num_examples):\n",
    "        pick = random.randint(0, len(dataset) - 1)\n",
    "        while pick in picks:\n",
    "            pick = random.randint(0, len(dataset) - 1)\n",
    "        picks.append(pick)\n",
    "\n",
    "    df = pd.DataFrame(dataset[picks])\n",
    "    for column, typ in dataset.features.items():\n",
    "        if isinstance(typ, ClassLabel):\n",
    "            df[column] = df[column].transform(lambda i: typ.names[i])\n",
    "        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):\n",
    "            df[column] = df[column].transform(\n",
    "                lambda x: [typ.feature.names[i] for i in x]\n",
    "            )\n",
    "    display(HTML(df.to_html()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a2050410-9cf1-4e1d-8736-2af109599a92",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{builder.py:641} WARNING - Reusing dataset klue (/home/ec2-user/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c1d075e224724a62a7c0417d80b3343d",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/2 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>guid</th>\n",
       "      <th>source</th>\n",
       "      <th>premise</th>\n",
       "      <th>hypothesis</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>klue-nli-v1_train_03020</td>\n",
       "      <td>wikipedia</td>\n",
       "      <td>그는 1903년에 자신과 아내 파울리네 사이에 있었던 한 부부 싸움을 떠올렸다.</td>\n",
       "      <td>그는 1903년 다음 해에 파울리네와 부부가 되었다.</td>\n",
       "      <td>contradiction</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>klue-nli-v1_train_20075</td>\n",
       "      <td>airbnb</td>\n",
       "      <td>지금껏 다녔던 숙소보다 너무 좋았어요.</td>\n",
       "      <td>지금껏 다녔던 숙소 중에 제일 별로였어요.</td>\n",
       "      <td>contradiction</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>klue-nli-v1_train_19388</td>\n",
       "      <td>wikinews</td>\n",
       "      <td>조선민주주의인민공화국이 2013년 2월 12일, 제3차 핵실험을 성공하였다고 공식 발표하였다.</td>\n",
       "      <td>조선민주주의인민공화국은 핵실험에 실패한 적이 있다.</td>\n",
       "      <td>neutral</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>klue-nli-v1_train_03544</td>\n",
       "      <td>wikinews</td>\n",
       "      <td>그런데, 아직도 문제를 단순히 공식만으로 풀게 하고, 지루하게 계산만을 반복시키는 그런 수학, 이거 안 통합니다.</td>\n",
       "      <td>이제 더 이상 수학에 공식외우기는 없어야 합니다.</td>\n",
       "      <td>neutral</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>klue-nli-v1_train_03100</td>\n",
       "      <td>wikitree</td>\n",
       "      <td>그는 주머니에서 오만 원권 두 장을 꺼내 팬에게 건넸고, 공연장에 있던 사람들은 그의 행동에 환호했다.</td>\n",
       "      <td>그는 주머니에서 만원 권 두 장만을 꺼내 팬에게 건넸다.</td>\n",
       "      <td>contradiction</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>klue-nli-v1_train_03446</td>\n",
       "      <td>policy</td>\n",
       "      <td>그러나, 앞날을 결코 낙관할 수 없습니다.</td>\n",
       "      <td>낙관할 수 없는 앞날입니다.</td>\n",
       "      <td>entailment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>klue-nli-v1_train_01274</td>\n",
       "      <td>NSMC</td>\n",
       "      <td>감독이라는것도 정말 일정 자격 시험이 필요한게 아닐까 하는 생각이 들게 한다</td>\n",
       "      <td>일정 자격 시험을 통과해서 감독 자격을 받도록 해야 하는게 아닐까 하는 생각이 들게 한다.</td>\n",
       "      <td>entailment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>klue-nli-v1_train_22075</td>\n",
       "      <td>NSMC</td>\n",
       "      <td>타란티노가 연출했다면 더 잘 살렸을듯 연출이 각본을 따라가질못한다</td>\n",
       "      <td>타란티노가 연출했다면 더 좋았을 듯.</td>\n",
       "      <td>entailment</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>klue-nli-v1_train_09824</td>\n",
       "      <td>airbnb</td>\n",
       "      <td>빵이나 커피 잼 과일 맘껏 먹을 수 있고요</td>\n",
       "      <td>잼은 두 종류 뿐이고요.</td>\n",
       "      <td>neutral</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>klue-nli-v1_train_23753</td>\n",
       "      <td>wikitree</td>\n",
       "      <td>한편, 구는 수립된 안전관리계획을 책자로 제작해 관련 유관기관 등에 배부할 예정이다.</td>\n",
       "      <td>안전관리계획을 책자로 제작하는 곳은 유관기관이다.</td>\n",
       "      <td>contradiction</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "datasets = load_dataset('klue', 'nli')\n",
    "show_random_elements(datasets[\"train\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45a8820c-aed0-44c8-a740-ae191f2d33fc",
   "metadata": {},
   "source": [
    "### Tokenization\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bba537de-0c38-42e7-8706-d37c39064055",
   "metadata": {},
   "source": [
    "데이터셋을 토큰화합니다. 한 가지 눈여겨 보실 점이 `tokenizer()` 함수의 인자값에 `return_token_type_ids=False`를 지정한 점입니다.\n",
    "본 핸즈온에서는 RoBERTa 모델로 파인 튜닝을 수행하는데, RoBERTa는 BERT와 달리 다음 문장 여부를 나타내는 Next Sentence Prediction (NSP) 를 수행하지 않으므로, `token_type_ids`를 사용하지 않습니다.\n",
    "\n",
    "토큰화에 대한 자세한 내용은 https://huggingface.co/docs/datasets/process#processing-data-with-map 를 참조하세요."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "8672c8bc-069f-43c2-b2c5-0379404738b1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{arrow_dataset.py:2597} WARNING - Loading cached processed dataset at /home/ec2-user/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-990f8277f7ffe9bd.arrow\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2941e58f744f4ad7b8e839fa7870e4fa",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/3 [00:00<?, ?ba/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_id, use_fast=True)\n",
    "\n",
    "# tokenizer helper function\n",
    "def tokenize(examples, sentence1_key='premise', sentence2_key='hypothesis'):\n",
    "    return tokenizer(\n",
    "        examples[sentence1_key],\n",
    "        examples[sentence2_key],\n",
    "        max_length=384,\n",
    "        truncation=True,\n",
    "        return_token_type_ids=False,\n",
    "    )\n",
    "\n",
    "encoded_datasets = datasets.map(tokenize, batched=True)\n",
    "\n",
    "# set format for pytorch\n",
    "encoded_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30df8be3-54bc-450b-a922-3776fad4fa32",
   "metadata": {},
   "source": [
    "### Save Training/Evaluation data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7d90e20a-5e7e-452e-bbd0-31ce5c92fdd7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    }
   ],
   "source": [
    "train_dir = 'datasets/train'\n",
    "test_dir = 'datasets/test'\n",
    "!rm -rf {train_dir} {test_dir}\n",
    "\n",
    "os.makedirs(train_dir, exist_ok=True)\n",
    "os.makedirs(test_dir, exist_ok=True) \n",
    "\n",
    "train_dataset = encoded_datasets['train']\n",
    "valid_dataset = encoded_datasets['validation']\n",
    "\n",
    "if not os.listdir(args.train_dir):\n",
    "    train_dataset.save_to_disk(args.train_dir)\n",
    "if not os.listdir(args.valid_dir):\n",
    "    valid_dataset.save_to_disk(args.valid_dir)\n",
    "\n",
    "# from datasets import load_from_disk\n",
    "# train_dataset = load_from_disk(args.train_dir)\n",
    "# valid_dataset = load_from_disk(args.valid_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0703fcce-daff-469b-8b8f-d02800f469a2",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 3. Training\n",
    "---\n",
    "\n",
    "### Define Custom metric\n",
    "특정 시점마다(예: epoch, steps) 검증 데이터셋으로 정밀도(precision), 재현율(recall), F1 스코어, 정확도(accuracy)를 등의 지표를 계산하기 위한 커스텀 함수를 정의합니다.\n",
    "\n",
    "커스텀 함수의 첫번째 인자는 `EvalPrediction` 객체로, 예측값(`predictios`)과 정답값(`label_ids`)를 포함합니다. 자세한 내용은 아래 웹사이트를 참조하세요. \n",
    "https://huggingface.co/transformers/internal/trainer_utils.html#transformers.EvalPrediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "302eb226-faa5-42d2-a54b-108daaeb2724",
   "metadata": {},
   "outputs": [],
   "source": [
    "metric = load_metric(\"glue\", \"qnli\")\n",
    "\n",
    "def compute_metrics(eval_pred):\n",
    "    predictions, labels = eval_pred\n",
    "    predictions = np.argmax(predictions, axis=1)\n",
    "    return metric.compute(predictions=predictions, references=labels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "7993c770-c314-416c-80d4-ebaa146e1e7b",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Some weights of the model checkpoint at klue/roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias']\n",
      "- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
      "- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
      "Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at klue/roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias', 'classifier.out_proj.weight']\n",
      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
     ]
    }
   ],
   "source": [
    "model = AutoModelForSequenceClassification.from_pretrained(args.model_id, num_labels=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "580b677c-6555-48cf-8f32-29b95605feb0",
   "metadata": {},
   "source": [
    "### Training Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "43b490ff-8d18-4eb5-af31-ec7ebed1f88c",
   "metadata": {},
   "outputs": [],
   "source": [
    "training_args = TrainingArguments(\n",
    "    output_dir=args.chkpt_dir,\n",
    "    overwrite_output_dir=True if get_last_checkpoint(args.chkpt_dir) is not None else False,\n",
    "    num_train_epochs=args.epochs,\n",
    "    per_device_train_batch_size=args.train_batch_size,\n",
    "    per_device_eval_batch_size=args.eval_batch_size,\n",
    "    warmup_steps=args.warmup_steps,\n",
    "    weight_decay=0.01,\n",
    "    learning_rate=float(args.learning_rate),\n",
    "    evaluation_strategy=\"epoch\",`\n",
    "    save_strategy=\"epoch\",\n",
    "    metric_for_best_model=\"accuracy\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e4f9023b-50cb-4dcd-8678-4c801dd910b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer = Trainer(\n",
    "    model,\n",
    "    training_args,\n",
    "    train_dataset=train_dataset,\n",
    "    eval_dataset=valid_dataset,\n",
    "    tokenizer=tokenizer,\n",
    "    compute_metrics=compute_metrics,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4666c64-e05a-4dc1-b894-00d024a49590",
   "metadata": {},
   "source": [
    "### Training\n",
    "훈련을 수행합니다. 딥러닝 기반 자연어 처리 모델 훈련에는 GPU가 필수이며, 본격적인 훈련을 위해서는 멀티 GPU 및 분산 훈련을 권장합니다. 만약 멀티 GPU가 장착되어 있다면 Trainer에서 총 배치 크기 = 배치 크기 x GPU 개수로 지정한 다음 데이터 병렬화를 자동으로 수행합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "dcd5093e-dea5-4749-8848-b4ea03648fa7",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: source, guid, premise, hypothesis. If source, guid, premise, hypothesis are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.\n",
      "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning\n",
      "  warnings.warn(\n",
      "***** Running training *****\n",
      "  Num examples = 24998\n",
      "  Num Epochs = 3\n",
      "  Instantaneous batch size per device = 32\n",
      "  Total train batch size (w. parallel, distributed & accumulation) = 128\n",
      "  Gradient Accumulation steps = 1\n",
      "  Total optimization steps = 588\n",
      "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='588' max='588' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [588/588 03:59, Epoch 3/3]\n",
       "    </div>\n",
       "    <table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       " <tr style=\"text-align: left;\">\n",
       "      <th>Epoch</th>\n",
       "      <th>Training Loss</th>\n",
       "      <th>Validation Loss</th>\n",
       "      <th>Accuracy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.504815</td>\n",
       "      <td>0.818333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>No log</td>\n",
       "      <td>0.436733</td>\n",
       "      <td>0.847000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>0.440700</td>\n",
       "      <td>0.450579</td>\n",
       "      <td>0.852667</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: source, guid, premise, hypothesis. If source, guid, premise, hypothesis are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.\n",
      "***** Running Evaluation *****\n",
      "  Num examples = 3000\n",
      "  Batch size = 256\n",
      "Saving model checkpoint to chkpt/checkpoint-196\n",
      "Configuration saved in chkpt/checkpoint-196/config.json\n",
      "Model weights saved in chkpt/checkpoint-196/pytorch_model.bin\n",
      "tokenizer config file saved in chkpt/checkpoint-196/tokenizer_config.json\n",
      "Special tokens file saved in chkpt/checkpoint-196/special_tokens_map.json\n",
      "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: source, guid, premise, hypothesis. If source, guid, premise, hypothesis are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.\n",
      "***** Running Evaluation *****\n",
      "  Num examples = 3000\n",
      "  Batch size = 256\n",
      "Saving model checkpoint to chkpt/checkpoint-392\n",
      "Configuration saved in chkpt/checkpoint-392/config.json\n",
      "Model weights saved in chkpt/checkpoint-392/pytorch_model.bin\n",
      "tokenizer config file saved in chkpt/checkpoint-392/tokenizer_config.json\n",
      "Special tokens file saved in chkpt/checkpoint-392/special_tokens_map.json\n",
      "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n",
      "The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: source, guid, premise, hypothesis. If source, guid, premise, hypothesis are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.\n",
      "***** Running Evaluation *****\n",
      "  Num examples = 3000\n",
      "  Batch size = 256\n",
      "Saving model checkpoint to chkpt/checkpoint-588\n",
      "Configuration saved in chkpt/checkpoint-588/config.json\n",
      "Model weights saved in chkpt/checkpoint-588/pytorch_model.bin\n",
      "tokenizer config file saved in chkpt/checkpoint-588/tokenizer_config.json\n",
      "Special tokens file saved in chkpt/checkpoint-588/special_tokens_map.json\n",
      "\n",
      "\n",
      "Training completed. Do not forget to share your model on huggingface.co/models =)\n",
      "\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 4min 13s, sys: 45.1 s, total: 4min 58s\n",
      "Wall time: 4min 12s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "# train model\n",
    "if get_last_checkpoint(args.chkpt_dir) is not None:\n",
    "    logger.info(\"***** Continue Training *****\")\n",
    "    last_checkpoint = get_last_checkpoint(args.chkpt_dir)\n",
    "    trainer.train(resume_from_checkpoint=last_checkpoint)\n",
    "else:\n",
    "    trainer.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7e03dac-1731-4c8e-b490-034bb8ca2837",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 4. Evaluation\n",
    "---\n",
    "\n",
    "평가를 수행합니다."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "34f4daa1-89bb-4839-ba64-14d777db0901",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The following columns in the test set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: source, guid, premise, hypothesis. If source, guid, premise, hypothesis are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.\n",
      "***** Running Prediction *****\n",
      "  Num examples = 3000\n",
      "  Batch size = 256\n",
      "/home/ec2-user/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.\n",
      "  warnings.warn('Was asked to gather along dimension 0, but all '\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "      \n",
       "      <progress value='24' max='12' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      [12/12 00:09]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "***** Evaluation results at /home/ec2-user/SageMaker/sm-kornlp-usecases/nli/data *****\n",
      "[{4005816495.py:9} INFO - test_accuracy = 0.8526666666666667\n",
      "\n",
      "[{4005816495.py:9} INFO - test_loss = 0.4505786597728729\n",
      "\n",
      "[{4005816495.py:9} INFO - test_runtime = 2.825\n",
      "\n",
      "[{4005816495.py:9} INFO - test_samples_per_second = 1061.937\n",
      "\n",
      "[{4005816495.py:9} INFO - test_steps_per_second = 4.248\n",
      "\n"
     ]
    }
   ],
   "source": [
    "outputs = trainer.predict(valid_dataset)\n",
    "eval_results = outputs.metrics\n",
    "\n",
    "# writes eval result to file which can be accessed later in s3 ouput\n",
    "with open(os.path.join(args.output_data_dir, \"eval_results.txt\"), \"w\") as writer:\n",
    "    print(f\"***** Evaluation results at {args.output_data_dir} *****\")\n",
    "    for key, value in sorted(eval_results.items()):\n",
    "        writer.write(f\"{key} = {value}\\n\")\n",
    "        logger.info(f\"{key} = {value}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "26460bef-1dfc-4b25-8003-766ed7d6eda8",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_logits = outputs.predictions\n",
    "true = outputs.label_ids.ravel()\n",
    "pred = pred_logits.argmax(-1).ravel()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "9385c5d4-93e1-4407-a1e3-d933509447f6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "               precision    recall  f1-score   support\n",
      "\n",
      "   entailment       0.86      0.88      0.87      1000\n",
      "      neutral       0.84      0.85      0.85      1000\n",
      "contradiction       0.85      0.83      0.84      1000\n",
      "\n",
      "     accuracy                           0.85      3000\n",
      "    macro avg       0.85      0.85      0.85      3000\n",
      " weighted avg       0.85      0.85      0.85      3000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sklearn.metrics import precision_score, recall_score, f1_score, classification_report\n",
    "unique_labels = train_dataset.features['label'].names\n",
    "print(classification_report(true, pred, target_names=unique_labels))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f3d7544-507a-4ca4-802b-f8491022d202",
   "metadata": {},
   "source": [
    "<br>\n",
    "\n",
    "## 5. Prediction\n",
    "---\n",
    "\n",
    "여러분만의 샘플 문장을 만들어서 자유롭게 추론을 수행해 보세요."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "e9101731-9409-43a7-bb89-4691ae0dfc10",
   "metadata": {},
   "outputs": [],
   "source": [
    "idx = [i for i in range(len(train_dataset.features['label'].names))]\n",
    "classes = train_dataset.features['label'].names\n",
    "\n",
    "model.config.label2id = dict(zip(classes, idx))\n",
    "model.config.id2label = dict(zip(idx, classes))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "id": "7ceb8e53-9931-426c-9dad-44e237c74d5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import pipeline\n",
    "classifier = pipeline(\n",
    "    task=\"text-classification\",\n",
    "    model=model, \n",
    "    tokenizer=tokenizer,\n",
    "    top_k=1,\n",
    "    device=0\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "9b197f11-85c0-449e-b9ac-5d5e53430f12",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'label': 'contradiction', 'score': 0.9946433305740356}]"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example = f\"머신러닝은 쉽다. {tokenizer.sep_token} 머신러닝은 어렵다.\"\n",
    "classifier(sentences)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p38",
   "language": "python",
   "name": "conda_pytorch_p38"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}