# Lab 1-1: Korean NER (Named Entity Recognition) Training on AWS


## Introduction
---

본 모듈에서는 허깅페이스 트랜스포머(Hugging Face transformers) 라이브러리를 사용하여 한국어 개체명 인식기(Korean NER; Named Entity Recognition)를 훈련합니다. NER은 문장에서 개체명(엔터티)을 찾는 작업으로 각 토큰 마다 개체명에 해당하는 정답 레이블을 찾는 다운스트림 태스크입니다.

***[Note] SageMaker Studio Lab, SageMaker Studio, SageMaker 노트북 인스턴스, 또는 여러분의 로컬 머신에서 이 데모를 실행할 수 있습니다. SageMaker Studio Lab을 사용하는 경우 GPU를 활성화하세요.***

***[주의] 본 데이터셋은 상업적인 목적으로 사용할 수 없습니다. 본 핸즈온은 연구/참고용으로만 활용하시고, 모델 훈련은 여러분만의 데이터셋을 직접 생성하셔야 합니다.***

### References
- Hugging Face Tutorial: https://huggingface.co/docs/transformers/training
- 네이버, 창원대가 함께하는 NLP Challenge GitHub: https://github.com/naver/nlp-challenge
- 네이버, 창원대가 함께하는 NLP Challenge 리더보드 및 라이센스: http://air.changwon.ac.kr/?page_id=10


## 1. Setup Environments
---

### Import modules

In [1]:
!rm -rf chkpt data model ner_train ner_valid train_data

In [2]:
import os
import sys
import json
import logging
import argparse
import torch
from torch import nn
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from transformers import (
    BertTokenizer, BertTokenizerFast, BertConfig, BertForTokenClassification, 
    Trainer, TrainingArguments, set_seed
)
from transformers.trainer_utils import get_last_checkpoint

logging.basicConfig(
    level=logging.INFO, 
    format='[{%(filename)s:%(lineno)d} %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)

### Argument parser

In [3]:
def parser_args(train_notebook=False):
    parser = argparse.ArgumentParser()

    # Default Setting
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=100)
    parser.add_argument("--learning_rate", type=str, default=5e-5)
    parser.add_argument("--disable_tqdm", type=bool, default=False)
    #parser.add_argument("--fp16", type=bool, default=True)
    parser.add_argument("--tokenizer_id", type=str, default='bert-base-multilingual-cased')
    #parser.add_argument("--model_id", type=str, default='distilbert-base-multilingual-cased')    
    parser.add_argument("--model_id", type=str, default='bert-base-multilingual-cased')
    
    # SageMaker Container environment
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--train_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--valid_dir", type=str, default=os.environ["SM_CHANNEL_VALID"])
    parser.add_argument('--chkpt_dir', type=str, default='/opt/ml/checkpoints')     

    if train_notebook:
        args = parser.parse_args([])
    else:
        args = parser.parse_args()
    return args

In [4]:
train_dir = 'ner_train'
valid_dir = 'ner_valid'
!rm -rf {train_dir} {valid_dir}
os.makedirs(train_dir, exist_ok=True)
os.makedirs(valid_dir, exist_ok=True) 

### Load Arguments

주피터 노트북에서 곧바로 실행할 수 있도록 설정값들을 로드합니다. 물론 노트북 환경이 아닌 커맨드라인에서도 `cd scripts & python3 train.py` 커맨드로 훈련 스크립트를 실행할 수 있습니다.

In [5]:
chkpt_dir = 'chkpt'
model_dir = 'model'
output_data_dir = 'data'
num_gpus = torch.cuda.device_count()

!rm -rf {chkpt_dir} {model_dir} {output_data_dir} 

if os.environ.get('SM_CURRENT_HOST') is None:
    is_sm_container = False

    #src_dir = '/'.join(os.getcwd().split('/')[:-1])
    src_dir = os.getcwd()
    os.environ['SM_MODEL_DIR'] = f'{src_dir}/{model_dir}'
    os.environ['SM_OUTPUT_DATA_DIR'] = f'{src_dir}/{output_data_dir}'
    os.environ['SM_NUM_GPUS'] = str(num_gpus)
    os.environ['SM_CHANNEL_TRAIN'] = f'{src_dir}/{train_dir}'
    os.environ['SM_CHANNEL_VALID'] = f'{src_dir}/{valid_dir}'

args = parser_args(train_notebook=True) 
args.chkpt_dir = chkpt_dir
logger.info("***** Arguments *****")
logger.info(''.join(f'{k}={v}\n' for k, v in vars(args).items()))

os.makedirs(args.chkpt_dir, exist_ok=True) 
os.makedirs(args.model_dir, exist_ok=True)
os.makedirs(args.output_data_dir, exist_ok=True) 

[{204499775.py:21} INFO - ***** Arguments *****
[{204499775.py:22} INFO - epochs=3
seed=42
train_batch_size=32
eval_batch_size=64
warmup_steps=100
learning_rate=5e-05
disable_tqdm=False
tokenizer_id=bert-base-multilingual-cased
model_id=bert-base-multilingual-cased
output_data_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/named-entity-recognition/data
model_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/named-entity-recognition/model
n_gpus=4
train_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/named-entity-recognition/ner_train
valid_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/named-entity-recognition/ner_valid
chkpt_dir=chkpt



<br>

## 2. Preparation
---

### Dataset

본 핸즈온에서 사용할 데이터셋은 블라블라입니다. 


In [6]:
!wget https://github.com/naver/nlp-challenge/raw/master/missions/ner/data/train/train_data

--2022-07-05 22:45:03--  https://github.com/naver/nlp-challenge/raw/master/missions/ner/data/train/train_data
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/naver/nlp-challenge/master/missions/ner/data/train/train_data [following]
--2022-07-05 22:45:03--  https://raw.githubusercontent.com/naver/nlp-challenge/master/missions/ner/data/train/train_data
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16945023 (16M) [text/plain]
Saving to: ‘train_data’


2022-07-05 22:45:03 (223 MB/s) - ‘train_data’ saved [16945023/16945023]



`_B` / `_I`로 끝나는 개체명은 해당 단어가 개체명의 시작과 내부에 있음을 의미하며, 빈 칸이나 단일 문자 개체명(예: `O`, `-`)은 해당 단어가 어떤 개체명에도 해당하지 않음을 의미합니다.

인덱스의 경우 1,2,3,...으로 순차적으로 증가하다가 다시 1,2,...로 변경되는데 인덱스 1은 새로운 문장의 첫 단어를 의미합니다.

In [7]:
raw_data = pd.read_csv("train_data", names=['src', 'tar'], sep="\t")
raw_data = raw_data.reset_index()
raw_data.head(15)

Unnamed: 0,index,src,tar
0,1,비토리오,PER_B
1,2,양일,DAT_B
2,3,만에,-
3,4,영사관,ORG_B
4,5,감호,CVL_B
5,6,"용퇴,",-
6,7,항룡,-
7,8,압력설,-
8,9,의심만,-
9,10,가율,-


### Data Cleansing

한글, 숫자, 영어, . 외의 단어들을 모두 제거합니다.

In [8]:
raw_data['src'] = raw_data['src'].str.replace("．", ".", regex=False)
raw_data['src'] = raw_data['src'].astype(str)
raw_data['tar'] = raw_data['tar'].astype(str)
raw_data['src'] = raw_data['src'].str.replace(r'[^ㄱ-ㅣ가-힣0-9a-zA-Z.]+', "", regex=True)

개체명-숫자 레이블 간 상호 변환을 위한 딕셔너리를 생성합니다. 

In [9]:
unique_tags = raw_data['tar'].unique().tolist()
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
logger.info(tag2id)
logger.info(id2tag)

[{99085830.py:4} INFO - {'PER_B': 0, 'DAT_B': 1, '-': 2, 'ORG_B': 3, 'CVL_B': 4, 'NUM_B': 5, 'LOC_B': 6, 'EVT_B': 7, 'TRM_B': 8, 'TRM_I': 9, 'EVT_I': 10, 'PER_I': 11, 'CVL_I': 12, 'NUM_I': 13, 'TIM_B': 14, 'TIM_I': 15, 'ORG_I': 16, 'DAT_I': 17, 'ANM_B': 18, 'MAT_B': 19, 'MAT_I': 20, 'AFW_B': 21, 'FLD_B': 22, 'LOC_I': 23, 'AFW_I': 24, 'PLT_B': 25, 'FLD_I': 26, 'ANM_I': 27, 'PLT_I': 28}
[{99085830.py:5} INFO - {0: 'PER_B', 1: 'DAT_B', 2: '-', 3: 'ORG_B', 4: 'CVL_B', 5: 'NUM_B', 6: 'LOC_B', 7: 'EVT_B', 8: 'TRM_B', 9: 'TRM_I', 10: 'EVT_I', 11: 'PER_I', 12: 'CVL_I', 13: 'NUM_I', 14: 'TIM_B', 15: 'TIM_I', 16: 'ORG_I', 17: 'DAT_I', 18: 'ANM_B', 19: 'MAT_B', 20: 'MAT_I', 21: 'AFW_B', 22: 'FLD_B', 23: 'LOC_I', 24: 'AFW_I', 25: 'PLT_B', 26: 'FLD_I', 27: 'ANM_I', 28: 'PLT_I'}


모델 추론 시에 개체명을 알기 쉽게 표현하기 위해 개체명-개체 설명 변환 딕셔너리를 생성합니다. 

In [10]:
def get_tag2entity(tag2id):
    tag2entity = {}
    
    for idx, (tag, id) in enumerate(tag2id.items()):
        if "PER" in tag:
            entity = 'Person'
        elif "FLD" in tag:
            entity = "Field"
        elif "AFW" in tag:
            entity = "Artifacts_works"
        elif "ORG" in tag:
            entity = "Organization"
        elif "LOC" in tag:
            entity = "Location"
        elif "CVL" in tag:
            entity = "Civilization"
        elif "DAT" in tag:
            entity = "Date"
        elif "TIM" in tag:
            entity = "Time"
        elif "NUM" in tag:
            entity = "Number"
        elif "EVT" in tag:
            entity = "Event"
        elif "ANM" in tag:
            entity = "Animal"
        elif "PLT" in tag:
            entity = "Plant"
        elif "MAT" in tag:
            entity = "Material"
        elif "TRM" in tag:
            entity = "Term"
        else:
            entity = tag

        tag2entity[tag] = entity
    return tag2entity

tag2entity = get_tag2entity(tag2id)

데이터를 문장들과 개체들로 분리합니다. 

In [11]:
tups = []
temp_tup = []
data = [list(x) for x in raw_data[['index', 'src', 'tar']].to_numpy()]

for idx, (i, token, entity) in enumerate(data):
    if i == 1:  # first token in a sentence
        if idx != 0: 
            tups.append(temp_tup)
        temp_tup = []
    temp_tup.append((token, tag2id[entity]))

In [12]:
sentences = []
labels = []

for tup in tups:
    sentence = []
    label = []
    
    sentence.append("[CLS]")
    label.append(tag2id['-'])
    
    for t, l in tup:
        sentence.append(t)
        label.append(l)
        
    sentence.append("[SEP]")
    label.append(tag2id['-'])
    
    sentences.append(sentence)
    labels.append(label)

In [13]:
sentences[0], labels[0]

(['[CLS]',
  '비토리오',
  '양일',
  '만에',
  '영사관',
  '감호',
  '용퇴',
  '항룡',
  '압력설',
  '의심만',
  '가율',
  '[SEP]'],
 [2, 0, 1, 2, 3, 4, 2, 2, 2, 2, 2, 2])

<br>

## 3. Construct Feature set
---

### Tokenization and Labeling

#### Tokenization
연어 처리 모델을 훈련하려면, 토큰화(Tokenization)를 통해 말뭉치(corpus; 자연어 처리를 위한 대량의 텍스트 데이터)를 토큰 시퀀스로 나누는 과정이 필요합니다. BERT 이전의 자연어 처리 모델은 주로 도메인 전문가들이 직접 토큰화해놓은 토크아니저(Mecab, Kkma 등)들을 사용했지만, BERT를 훈련하기 위한 토크나이저는 도메인 지식 필요 없이 말뭉치에서 자주 등장하는 서브워드(subword)를 토큰화합니다. GPT 기반 모델은 BPE(Byte-pair Encoding)라는 통계적 기법을 사용하며, BERT 및 ELECTRA 기반 모델은 BPE와 유사한 Wordpiece를 토크나이저로 사용합니다.

#### Labeling
NER task의 경우, 정답 태그가 토큰 단위가 아니라 단어 단위로 부여되어 있기에, 이를 토큰 단위로 확장해야 합니다.

```python

tokenizer.tokenize('2006년 아마존')
>> ['2006년', '아', '##마', '##존']

원본 레이블링:
(2006년, 'DAT'), ('아마존', 'ORG')

토큰 단위 레이블링 변환:
(2006년, 'DAT'), ('아', 'ORG'), ('##마', 'ORG'), ('##존', 'ORG')    
```

In [14]:
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer, max_len=128):

    """
    Word piece tokenization makes it difficult to match word labels
    back up with individual word pieces. This function tokenizes each
    word one at a time so that it is easier to preserve the correct
    label for each subword. It is, of course, a bit slower in processing
    time, but it will help our model achieve higher accuracy.
    """

    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        
        if len(word) > max_len:
            word = word[:max_len]
            
        # Tokenize the word and count # of subwords the word is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels

In [15]:
%%time
from itertools import repeat
from multiprocessing import cpu_count, Pool

num_cores = 16 if cpu_count() > 16 else cpu_count()    
tokenizer = BertTokenizerFast.from_pretrained(args.tokenizer_id)

with Pool(processes=num_cores) as pool:
    tokenized_texts_and_labels = pool.starmap(tokenize_and_preserve_labels, 
                                              zip(sentences, labels, repeat(tokenizer)))

CPU times: user 3.05 s, sys: 369 ms, total: 3.41 s
Wall time: 3.97 s


### Converting Input Ids and Labels

토큰과 엔티티를 토큰에 대응하는 인덱스와 엔티티에 대응하는 레이블로 변환합니다.

In [16]:
ids, labels = [], []

max_len = 128
for t, l in tokenized_texts_and_labels:
    if len(l) > max_len:
        t = t[:max_len]
        l = l[:max_len]
        
    ids.append(torch.tensor(tokenizer.convert_tokens_to_ids(t)))
    labels.append(torch.tensor(l))

패딩(padding)을 수행합니다. `'[PAD]'` 토큰에 대응하는 레이블은 보통 -100입니다.

In [17]:
from torch.nn.utils.rnn import pad_sequence
ids = pad_sequence(ids, batch_first=True, padding_value=0)
labels = pad_sequence(labels, batch_first=True, padding_value=-100)
attention_masks = (labels != -100).long()

### Split into Training set and validation set

In [18]:
random_state = 42
test_size = 0.2
train_ids, valid_ids, train_labels, valid_labels = train_test_split(ids, labels,
                                             random_state=random_state, test_size=test_size)
train_attention_masks, valid_attention_masks, _, _ = train_test_split(attention_masks, labels,
                                             random_state=random_state, test_size=test_size)

### Save Training/Evaluation data and metadata

In [19]:
train_dict = {'input_ids': train_ids, 'attention_mask': train_attention_masks, 'labels': train_labels}
valid_dict = {'input_ids': valid_ids, 'attention_mask': valid_attention_masks, 'labels': valid_labels}

torch.save(train_dict, os.path.join(train_dir, 'train_features.pt'))

with open(os.path.join(train_dir, 'tag2id.json'), 'w') as f:
    json.dump(tag2id, f)    

with open(os.path.join(train_dir, 'id2tag.json'), 'w') as f:
    json.dump(id2tag, f)

with open(os.path.join(train_dir, 'tag2entity.json'), 'w') as f:
    json.dump(tag2entity, f)             

torch.save(valid_dict, os.path.join(valid_dir, 'valid_features.pt'))

### Custom Dataset

훈련/검증 시에 사용할 커스텀 데이터셋을 생성하기 위한 클래스를 생성합니다. BERT 기반 모델은 보통 아래의 입력값을 사용하며, 단일 문장만 사용한다면 `token_type_ids`를 생략해도 무방합니다.
- `input_ids`: 문장이 인덱스(특정 vocab에 매핑하는 숫자값)로 구성된 토큰 시퀀스로 변환된 결괏값
- `attention_mask` : 해당 토큰이 패딩 토큰인지, 아닌지를 마스킹
- `token_type_ids`: 세그먼트 (두 문장 입력 시, 첫번째 문장인지 아닌지를 마스킹)
- `labels`: 해당 토큰에 매핑되는 정답 개체 레이블 

In [20]:
class NERDataset(torch.utils.data.Dataset):
    def __init__(self, input_ids, attention_masks, labels=None, max_len=128):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels
        self.max_len = max_len
        
    def __getitem__(self, idx):
        item = {}
        item['input_ids'] = self.input_ids[idx]
        item['attention_mask'] = self.attention_masks[idx]
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

In [21]:
train_dataset = NERDataset(train_ids, train_attention_masks, train_labels)
valid_dataset = NERDataset(valid_ids, valid_attention_masks, valid_labels)
logger.info(f'num_train samples={len(train_dataset)}, num_valid samples={len(valid_dataset)}')

[{1494477010.py:3} INFO - num_train samples=52416, num_valid samples=13104


<br>

## 4. Training
---

### Define Custom metric
특정 시점마다(예: epoch, steps) 검증 데이터셋으로 정밀도(precision), 재현율(recall), F1 스코어, 정확도(accuracy)를 계산하기 위한 커스텀 함수를 정의합니다.

In [22]:
def compute_metrics(p):
    logits = p.predictions
    labels = p.label_ids.ravel()
    preds = logits.argmax(-1).ravel()
    
    preds = preds[labels != -100]
    labels = labels[labels != -100]

    from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
    prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average='micro')
    acc = accuracy_score(labels, preds)

    metrics = {
        'precision': prec,
        'recall': rec,
        'f1': f1,        
        'accuracy': acc
    }
    
    return metrics

### Training Preparation

본 핸즈온은 허깅페이스의 트랜스포머 라이브러리에 포함된 BertForTokenClassification 모델을 사용합니다. 이 모델은 문장 레벨이 아닌 토큰 레벨로 예측을 수행합니다.

In [23]:
model = BertForTokenClassification.from_pretrained(args.model_id, num_labels=len(tag2id))
model.config.id2label = id2tag
model.config.label2id = tag2id

training_args = TrainingArguments(
    output_dir=args.chkpt_dir,          # output directory
    overwrite_output_dir=True if get_last_checkpoint(args.chkpt_dir) is not None else False,
    num_train_epochs=args.epochs,              # total number of training epochs
    per_device_train_batch_size=args.train_batch_size,  # batch size per device during training
    per_device_eval_batch_size=args.eval_batch_size,   # batch size for evaluation
    warmup_steps=args.warmup_steps,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir=f"{args.output_data_dir}/logs",            # directory for storing logs
    eval_steps=100,
    learning_rate=float(args.learning_rate),
    #load_best_model_at_end=True,
    save_strategy="epoch",
    evaluation_strategy="steps",
    metric_for_best_model="f1",
)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at 

훈련을 수행하기 위한 `Trainer` 클래스를 인스턴스화합니다.

In [24]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset, 
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

### Training
훈련을 수행합니다. 딥러닝 기반 자연어 처리 모델 훈련에는 GPU가 필수이며, 본격적인 훈련을 위해서는 멀티 GPU 및 분산 훈련을 권장합니다. 만약 멀티 GPU가 장착되어 있다면 Trainer에서 총 배치 크기 = 배치 크기 x GPU 개수로 지정한 다음 데이터 병렬화를 자동으로 수행합니다.

In [25]:
%%time
# train model
if get_last_checkpoint(args.chkpt_dir) is not None:
    logger.info("***** Continue Training *****")
    last_checkpoint = get_last_checkpoint(args.chkpt_dir)
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()

***** Running training *****
  Num examples = 52416
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 1230


Step,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
100,No log,0.6643,0.812209,0.812209,0.812209,0.812209
200,No log,0.522309,0.85177,0.85177,0.85177,0.85177
300,No log,0.46102,0.866471,0.866471,0.866471,0.866471
400,No log,0.416925,0.876684,0.876684,0.876684,0.876684
500,0.673100,0.407849,0.880267,0.880267,0.880267,0.880267
600,0.673100,0.393547,0.882919,0.882919,0.882919,0.882919
700,0.673100,0.380853,0.887344,0.887344,0.887344,0.887344
800,0.673100,0.362687,0.8914,0.8914,0.8914,0.8914
900,0.673100,0.373592,0.890279,0.890279,0.890279,0.890279
1000,0.343400,0.367624,0.891256,0.891256,0.891256,0.891256


***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
Saving model checkpoint to chkpt/checkpoint-410
Configuration saved in chkpt/checkpoint-410/config.json
Model weights saved in chkpt/checkpoint-410/pytorch_model.bin
tokenizer config file saved in chkpt/checkpoint-410/tokenizer_config.json
Special tokens file saved in chkpt/checkpoint-410/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
***** Running Evaluation *****
  Num examples = 13104
  Batch size = 256
Saving model checkpoint to chkpt/checkpoint-820
Configuration saved in chkpt/checkpoint-820/config.jso

CPU times: user 19min 35s, sys: 2min 57s, total: 22min 33s
Wall time: 14min 35s


<br>

## 5. Evaluation
---

평가를 수행합니다.

In [26]:
outputs = trainer.predict(valid_dataset)
eval_results = outputs.metrics

# writes eval result to file which can be accessed later in s3 ouput
with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
    print(f"***** Evaluation results at {args.output_data_dir} *****")
    for key, value in sorted(eval_results.items()):
        writer.write(f"{key} = {value}\n")
        logger.info(f"{key} = {value}\n")

***** Running Prediction *****
  Num examples = 13104
  Batch size = 256


***** Evaluation results at /home/ec2-user/SageMaker/sm-kornlp-usecases/named-entity-recognition/data *****
[{2675283658.py:10} INFO - test_accuracy = 0.8945943620394088

[{2675283658.py:10} INFO - test_f1 = 0.8945943620394088

[{2675283658.py:10} INFO - test_loss = 0.35873326659202576

[{2675283658.py:10} INFO - test_precision = 0.8945943620394088

[{2675283658.py:10} INFO - test_recall = 0.8945943620394088

[{2675283658.py:10} INFO - test_runtime = 15.2768

[{2675283658.py:10} INFO - test_samples_per_second = 857.77

[{2675283658.py:10} INFO - test_steps_per_second = 3.404



In [27]:
compute_metrics(outputs)

{'precision': 0.8945943620394088,
 'recall': 0.8945943620394088,
 'f1': 0.8945943620394088,
 'accuracy': 0.8945943620394088}

In [28]:
pred_logits = outputs.predictions
true = outputs.label_ids.ravel()
pred = pred_logits.argmax(-1).ravel()
pred = pred[true != -100]
true = true[true != -100]

true_tag = [id2tag[x] for x in true]
pred_tag = [id2tag[x] for x in pred]

In [29]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
print(classification_report(true_tag, pred_tag, labels=unique_tags))

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

       PER_B       0.85      0.85      0.85     22546
       DAT_B       0.90      0.88      0.89      8588
           -       0.94      0.96      0.95    280106
       ORG_B       0.86      0.85      0.85     28424
       CVL_B       0.78      0.76      0.77     31916
       NUM_B       0.94      0.94      0.94     22513
       LOC_B       0.79      0.75      0.77     12026
       EVT_B       0.78      0.79      0.78      7349
       TRM_B       0.78      0.70      0.73     12164
       TRM_I       0.56      0.46      0.50      1721
       EVT_I       0.73      0.79      0.76      3441
       PER_I       0.75      0.72      0.74      2972
       CVL_I       0.47      0.39      0.42      1724
       NUM_I       0.71      0.80      0.75      2857
       TIM_B       0.79      0.87      0.83      1133
       TIM_I       0.89      0.94      0.92       442
       ORG_I       0.65      0.63      0.64      3122
       DAT_I       0.82    

  _warn_prf(average, modifier, msg_start, len(result))


<br>

## 6. Prediction
---

In [30]:
def predict_fn_token(example, model):
 
    inputs_with_offsets = tokenizer(example, return_offsets_mapping=True, return_tensors='pt')
    inputs = inputs_with_offsets.copy()
    inputs.pop('offset_mapping')
    tokens = inputs_with_offsets.tokens()
    offsets = inputs_with_offsets["offset_mapping"]
        
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = inputs.to(device)
    model = model.to(device)
    
    output = model(**inputs)
    softmax_fn = nn.Softmax(dim=-1)
    softmax_output = softmax_fn(output['logits'])
    prob, pred = torch.max(softmax_output, dim=-1)
    pred_str_lst = [model.config.id2label[id] for id in pred.squeeze().tolist()]
    prob_lst, pred_lst = prob.squeeze().tolist(), pred.squeeze().tolist()
    offsets_lst = offsets.squeeze().tolist()
    
    df = pd.DataFrame(zip(tokens, pred_str_lst, prob_lst, pred_lst, offsets_lst), 
                  columns=['token', 'tag', 'score', 'label', 'offset'])
    
    return df


def predict_fn_word(example, model):

    from transformers import pipeline
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device_id = -1 if device.type == "cpu" else 0
    
    nlp = pipeline("ner", model=model.to(device), device=device_id, 
                   tokenizer=tokenizer, aggregation_strategy='average')
    results = nlp(example)
    
    entity_lst, score_lst, word_lst, start_lst, end_lst = [], [], [], [], []
    tag2entity[''] = '-'

    for result in results:
        entity = tag2entity[result['entity_group']]
        score = result['score']
        word = result['word']
        start = result['start']
        end = result['end']

        entity_lst.append(entity)
        score_lst.append(score)
        word_lst.append(word)
        start_lst.append(start)
        end_lst.append(end)

    df = pd.DataFrame(zip(word_lst, entity_lst, score_lst, start_lst, end_lst), 
                      columns=['word', 'entity', 'score', 'start', 'end'])
    return df

In [31]:
example = '잉글랜드 프로축구 프리미어리그 토트넘 홋스퍼가 손흥민의 A매치 100경기 이상 출전 센추리클럽 가입에 축하를 보냈다.'

### Prediction by token

토큰 단위로 예측을 수행하므로 모델의 결괏값은 단어 단위가 아닙니다. 단어 단위로 예측하는 방법은 아래 섹션을 확인하세요.

In [32]:
token_df = predict_fn_token(example, model)
token_df.head(20)

Unnamed: 0,token,tag,score,label,offset
0,[CLS],-,0.999501,2,"[0, 0]"
1,잉,LOC_B,0.951433,6,"[0, 1]"
2,##글,LOC_B,0.956234,6,"[1, 2]"
3,##랜드,LOC_B,0.962042,6,"[2, 4]"
4,프로,CVL_B,0.731096,4,"[5, 7]"
5,##축,CVL_B,0.785559,4,"[7, 8]"
6,##구,CVL_B,0.755286,4,"[8, 9]"
7,프,ORG_B,0.497772,3,"[10, 11]"
8,##리,ORG_B,0.487815,3,"[11, 12]"
9,##미,ORG_B,0.498659,3,"[12, 13]"


### Prediction by word

pipeline 메소드와 score를 어떻게 집계할 것인지를 지정하는 aggregation_strategy를 를 사용하여 단어 단위로 예측을 수행합니다.
FastTokenizer를 사용하면 밑바닥부터 구현할 필요 없이, 편리하게 단어 단위로 예측 결과를 얻을 수 있습니다.

In [33]:
word_df = predict_fn_word(example, model)
word_df

Unnamed: 0,word,entity,score,start,end
0,잉글랜드,Location,0.95657,0,4
1,프로축구,Civilization,0.757313,5,9
2,프리미어리그 토트넘,Organization,0.73356,10,20
3,홋스퍼가,Organization,0.975155,21,25
4,손흥민의,Person,0.994924,26,30
5,A매치,Event,0.936316,31,34
6,100경기,Number,0.996811,35,40
7,이상 출전,-,0.944591,41,46
8,센추리클럽,Organization,0.979096,47,52
9,가입에 축하를 보냈다.,-,0.997992,53,65


### Example
여러분만의 샘플 문장을 만들어서 자유롭게 추론을 수행해 보세요.

In [34]:
example = '아마존 SageMaker는 머신 러닝 통합 엔드투엔드 관리형 서비스로 2017년 런칭되었다.'
word_df = predict_fn_word(example, model)
word_df

Unnamed: 0,word,entity,score,start,end
0,아마존,Organization,0.693333,0,3
1,SageMaker는 머신,Term,0.709337,4,17
2,러닝,Term,0.552327,18,20
3,통합,-,0.768233,21,23
4,엔드투엔드,Term,0.454221,24,29
5,관리형 서비스로,-,0.648227,30,38
6,2017년,Date,0.990274,39,44
7,런칭되었다.,-,0.999204,45,51
