# Lab 1: Sentence-BERT (SBERT) Training

### Fine-tuning Sentence-BERT (SBERT) & SBERT Embedding for applications
---


## Introduction
---

In this module, we fine-tune a Sentence-BERT model, which produces sentence embeddings, on STS (Semantic Textual Similarity) datasets.
The SentenceTransformers package makes fine-tuning straightforward. However, at the time of writing it does not support distributed training well, so if you need to fine-tune on a large dataset you will have to write custom training code yourself.

***[Note] You can run this demo on SageMaker Studio Lab, SageMaker Studio, a SageMaker notebook instance, or your local machine. If you use SageMaker Studio Lab, enable the GPU.***

### References

- Hugging Face Tutorial: https://huggingface.co/docs/transformers/training
- Sentence-BERT paper: https://arxiv.org/abs/1908.10084
- SentenceTransformers: https://www.sbert.net


## 1. Setup Environments
---

### Import modules

In [1]:
# !pip install sentence_transformers datasets faiss-gpu progressbar

In [2]:
import os
import sys
import json
import logging
import argparse
import torch
import gzip
import csv
import math
import urllib
from torch import nn
import numpy as np
import pandas as pd
from tqdm import tqdm

from datetime import datetime
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, LoggingHandler, losses, models, util
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import InputExample
from transformers.trainer_utils import get_last_checkpoint

logging.basicConfig(
 level=logging.INFO,
 format='%(asctime)s - %(message)s',
 datefmt='%Y-%m-%d %H:%M:%S',
 handlers=[LoggingHandler()]
)

logger = logging.getLogger(__name__)

### Argument parser

In [3]:
def parser_args(train_notebook=False):
    parser = argparse.ArgumentParser()

    # Default Setting
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=32)
    parser.add_argument("--warmup_steps", type=int, default=100)
    parser.add_argument("--logging_steps", type=int, default=100)
    # Parse the learning rate as a float so a command-line value is not kept as a string
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--disable_tqdm", type=bool, default=False)
    parser.add_argument("--fp16", type=bool, default=True)
    parser.add_argument("--tokenizer_id", type=str, default='sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens')
    parser.add_argument("--model_id", type=str, default='sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens')

    # SageMaker Container environment
    parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--train_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--valid_dir", type=str, default=os.environ["SM_CHANNEL_VALID"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument('--chkpt_dir', type=str, default='/opt/ml/checkpoints')

    # In a notebook there is no command line to parse, so use the defaults only
    if train_notebook:
        args = parser.parse_args([])
    else:
        args = parser.parse_args()
    return args

In [4]:
train_dir = 'train'
valid_dir = 'valid'
test_dir = 'test'
!rm -rf {train_dir} {valid_dir} {test_dir} 
os.makedirs(train_dir, exist_ok=True)
os.makedirs(valid_dir, exist_ok=True) 
os.makedirs(test_dir, exist_ok=True) 

### Load Arguments

Load the settings so that the code can be run directly in the Jupyter notebook. You can also run the training script from the command line, outside the notebook, with `cd scripts && python3 train.py`.
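
One caveat applies when the script is run from the command line: the parser above declares `--fp16` and `--disable_tqdm` with `type=bool`, and `argparse` then calls `bool()` on the raw string, so any non-empty value (including `"False"`) becomes `True`. The sketch below contrasts that behavior with a small converter. The helper `str2bool` and the flag name `--fp16_naive` are hypothetical and are not part of the original script.

```python
import argparse


def str2bool(value):
    """Convert common textual booleans to bool (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
# type=bool applies bool() to the string, so "False" is truthy
parser.add_argument("--fp16_naive", type=bool, default=True)
# The converter interprets the text instead
parser.add_argument("--fp16", type=str2bool, default=True)

args = parser.parse_args(["--fp16_naive", "False", "--fp16", "False"])
print(args.fp16_naive, args.fp16)
```

Inside the notebook the defaults are used unchanged, so this only matters when flags are passed explicitly.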

In [5]:
chkpt_dir = 'chkpt'
model_dir = 'model'
output_data_dir = 'data'
num_gpus = torch.cuda.device_count()

!rm -rf {chkpt_dir} {model_dir} {output_data_dir} 

if os.environ.get('SM_CURRENT_HOST') is None:
    is_sm_container = False

    #src_dir = '/'.join(os.getcwd().split('/')[:-1])
    src_dir = os.getcwd()
    os.environ['SM_MODEL_DIR'] = f'{src_dir}/{model_dir}'
    os.environ['SM_OUTPUT_DATA_DIR'] = f'{src_dir}/{output_data_dir}'
    os.environ['SM_NUM_GPUS'] = str(num_gpus)
    os.environ['SM_CHANNEL_TRAIN'] = f'{src_dir}/{train_dir}'
    os.environ['SM_CHANNEL_VALID'] = f'{src_dir}/{valid_dir}'
    os.environ['SM_CHANNEL_TEST'] = f'{src_dir}/{test_dir}'
 
args = parser_args(train_notebook=True) 
args.chkpt_dir = chkpt_dir
logger.info("***** Arguments *****")
logger.info(''.join(f'{k}={v}\n' for k, v in vars(args).items()))

os.makedirs(args.chkpt_dir, exist_ok=True) 
os.makedirs(args.model_dir, exist_ok=True)
os.makedirs(args.output_data_dir, exist_ok=True) 

2022-07-20 08:42:26 - ***** Arguments *****
2022-07-20 08:42:26 - epochs=1
seed=42
train_batch_size=32
eval_batch_size=32
warmup_steps=100
logging_steps=100
learning_rate=5e-05
disable_tqdm=False
fp16=True
tokenizer_id=salti/bert-base-multilingual-cased-finetuned-squad
model_id=salti/bert-base-multilingual-cased-finetuned-squad
output_data_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/data
model_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/model
n_gpus=4
train_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/train
valid_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/valid
test_dir=/home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/test
chkpt_dir=chkpt



<br>

## 2. Preparation
---
This hands-on lab uses two datasets: KorSTS (https://github.com/kakaobrain/KorNLUDatasets) and KLUE-STS (https://github.com/KLUE-benchmark/KLUE).
Training on a single dataset is fine, but training on both gives a slight performance improvement.

### Training Tips
SBERT training generally uses one of the following three approaches as a baseline.
1. Train on an NLI dataset
2. Train on an STS dataset
3. Train on an NLI dataset, then fine-tune on an STS dataset

For Korean data, models trained on STS predict better than NLI-based models even though STS has relatively little training data. We therefore follow the second approach. <br>
If you want somewhat better predictive performance than STS-only training, we recommend training on NLI data first and then continuing training on the STS dataset.

### Download the KLUE-STS dataset and build the feature set
Download the KLUE-STS dataset from the Hugging Face dataset hub, then build the feature set needed for SBERT training.

In [6]:
logger.info("Read KLUE-STS train/dev dataset")
datasets = load_dataset("klue", "sts")

train_samples = []
dev_samples = []

for phase in ["train", "validation"]:
    examples = datasets[phase]

    for example in examples:
        score = float(example["labels"]["label"]) / 5.0  # Normalize the similarity score to the 0.0-1.0 range
        inp_example = InputExample(texts=[example["sentence1"], example["sentence2"]], label=score)

        if phase == "validation":
            dev_samples.append(inp_example)
        else:
            train_samples.append(inp_example)

2022-07-20 08:42:30 - Read KLUE-STS train/dev dataset
2022-07-20 08:42:30 - Reusing dataset klue (/home/ec2-user/.cache/huggingface/datasets/klue/sts/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)


 0%| | 0/2 [00:00<?, ?it/s]


### Download the KorSTS dataset and build the feature set
The KorSTS dataset is also registered on Hugging Face, but with a view to use cases where you later combine it with your own custom datasets, we download the dataset from GitHub and use that copy.

In [7]:
repo = 'https://raw.githubusercontent.com/kakaobrain/KorNLUDatasets/master/KorSTS'
urllib.request.urlretrieve(f'{repo}/sts-train.tsv', filename=f'{args.train_dir}/sts-train.tsv')
urllib.request.urlretrieve(f'{repo}/sts-dev.tsv', filename=f'{args.valid_dir}/sts-dev.tsv')
urllib.request.urlretrieve(f'{repo}/sts-test.tsv', filename=f'{args.test_dir}/sts-test.tsv')

# !wget https://raw.githubusercontent.com/kakaobrain/KorNLUDatasets/master/KorSTS/sts-train.tsv -O {train_dir}/sts-train.tsv
# !wget https://raw.githubusercontent.com/kakaobrain/KorNLUDatasets/master/KorSTS/sts-dev.tsv -O {valid_dir}/sts-dev.tsv
# !wget https://raw.githubusercontent.com/kakaobrain/KorNLUDatasets/master/KorSTS/sts-test.tsv -O {test_dir}/sts-test.tsv

('/home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/test/sts-test.tsv',
 <http.client.HTTPMessage at 0x7fee52a7f970>)

In [8]:
logger.info("Read KorSTS train dataset")

with open(f'{args.train_dir}/sts-train.tsv', 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip rows where either sentence is missing
        if row["sentence1"] and row["sentence2"]:
            score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
            inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
            train_samples.append(inp_example)

logger.info("Read KorSTS dev dataset")
with open(f'{args.valid_dir}/sts-dev.tsv', 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["sentence1"] and row["sentence2"]:
            score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
            inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
            dev_samples.append(inp_example)

2022-07-20 08:42:36 - Read KorSTS train dataset
2022-07-20 08:42:36 - Read KorSTS dev dataset
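
The TSV parsing used for KorSTS can be tried on an in-memory sample without downloading anything. This is a minimal sketch: the sample rows and the helper name `read_sts_pairs` are made up for illustration, and the column layout is an assumption (the notebook itself only relies on the `score`, `sentence1`, and `sentence2` columns).

```python
import csv
import io

# Hypothetical sample in a KorSTS-like layout; the second row has an empty sentence2
SAMPLE_TSV = (
    "genre\tfilename\tyear\tid\tscore\tsentence1\tsentence2\n"
    "main-news\tsample\t2020\t0001\t5.000\tA plane is taking off.\tAn air plane is taking off.\n"
    "main-news\tsample\t2020\t0002\t2.500\tA man says \"hi\".\t\n"
    "main-news\tsample\t2020\t0003\t0.000\tA dog runs.\tA cat sleeps.\n"
)


def read_sts_pairs(handle, max_score=5.0):
    """Return (sentence1, sentence2, normalized_score) for rows with both sentences."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if row["sentence1"] and row["sentence2"]:
            pairs.append((row["sentence1"], row["sentence2"], float(row["score"]) / max_score))
    return pairs


pairs = read_sts_pairs(io.StringIO(SAMPLE_TSV))
for sentence1, sentence2, score in pairs:
    print(f"{score:.2f}  {sentence1} | {sentence2}")
```

The same function could be pointed at a custom dataset as long as it exposes those three columns.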


<br>

## 3. Training
---

### Training Preparation

### Model

In [9]:
model_name = 'sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens'

train_batch_size = args.train_batch_size
num_epochs = args.epochs
model_save_path = f'{args.model_dir}/training_sts_'+model_name.replace("/", "-")+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
logger.info(model_save_path)

# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name)

2022-07-20 08:43:44 - /home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/model/training_sts_sentence-transformers-xlm-r-100langs-bert-base-nli-stsb-mean-tokens-2022-07-20_08-43-44


Define a pooler to compute sentence embeddings. When BERT is used for a classification task, the output vector of the first [CLS] token serves as the embedding vector, whereas SBERT computes the embedding from the output vectors of all BERT tokens. Either mean pooling or max pooling can be used; this example uses mean pooling.
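
To make the three pooling options concrete, here is a minimal pure-Python sketch on a toy sequence. The vectors and the attention mask are invented, and the helper names are hypothetical; the actual `models.Pooling` module operates on batched tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens; padding positions (mask 0) are ignored."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_pool(token_vectors, attention_mask):
    """Take the element-wise maximum over the vectors of real tokens."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [max(column) for column in zip(*kept)]


# Three token positions with 2-dimensional outputs; the last position is padding
tokens = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 0]

cls_vec = tokens[0]                 # [CLS]-style: first token only
mean_vec = mean_pool(tokens, mask)  # average over the two real tokens
max_vec = max_pool(tokens, mask)    # element-wise maximum over the two real tokens
print(cls_vec, mean_vec, max_vec)
```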

In [11]:
# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
 pooling_mode_mean_tokens=True,
 pooling_mode_cls_token=False,
 pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

2022-07-20 08:49:35 - Use pytorch device: cuda


Create the class instances needed for model training and validation. The baseline validation metric is cosine similarity, computed between the embedding vectors of the two sentences.
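
As a rough illustration of what this objective measures, the sketch below computes cosine similarity for two sentence pairs and the mean squared error against their normalized labels, which is the idea behind `CosineSimilarityLoss`. The embeddings, labels, and helper names are toy values chosen for the example, not output of the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_mse_loss(pairs, labels):
    """Mean squared error between predicted cosine similarity and the gold score."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)


# Toy embeddings: an identical pair and an orthogonal pair
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
labels = [1.0, 0.2]  # gold similarities already scaled to 0 ... 1

loss = cosine_mse_loss(pairs, labels)
print(loss)
```

The first pair matches its label exactly, so only the second pair (predicted 0.0 against a label of 0.2) contributes to the loss.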

In [12]:
train_dataset = SentencesDataset(train_samples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1) # 10% of train data for warm-up
logger.info("Warmup-steps: {}".format(warmup_steps))

2022-07-20 08:49:36 - Warmup-steps: 55
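
The logged value follows from the 545 iterations per epoch reported in the training output: ceil(545 × 1 × 0.1) = 55. The sketch below reproduces that arithmetic and shows how a learning-rate multiplier would evolve under a linear warmup followed by linear decay. That `model.fit` uses such a schedule by default is an assumption here, and `warmup_linear_factor` is a hypothetical helper, not a library function.

```python
import math


def warmup_linear_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step (assumed schedule)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


steps_per_epoch = 545  # taken from the training progress output
num_epochs = 1
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * 0.1)  # 10% of the steps

for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, round(warmup_linear_factor(step, warmup_steps, total_steps), 3))
```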


Run training. We do not use distributed training, but because the data volume is small, training completes within a few minutes.

### Start Training

In [13]:
# Train the model
model.fit(
 train_objectives=[(train_dataloader, train_loss)],
 evaluator=evaluator,
 epochs=num_epochs,
 evaluation_steps=int(len(train_dataloader)*0.5),
 warmup_steps=warmup_steps,
 output_path=model_save_path,
 use_amp=True
)

Epoch: 0%| | 0/1 [00:00<?, ?it/s]

Iteration: 0%| | 0/545 [00:00<?, ?it/s]

2022-07-20 08:50:25 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset in epoch 0 after 272 steps:
2022-07-20 08:50:28 - Cosine-Similarity :	Pearson: 0.8458	Spearman: 0.8462
2022-07-20 08:50:28 - Manhattan-Distance:	Pearson: 0.8333	Spearman: 0.8371
2022-07-20 08:50:28 - Euclidean-Distance:	Pearson: 0.8339	Spearman: 0.8380
2022-07-20 08:50:28 - Dot-Product-Similarity:	Pearson: 0.8095	Spearman: 0.8114
2022-07-20 08:50:28 - Save model to /home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/model/training_sts_sentence-transformers-xlm-r-100langs-bert-base-nli-stsb-mean-tokens-2022-07-20_08-43-44
2022-07-20 08:51:12 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-dev dataset in epoch 0 after 544 steps:
2022-07-20 08:51:15 - Cosine-Similarity :	Pearson: 0.8511	Spearman: 0.8513
2022-07-20 08:51:15 - Manhattan-Distance:	Pearson: 0.8378	Spearman: 0.8416
2022-07-20 08:51:15 - Euclidean-Distance:	Pearson: 0.8383	Spearman: 0.8425
2022-07-20 08:51:15

<br>

## 4. Evaluation
---
Once training is complete, compute metrics on the test dataset to assess predictive performance.

In [14]:
test_samples = []
logger.info("Read KorSTS test dataset") 
with open(f'{args.test_dir}/sts-test.tsv', 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["sentence1"] and row["sentence2"]:
            score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
            inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
            test_samples.append(inp_example)

2022-07-20 08:51:47 - Read KorSTS test dataset


In [15]:
##############################################################################
# Load the stored model and evaluate its performance on STS benchmark dataset
##############################################################################

model = SentenceTransformer(model_save_path)
test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path=model_save_path)

2022-07-20 08:51:48 - Load pretrained SentenceTransformer: /home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/model/training_sts_sentence-transformers-xlm-r-100langs-bert-base-nli-stsb-mean-tokens-2022-07-20_08-43-44
2022-07-20 08:51:51 - Use pytorch device: cuda
2022-07-20 08:51:51 - EmbeddingSimilarityEvaluator: Evaluating the model on sts-test dataset:
2022-07-20 08:51:53 - Cosine-Similarity :	Pearson: 0.8287	Spearman: 0.8310
2022-07-20 08:51:53 - Manhattan-Distance:	Pearson: 0.8242	Spearman: 0.8283
2022-07-20 08:51:53 - Euclidean-Distance:	Pearson: 0.8245	Spearman: 0.8287
2022-07-20 08:51:53 - Dot-Product-Similarity:	Pearson: 0.7619	Spearman: 0.7608


0.8309806357819561
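
The value returned by the evaluator matches the cosine-similarity Spearman figure in the log above (0.8310). As a minimal sketch of what that statistic measures, the code below computes Spearman's rank correlation as the Pearson correlation of ranks. The gold and predicted scores are invented, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0.0, 0.2, 0.5, 0.9, 1.0]     # normalized gold similarity scores (toy)
pred = [0.1, 0.05, 0.6, 0.7, 0.95]   # predicted cosine similarities (toy)
rho = spearman(gold, pred)
print(round(rho, 4))
```

Because only the ordering matters, Spearman correlation rewards a model that ranks sentence pairs correctly even when its raw similarity values are not calibrated to the gold scale.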

<br>

## 5. Applications
---

In [16]:
import random
import time
from operator import itemgetter 

def get_faiss_index(emb, data, dim=768):
    import faiss
    n_gpus = torch.cuda.device_count()

    if n_gpus == 0:
        # Create the Inner Product Index
        index = faiss.IndexFlatIP(dim)
    else:
        flat_config = []
        res = [faiss.StandardGpuResources() for i in range(n_gpus)]
        for i in range(n_gpus):
            cfg = faiss.GpuIndexFlatConfig()
            cfg.useFloat16 = False
            cfg.device = i
            flat_config.append(cfg)

        # Only the first GPU is used to hold the index
        index = faiss.GpuIndexFlatIP(res[0], dim, flat_config[0])

    # Wrap the index so each vector is stored under its position in `data`
    index = faiss.IndexIDMap(index)
    index.add_with_ids(emb, np.arange(len(data), dtype=np.int64))
    return index


def search(model, query, data, index, k=5, random_select=False, verbose=True):
    t = time.time()
    query_vector = model.encode(query)
    dists, top_k_inds = index.search(query_vector, k)
    if verbose:
        print('total time: {}'.format(time.time() - t))
    # For each query, look up the k retrieved positions in `data`
    results = [itemgetter(*ind)(data) for ind in top_k_inds]

    if random_select:
        # Return one randomly chosen item from the top-k hits of each query
        return [random.choice(r) for r in results]
    else:
        return results
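
The lookup step in `search` maps each row of retrieved indices to items of `data` with `itemgetter`. The sketch below reproduces that step on toy values and shows one edge case: with a single index, `itemgetter` returns the item itself rather than a tuple, so `random.choice` would then pick a character of the string. The list-comprehension alternative is a suggestion, not the notebook's code.

```python
from operator import itemgetter

data = ["ans0", "ans1", "ans2", "ans3"]  # toy stand-in for the answer list
top_k_inds = [[2, 0], [3, 1]]            # toy search output: k=2 hits for two queries

# Same lookup as in search(): one tuple of items per query
results = [itemgetter(*ind)(data) for ind in top_k_inds]
print(results)

# With k=1, itemgetter receives a single index and returns a bare string
single = itemgetter(*[2])(data)
print(repr(single))

# A list comprehension returns a list per query for any k
as_lists = [[data[i] for i in ind] for ind in [[2]]]
print(as_lists)
```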

### Chatbot

Chatbots are built in broadly two ways: 1) use a generative model to produce a creative answer to the question, or 2) narrow a large list of question-answer pairs down to the candidate questions that match the user's question, then find the answer that fits each candidate.
This lab shows a simple chatbot example using approach 2). Given a question text as input, we compute its embedding, compare it with the embeddings of every question in the list to find the candidates with the highest similarity, and then look up the answer matched to each candidate.

You could compute cosine similarity directly, but the Faiss library developed at Facebook (https://github.com/facebookresearch/faiss) does it much faster. Faiss provides GPU-accelerated implementations of product quantization, indexing embedding vectors while keeping information loss as small as possible. Note that the flat indexes used in this lab (`IndexFlatIP` and `GpuIndexFlatIP`) store the full vectors and perform exact inner-product search; because the indexed embeddings are L2-normalized, the inner product ranks results in the same order as cosine similarity.

References
- Billion-scale similarity search with GPUs: https://arxiv.org/pdf/1702.08734.pdf
- Product Quantizers for k-NN Tutorial Part 1: https://mccormickml.com/2017/10/13/product-quantizer-tutorial-part-1
- Product Quantizers for k-NN Tutorial Part 2: http://mccormickml.com/2017/10/22/product-quantizer-tutorial-part-2
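
The retrieval flow described above (embed the query, compare it with every indexed question, return the answers of the closest questions) can be sketched without Faiss as a brute-force inner-product search. All vectors and strings below are toy placeholders, and the helper names are hypothetical; a flat inner-product index performs the same comparison far more efficiently.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k_inner_product(query_vec, index_vecs, k):
    """Return (position, score) for the k indexed vectors with the largest inner product."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in index_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Toy 2-D "embeddings" for three indexed questions and their paired answers
answers = ["a_coffee", "a_sleepy", "a_play"]
question_vecs = [l2_normalize(v) for v in ([3.0, 1.0], [0.0, 2.0], [-1.0, 1.0])]

query_vec = l2_normalize([2.0, 1.0])
hits = top_k_inner_product(query_vec, question_vecs, k=2)
matched_answers = [answers[i] for i, _ in hits]
print(hits, matched_answers)
```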

#### Preparing chatbot dataset

In [17]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv", 
 filename=f"{args.train_dir}/chatbot-train.csv")
chatbot_df = pd.read_csv(f'{args.train_dir}/chatbot-train.csv')
chatbot_df.head()

| | Q | A | label |
|---|---|---|---|
| 0 | 12시 땡! | 하루가 또 가네요. | 0 |
| 1 | 1지망 학교 떨어졌어 | 위로해 드립니다. | 0 |
| 2 | 3박4일 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 3 | 3박4일 정도 놀러가고 싶다 | 여행은 언제나 좋죠. | 0 |
| 4 | PPL 심하네 | 눈살이 찌푸려지죠. | 0 |


#### Embedding

In [18]:
chatbot_q_data = chatbot_df['Q'].tolist()
chatbot_a_data = chatbot_df['A'].tolist()
chatbot_emb = model.encode(chatbot_q_data, normalize_embeddings=True, batch_size=64, show_progress_bar=True)

Batches: 0%| | 0/185 [00:00<?, ?it/s]

#### Indexing the dataset

In [19]:
chatbot_index = get_faiss_index(chatbot_emb, chatbot_q_data)

2022-07-20 08:52:14 - Loading faiss with AVX2 support.
2022-07-20 08:52:14 - Could not load library with AVX2 support due to:
ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'")
2022-07-20 08:52:14 - Loading faiss.
2022-07-20 08:52:14 - Successfully loaded faiss.


#### Inference
Run inference on a few sample questions.

In [36]:
query = ['커피 라떼 마시고 싶어']  # "I want to drink a caffe latte"
search(model, query, chatbot_a_data, chatbot_index, random_select=True)

Batches: 0%| | 0/1 [00:00<?, ?it/s]

total time: 0.03067159652709961


['좋은 시간 보내시길 바라요.']

In [37]:
query = ['너무 졸려', '놀고 싶어']  # "I'm so sleepy", "I want to play"
search(model, query, chatbot_a_data, chatbot_index, random_select=True)

Batches: 0%| | 0/1 [00:00<?, ?it/s]

total time: 0.0308225154876709


['낮잠을 잠깐 자도 괜찮아요.', '같이 놀아요.']

### Semantic Search (News)


Semantic search aims not only to match the keywords in a search query but also to capture the intent and contextual meaning of the words used in the search.
Semantic similarity search works like the chatbot example above: given a search query, we compute its embedding and compare it with the embeddings of every document in the list (for example, news titles/summaries or web page titles/summaries) to find the candidate documents with the highest similarity.

References
- Billion-scale semantic similarity search with FAISS+SBERT: https://towardsdatascience.com/billion-scale-semantic-similarity-search-with-faiss-sbert-c845614962e2
- Korean Contemporary Corpus of Written Sentences: http://nlp.kookmin.ac.kr/kcc/

#### Preparing news dataset

In [38]:
import progressbar

class MyProgressBar():
    def __init__(self):
        self.pbar = None

    def __call__(self, block_num, block_size, total_size):
        if not self.pbar:
            self.pbar = progressbar.ProgressBar(maxval=total_size)
            self.pbar.start()

        downloaded = block_num * block_size
        if downloaded < total_size:
            self.pbar.update(downloaded)
        else:
            self.pbar.finish()
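
`urllib.request.urlretrieve` calls its report hook with the block number, the block size, and the total size, which is what `MyProgressBar.__call__` receives. If the `progressbar` package is not available, a dependency-free hook along the following lines could stand in for it. This is only a sketch: `make_percent_hook` is a hypothetical helper, and the calls at the bottom simulate a download rather than performing one.

```python
def make_percent_hook(log):
    """Build a urlretrieve-style report hook that appends the percent downloaded to `log`."""
    def hook(block_num, block_size, total_size):
        if total_size <= 0:
            # The server did not report a size, so no percentage can be computed
            return
        # The last block may overshoot the file size, so clamp to the total
        downloaded = min(block_num * block_size, total_size)
        log.append(round(100.0 * downloaded / total_size, 1))
    return hook


progress = []
hook = make_percent_hook(progress)

# Simulate the calls urlretrieve would make for a 10,000-byte file read in 4,096-byte blocks
for block_num in range(4):
    hook(block_num, 4096, 10000)
print(progress)
```

Such a hook would be passed as the third argument of `urlretrieve`, in the same position as `MyProgressBar()` in the cell below.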

In [39]:
url = 'http://nlp.kookmin.ac.kr/kcc/KCCq28_Korean_sentences_EUCKR_v2.zip'
news_path = f'{args.train_dir}/KCCq28_Korean_sentences_EUCKR_v2.zip'
urllib.request.urlretrieve(url, news_path, MyProgressBar())

100% |########################################################################|


('/home/ec2-user/SageMaker/sm-kornlp-usecases/sentence-bert-finetuning/train/KCCq28_Korean_sentences_EUCKR_v2.zip',
 <http.client.HTTPMessage at 0x7f53a416d3a0>)

In [40]:
import zipfile
with zipfile.ZipFile(news_path, 'r') as zip_ref:
    zip_ref.extractall(train_dir)

In [41]:
!rm -rf {news_path}

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [42]:
news_data = []
# The corpus is EUC-KR encoded, so read it with the cp949 codec
with open(f'{args.train_dir}/KCCq28_Korean_sentences_EUCKR_v2.txt', 'rt', encoding='cp949') as f:
    for line in f:
        news_data.append(line.strip())

In [43]:
news_data = news_data[:10000] # For debug purpose

#### Embedding

In [44]:
news_emb = model.encode(news_data, normalize_embeddings=True, batch_size=64, show_progress_bar=True)

Batches: 0%| | 0/157 [00:00<?, ?it/s]

#### Indexing the dataset

In [45]:
news_index = get_faiss_index(news_emb, news_data)

#### Inference
Run inference on a few sample queries.

In [46]:
query = ['아이스 라떼', '미세먼지']  # "iced latte", "fine dust"
search(model, query, news_data, news_index, k=7, random_select=False)

Batches: 0%| | 0/1 [00:00<?, ?it/s]

total time: 0.031549692153930664


[('아이서플라이는 "아이폰4에 사용한 \'A4\'와 마찬가지로 \'다이 마크\'로 볼 때 삼성전자가 만든 것으로 보인다"고 했다.',
 '아이스크림을 서비스하고 있는 시공그룹의 박기석 회장은 "콘텐츠제공서비스 품질인증으로 아이스크림 서비스와 콘텐츠에 대해 신뢰도와 공신력을 인정받았다.',
 '엘리엇의 자회사인 블레이크 캐피탈과 포터 캐피탈은 홍보대행사를 통해 "삼성전자가 제시한 개략적인 주주가치 제고 방안이 향후 회사에 건설적인 첫 걸음이 될 것으로 보고 있다"고 밝혔다.',
 '검토 결과에 따라 결론을 내릴 것"이라고 말했다.',
 '조 대사는 "대미 의존도를 줄이려고 노력하고 있는 캐나다는 눈길을 아시아·태평양 지역으로 돌리고 있다"며 "특히 이 지역에서는 캐나다와 FTA를 맺은 국가가 없어 더욱 상징적 의미가 있다"고 말했다.',
 '박 대통령은 또 "적극적인 세일즈 외교 대통령으로 나서겠다"는 점을 분명히 했다.',
 '애플이 협력업체의 인력을 영입하는 이유에 대해 아이폰인 캐나다는 "애플이 GPU 자체