{
"cells": [
{
"cell_type": "markdown",
"id": "bccc028d",
"metadata": {},
"source": [
"# Sentiment Classification for Movie Review Dataset (Korean)\n",
"\n",
"본 핸즈온에서는 네이버 영화 리뷰에 대한 감정(0: 부정, 1: 긍정)을 요약한 네이버 영화 리뷰 데이터셋으로 AutoGluon 훈련을 수행합니다."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6b4f26a5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\n"
]
}
],
"source": [
"import os\n",
"import torch\n",
"import mxnet as mx\n",
"num_gpus = torch.cuda.device_count()\n",
"\n",
"if num_gpus == 0:\n",
" os.environ['AUTOGLUON_TEXT_TRAIN_WITHOUT_GPU'] = '1'\n",
"\n",
"print(num_gpus)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "74648979",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import warnings\n",
"import matplotlib.pyplot as plt\n",
"warnings.filterwarnings('ignore')\n",
"np.random.seed(123)"
]
},
{
"cell_type": "markdown",
"id": "e9b082b0",
"metadata": {},
"source": [
"
\n",
"\n",
"## 1. Data preparation and Training\n",
"\n",
"https://github.com/e9t/nsmc/ 에 공개된 네이버 영화 리뷰 데이터셋을 다운로드합니다.\n",
"훈련 데이터는 총 15만건이며, 테스트 데이터는 총 5만건입니다."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fe401129",
"metadata": {},
"outputs": [],
"source": [
"save_path = 'ag-02-sentiment-classifcation-kor'\n",
"!rm -rf $save_path input"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "17622d81",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2022-08-29 10:23:42-- https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 14628807 (14M) [text/plain]\n",
"Saving to: ‘./input/ratings_train.txt’\n",
"\n",
"100%[======================================>] 14,628,807 --.-K/s in 0.04s \n",
"\n",
"2022-08-29 10:23:42 (328 MB/s) - ‘./input/ratings_train.txt’ saved [14628807/14628807]\n",
"\n",
"--2022-08-29 10:23:42-- https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 4893335 (4.7M) [text/plain]\n",
"Saving to: ‘./input/ratings_test.txt’\n",
"\n",
"100%[======================================>] 4,893,335 --.-K/s in 0.02s \n",
"\n",
"2022-08-29 10:23:42 (237 MB/s) - ‘./input/ratings_test.txt’ saved [4893335/4893335]\n",
"\n"
]
}
],
"source": [
"!wget -nc https://raw.githubusercontent.com/e9t/nsmc/master/ratings_train.txt -P ./input/\n",
"!wget -nc https://raw.githubusercontent.com/e9t/nsmc/master/ratings_test.txt -P ./input/"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a0ebb893",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"train_df = pd.read_csv('./input/ratings_train.txt', header=0, delimiter='\\t')\n",
"test_df = pd.read_csv('./input/ratings_test.txt', header=0, delimiter='\\t')\n",
"train_df = train_df[['document', 'label']]\n",
"test_df = test_df[['document', 'label']]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9aaeb0b0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | document | \n", "label | \n", "
---|---|---|
16269 | \n", "가벼우면서도 보여줄건 다 보여주는 성실함. | \n", "1 | \n", "
140471 | \n", "겁나재밌어...ㅋㅋ아는내용그대로나와도보게되긴함..시청률이떨어지고있지만 트로트의연인 ... | \n", "1 | \n", "
78683 | \n", "젊은시절 이소룡의 광팬이 되어 개봉작마다 개봉 첫날에 영화보기 위해 줄서서 기다렸던... | \n", "1 | \n", "
2605 | \n", "최악...감동도없고 ....대놓고범죄..ㅡㅡ말도안돼는영화 | \n", "0 | \n", "
81156 | \n", "어머니에게 감사드려요ㅜㅜ | \n", "1 | \n", "