{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "1099b04396475b6a0143fa303da9fa44ad87b660"
   },
   "source": [
    "# Custom negativity intent recognizer(사용자 맞춤 부정적인 의도 인식기)\n",
    "\n",
    "### [(원본)](https://github.com/aws-samples/amazon-comprehend-custom-entity/blob/master/3-AWS-Comprehend-Negative-Intent-Recognizer.ipynb)\n",
    "\n",
    "\n",
    "이 노트북은 우리가 word2vec 모델로 생성한 사용자정의 키워드를 활용하는 Amazon Comprehend의 Custom entities에 대해 학습 데이터셋을 준비하는 방법에 대해 다룹니다. \n",
    "\n",
    "우리는 \"frustrated\"와 의미상으로 유사한 키워드를 기반으로 Custom negativity intent recognizer(사용자 맞춤 부정적인 의도 인식기)를 구현할 것입니다.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_kg_hide-input": true,
    "_kg_hide-output": true,
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [],
   "source": [
    "# library imports\n",
    "import re\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib\n",
    "import csv\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "54e810d8b9c1936c8569093badabc4d7b25ea881"
   },
   "source": [
    "이 예제에서는 통신사 도메인을 위해 전처리와 필터링을 거쳤던, 이전 노트북에서 만들었던 데이터넷을 재사용할 것입니다. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "_uuid": "9365c16e4481ec49f5c084f7c3b0cf50dd55047f",
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(32716, 1)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>@sprintcare is the worst customer service | @1...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>@sprintcare is the worst customer service | @1...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>@sprintcare is the worst customer service | @1...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>@115714 y’all lie about your “great” connectio...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>@115714 whenever I contact customer support, t...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text\n",
       "0  @sprintcare is the worst customer service | @1...\n",
       "1  @sprintcare is the worst customer service | @1...\n",
       "2  @sprintcare is the worst customer service | @1...\n",
       "3  @115714 y’all lie about your “great” connectio...\n",
       "4  @115714 whenever I contact customer support, t..."
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "colnames=['text'] \n",
    "tweets = pd.read_csv('./data/tweet_telco.csv',encoding='utf-8',names=colnames, header=None)\n",
    "print(tweets.shape)\n",
    "tweets.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "845eba8749f15e1e2b10aa43414f40860259f4e0"
   },
   "source": [
    "<a id='data-wrangling'></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "데이터셋을 생성하기 위해서 NEGATIVITY라는 새로운 클래스에 대한 Entity 목록을 제공해야 합니다.\n",
    "\n",
    "관련된 Entity들을 찾기 위해서 \"frustrated\"와 의미적으로 유사한 단어를 찾기 위한 커스텀 word2vec 모델을 사용할 것입니다. \n",
    "키워드를 생성하는 방법은 blazingtext_word2vec_telco_tweets.ipynb 노트북을 참조하시기 바랍니다 \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "negative_words = ['Really', 'cheated', 'annoyed', 'unhelpful', 'frustrated', 'upset' , 'unhappy', 'angry', 'badly', 'bad', 'surprised', 'sadly', 'dissatisfied', 'disappointed', 'disgusted']\n",
    "\n",
    "df_entity_list = pd.DataFrame(negative_words, columns=['Text'])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "클래스 레이블과 함께 컬럼을 하나 더 추가합니다. 이것은 Amazon Comprehend의 학습 데이터셋의 요구 사양입니다. \n",
    "\n",
    "자세한 내용은 여기를 참조하십시오. \n",
    "\n",
    "https://docs.aws.amazon.com/comprehend/latest/dg/cer-entity-list.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_entity_list['Type'] = 'NEGATIVE'\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "학습 파일을 생성합니다. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tweets['text'].to_csv('./data/raw_negative.csv', encoding='utf-8', index=False)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"@sprintcare is the worst customer service | @115712 Can you please send us a private message, so that I can gain further details about your account?\"\n",
      "@sprintcare is the worst customer service | @115712 I would love the chance to review the account and provide assistance.\n",
      "@sprintcare is the worst customer service | @115712 Hello! We never like our customers to feel like they are not valued.\n",
      "\"@115714 y’all lie about your “great” connection. 5 bars LTE, still won’t load something. Smh. | @115713 H there! We'd definitely like to work with you on this, how long have you been experiencing this issue? -AA\"\n",
      "\"@115714 whenever I contact customer support, they tell me I have shortcode enabled on my account, but I have never in the 4 years I've tried https://t.co/0G98RtNxPK | @115715 Please send me a private message so that I can send you the link to access your account. -FR\"\n",
      "\"@115913 @115911 just called in to switch from AT&amp;T. They wanted $75 to switch 3 phones! I said no way! Inconsistent messaging - shame! | @115912 @115913 We always want to be upfront and honest with you, David. Please send a DM our way. https://t.co/lH0SH5fy2m *MikeRice\"\n",
      "@TMobileHelp trying to redeem a free tuesday code and its not letting me telling me theres an error can i get help with this? | @115914 Yay for T-Mobile Tuesdays- lets's get you taken care of! Send a DM my way: https://t.co/jGtdfLsVbg *ErikaHoleman\n",
      "\"T-Mobile So Bs My Internet Stop Working For 3 Hours https://t.co/YISXwAHace | @115915 Well, we'd be happy to take a look at that and see what's going on, Jenny. DM us, we're here to keep you connected. *JoanO\"\n",
      "No. That opportunity has passed. Bad customer experience @115911 &amp; @343   Not going back ever again. #FAIL https://t.co/05pINkDFGd | @115916 Hey we are here and we want to help! If you have any questions feel free to shoot us a DM! *KatGrisham\n",
      "\"@TMobileHelp @1247 The @10568 code is not working when applied at checkout. Can you please look into this? https://t.co/SJvM7AW24t | @115918 We want this to work perfectly! Do you have at least 10 photos selected? If not, add another and try. DM w/ any questions! *ChrisBradstreet\"\n"
     ]
    }
   ],
   "source": [
    "!head ./data/raw_negative.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Entity 목록 파일을 생성합니다. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_entity_list.to_csv('./data/entity_negative_list.csv', encoding='utf-8', index=False)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text,Type\n",
      "Really,NEGATIVE\n",
      "cheated,NEGATIVE\n",
      "annoyed,NEGATIVE\n",
      "unhelpful,NEGATIVE\n",
      "frustrated,NEGATIVE\n",
      "upset,NEGATIVE\n",
      "unhappy,NEGATIVE\n",
      "angry,NEGATIVE\n",
      "badly,NEGATIVE\n"
     ]
    }
   ],
   "source": [
    "!head ./data/entity_negative_list.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "원본 telco tweet 데이터셋에서 테스트 파일을 생성합니다. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tweets['text'].tail(10000).to_csv('./data/telco_device_test.csv', encoding='utf-8', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 모델 학습하기\n",
    "\n",
    "콘솔에서 custom entity recognizer job 을 생성합니다. 상세한 방법은 첫 노트북인 1-AWS-Comprehend-Custom-Entities.ipynb 를 참조하세요.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom entity 모델 테스트하기\n",
    "\n",
    "Comprehend API를 호출하여 앞서 준비한 테스트 파일에서 Test job을 실행해봅니다. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"JobId\": \"94638d72b858902bed0dedaadaedb104\",\n",
      "    \"JobStatus\": \"SUBMITTED\"\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "!aws comprehend start-entities-detection-job \\\n",
    "     --entity-recognizer-arn \"arn:aws:comprehend:us-east-1:{account number}:entity-recognizer/cer-negative\" \\\n",
    "     --job-name Test \\\n",
    "     --data-access-role-arn \"arn:aws:iam::{account number}:role/service-role/AmazonComprehendServiceRole-cer-test\" \\\n",
    "     --language-code en \\\n",
    "     --input-data-config \"S3Uri=s3://{bucket name}/custom_entities/telco_device_test.csv\" \\\n",
    "     --output-data-config \"S3Uri=s3://{bucket name}/custom_entities/output/\" \\\n",
    "     --region \"us-east-1\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "39751a13337cd09b32588e2d0fc5f7e7817cca8b"
   },
   "source": [
    "결과는 --output-data-config 경로에 json 파일이 생성됩니다. 그리고 결과는 Glue와 Athena를 사용하여 확인할 수 있습니다. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernel_info": {
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "nteract": {
   "version": "0.15.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}