{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Build a Sentiment Analysis Model with Scikit Learn\n", "\n", "With Amazon SageMaker\n", "\n", "source: https://www.twilio.com/blog/2017/12/sentiment-analysis-scikit-learn.html\n", "\n", "First, import the necessary libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.externals import joblib\n", "import random\n", "import boto3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Declare and initialize the s3 resource. Define constant variables that hold the value of file paths. Replace ```your-bucket-name``` with the name of the bucket you created for this workshop." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.resource('s3')\n", "\n", "BUCKET_NAME = 'your-bucket-name'\n", "DATADIR='ServerlessAIWorkshop/SentimentAnalysis'\n", "MODEL_FILE = 'sentiment_analysis_model.pkl'\n", "VOCAB_FILE = 'vocabulary.pkl'\n", "\n", "MODEL_FILE_KEY = DATADIR + '/' + MODEL_FILE\n", "VOCAB_FILE_KEY = DATADIR + '/' + VOCAB_FILE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Declare array objects to store dataset. Data and label will be stored separately." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = []\n", "data_labels = [] #label will be either 'pos' or 'neg'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Open and read each dataset and append to the array objects created above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"./pos_tweets.txt\") as f:\n", " for i in f: \n", " data.append(i) \n", " data_labels.append('pos')\n", "\n", "with open(\"./neg_tweets.txt\") as f:\n", " for i in f: \n", " data.append(i)\n", " data_labels.append('neg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vectorize the tweets content and convert to a two-dimensional array of word counts. This conversion to array is necessary to split operation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer(\n", " analyzer = 'word', # exclude common words such as “the” or “and”\n", " lowercase = False,\n", ")\n", "features = vectorizer.fit_transform(data)\n", "features_nd = features.toarray() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split the training data to get an evaluation set.\n", "* X = dataset\n", "* y = label\n", "* train_ = training dataset\n", "* test_ = validation dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_size = 0.8\n", "test_size = 1.0 - train_size\n", "seed = 7\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " features_nd, \n", " data_labels,\n", " train_size=train_size,\n", " test_size=test_size,\n", " random_state=seed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a classifier using sckit learn's logistic regression." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "log_model = LogisticRegression() # using LogisticRegression class from Sciki-learn\n", "log_model = log_model.fit(X_train, y_train) # train the model using the training dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run inference using the test/validation dataset. Sklearn.metrics.accuracy_score calculates what percentage of tweets are classified correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred = log_model.predict(X_test)\n", "print(accuracy_score(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use scikit learns joblib module to first write the trained model and test dataset that is in memory into files (.pkl and .csv) on the notebook instance. Upload those artifacts to your S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "joblib.dump(log_model, MODEL_FILE) # Save Linear Regression coefficients for inference\n", "joblib.dump(vectorizer.vocabulary_, VOCAB_FILE) # Save vocabulary for inference\n", "\n", "# upload the trailed model/pickle file as well as vocabulary to S3 bucket\n", "s3.meta.client.upload_file(MODEL_FILE, BUCKET_NAME, MODEL_FILE_KEY)\n", "s3.meta.client.upload_file(VOCAB_FILE, BUCKET_NAME, VOCAB_FILE_KEY)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }