\n",
" \n",
"This notebook shows how to implement a basic spam classifier for SMS messages using the Amazon SageMaker built-in Linear Learner algorithm.\n",
"The idea is to use the SMS Spam Collection dataset, available at https://archive.ics.uci.edu/ml/datasets/sms+spam+collection, to train and deploy a binary classification model by leveraging the built-in Linear Learner algorithm available in Amazon SageMaker.\n",
"\n",
"Amazon SageMaker's Linear Learner algorithm extends typical linear models by training many models in parallel in a computationally efficient manner. Each model has a different set of hyperparameters, and the algorithm finds the set that optimizes a specific criterion. This can provide substantially more accurate models than typical linear algorithms at the same, or lower, cost."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get started by setting some configuration variables and getting the Amazon SageMaker session and the current execution role, using the Amazon SageMaker high-level SDK for Python."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker import get_execution_role\n",
"\n",
"bucket_name = ''\n",
"\n",
"role = get_execution_role()\n",
"bucket_key_prefix = 'sms-spam-classifier'\n",
"vocabulary_length = 9013\n",
"\n",
"print(role)"
]
},
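{
"cell_type": "markdown",
"metadata": {},
"source": [
"If `bucket_name` is left empty, one option is to fall back to the session's default Amazon SageMaker bucket. The cell below is an optional sketch of that fallback; it assumes you are fine with the auto-created default bucket rather than a specific one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"\n",
"# Fall back to the session default bucket if no bucket name was provided.\n",
"if not bucket_name:\n",
"    bucket_name = sagemaker.Session().default_bucket()\n",
"print(bucket_name)"
]
},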
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now download the spam collection dataset, unzip it and read the first 10 rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p dataset\n",
"!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip -o dataset/smsspamcollection.zip\n",
"!unzip -o dataset/smsspamcollection.zip -d dataset\n",
"!head -10 dataset/SMSSpamCollection"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now load the dataset into a Pandas dataframe and execute some data preparation.\n",
"More specifically we have to:\n",
"\n",
"- replace the target column values (ham/spam) with numeric values (0/1)\n",
"- tokenize the SMS messages and encode them based on word counts\n",
"- split into train and validation sets\n",
"- upload to an S3 bucket for training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import pickle\n",
"from sms_spam_classifier_utilities import one_hot_encode\n",
"from sms_spam_classifier_utilities import vectorize_sequences\n",
"\n",
"df = pd.read_csv('dataset/SMSSpamCollection', sep='\\t', header=None)\n",
"df[df.columns[0]] = df[df.columns[0]].map({'ham': 0, 'spam': 1})\n",
"\n",
"targets = df[df.columns[0]].values\n",
"messages = df[df.columns[1]].values\n",
"\n",
"# one hot encoding for each SMS message\n",
"one_hot_data = one_hot_encode(messages, vocabulary_length)\n",
"encoded_messages = vectorize_sequences(one_hot_data, vocabulary_length)\n",
"\n",
"df2 = pd.DataFrame(encoded_messages)\n",
"df2.insert(0, 'spam', targets)\n",
"\n",
"# Split into training and validation sets (80%/20% split)\n",
"split_index = int(np.ceil(df.shape[0] * 0.8))\n",
"train_set = df2[:split_index]\n",
"val_set = df2[split_index:]\n",
"\n",
"train_set.to_csv('dataset/sms_train_set.csv', header=False, index=False)\n",
"val_set.to_csv('dataset/sms_val_set.csv', header=False, index=False)"
]
},
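{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check before uploading, we can inspect the shape of each set and its class balance. This is just a quick verification sketch using the variables defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify the split sizes and the ham/spam distribution in each set.\n",
"print('Train set shape:', train_set.shape)\n",
"print('Validation set shape:', val_set.shape)\n",
"print('Train class counts:')\n",
"print(train_set['spam'].value_counts())\n",
"print('Validation class counts:')\n",
"print(val_set['spam'].value_counts())"
]
},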
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have to upload the two files to Amazon S3 so that they can be accessed by the Amazon SageMaker training cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"\n",
"s3 = boto3.resource('s3')\n",
"target_bucket = s3.Bucket(bucket_name)\n",
"\n",
"with open('dataset/sms_train_set.csv', 'rb') as data:\n",
" target_bucket.upload_fileobj(data, '{0}/train/sms_train_set.csv'.format(bucket_key_prefix))\n",
" \n",
"with open('dataset/sms_val_set.csv', 'rb') as data:\n",
" target_bucket.upload_fileobj(data, '{0}/val/sms_val_set.csv'.format(bucket_key_prefix))"
]
},
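{
"cell_type": "markdown",
"metadata": {},
"source": [
"For later reference, the S3 URIs of the uploaded channels and an output path can be built from the bucket name and key prefix. This is a small convenience sketch; the variable names below are illustrative, and the URIs simply follow the s3://bucket/key convention used by the uploads above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Build the S3 URIs of the train and validation channels and an output path.\n",
"s3_train_data = 's3://{0}/{1}/train/'.format(bucket_name, bucket_key_prefix)\n",
"s3_val_data = 's3://{0}/{1}/val/'.format(bucket_name, bucket_key_prefix)\n",
"output_location = 's3://{0}/{1}/output'.format(bucket_name, bucket_key_prefix)\n",
"\n",
"print(s3_train_data)\n",
"print(s3_val_data)\n",
"print(output_location)"
]
},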
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the model with Linear Learner\n",
"\n",
"We are now ready to run the training using the Amazon SageMaker Linear Learner built-in algorithm. First, let's get the Linear Learner container."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"\n",
"from sagemaker.amazon.amazon_estimator import get_image_uri\n",
"container = get_image_uri(boto3.Session().region_name, 'linear-learner', repo_version=\"latest\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll configure the estimator, making sure to pass in the necessary hyperparameters. Notice:\n",
"\n",
"- feature_dim is set to the dimension of the vocabulary.\n",
"- predictor_type is set to 'binary_classifier' since we are trying to predict whether an SMS message is spam or not.