{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>SMS Spam Classifier</h1>\n",
    "<br />\n",
    "This notebook shows how to implement a basic spam classifier for SMS messages using Apache MXNet as deep learning framework.\n",
    "The idea is to use the SMS spam collection dataset available at <a href=\"https://archive.ics.uci.edu/ml/datasets/sms+spam+collection\">https://archive.ics.uci.edu/ml/datasets/sms+spam+collection</a> to train and deploy a neural network model by leveraging on the built-in open-source container for Apache MXNet available in Amazon SageMaker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get started by setting some configuration variables and getting the Amazon SageMaker session and the current execution role, using the Amazon SageMaker high-level SDK for Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker import get_execution_role\n",
    "\n",
    "bucket_name = '<bucket-name>'\n",
    "\n",
    "role = get_execution_role()\n",
    "bucket_key_prefix = 'sms-spam-classifier'\n",
    "vocabulary_length = 9013\n",
    "\n",
    "print(role)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now download the spam collection dataset, unzip it and read the first 10 rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p dataset\n",
    "!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip -o dataset/smsspamcollection.zip\n",
    "!unzip -o dataset/smsspamcollection.zip -d dataset\n",
    "!head -10 dataset/SMSSpamCollection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now load the dataset into a Pandas dataframe and execute some data preparation.\n",
    "More specifically we have to:\n",
    "<ul>\n",
    "    <li>replace the target column values (ham/spam) with numeric values (0/1)</li>\n",
    "    <li>tokenize the sms messages and encode based on word counts</li>\n",
    "    <li>split into train and test sets</li>\n",
    "    <li>upload to a S3 bucket for training</li>\n",
    "</ul>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import pickle\n",
    "from sms_spam_classifier_utilities import one_hot_encode\n",
    "from sms_spam_classifier_utilities import vectorize_sequences\n",
    "\n",
    "df = pd.read_csv('dataset/SMSSpamCollection', sep='\\t', header=None)\n",
    "df[df.columns[0]] = df[df.columns[0]].map({'ham': 0, 'spam': 1})\n",
    "\n",
    "targets = df[df.columns[0]].values\n",
    "messages = df[df.columns[1]].values\n",
    "\n",
    "# one hot encoding for each SMS message\n",
    "one_hot_data = one_hot_encode(messages, vocabulary_length)\n",
    "encoded_messages = vectorize_sequences(one_hot_data, vocabulary_length)\n",
    "\n",
    "df2 = pd.DataFrame(encoded_messages)\n",
    "df2.insert(0, 'spam', targets)\n",
    "\n",
    "# Split into training and validation sets (80%/20% split)\n",
    "split_index = int(np.ceil(df.shape[0] * 0.8))\n",
    "train_set = df2[:split_index]\n",
    "val_set = df2[split_index:]\n",
    "\n",
    "train_set.to_csv('dataset/sms_train_set.gz', header=False, index=False, compression='gzip')\n",
    "val_set.to_csv('dataset/sms_val_set.gz', header=False, index=False, compression='gzip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have to upload the two files back to Amazon S3 in order to be accessed by the Amazon SageMaker training cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "\n",
    "s3 = boto3.resource('s3')\n",
    "target_bucket = s3.Bucket(bucket_name)\n",
    "\n",
    "with open('dataset/sms_train_set.gz', 'rb') as data:\n",
    "    target_bucket.upload_fileobj(data, '{0}/train/sms_train_set.gz'.format(bucket_key_prefix))\n",
    "    \n",
    "with open('dataset/sms_val_set.gz', 'rb') as data:\n",
    "    target_bucket.upload_fileobj(data, '{0}/val/sms_val_set.gz'.format(bucket_key_prefix))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Training the model with MXNet</h2>\n",
    "\n",
    "We are now ready to run the training using the Amazon SageMaker MXNet built-in container. First let's have a look at the script defining our neural network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat 'sms_spam_classifier_mxnet_script.py'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to run the training using the MXNet estimator object of the SageMaker Python SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.mxnet import MXNet\n",
    "\n",
    "output_path = 's3://{0}/{1}/output'.format(bucket_name, bucket_key_prefix)\n",
    "code_location = 's3://{0}/{1}/code'.format(bucket_name, bucket_key_prefix)\n",
    "\n",
    "m = MXNet('sms_spam_classifier_mxnet_script.py',\n",
    "          role=role,\n",
    "          train_instance_count=1,\n",
    "          train_instance_type='ml.c5.2xlarge',\n",
    "          output_path=output_path,\n",
    "          base_job_name='sms-spam-classifier-mxnet',\n",
    "          framework_version=1.2,\n",
    "          code_location = code_location,\n",
    "          hyperparameters={'batch_size': 100,\n",
    "                         'epochs': 20,\n",
    "                         'learning_rate': 0.01})\n",
    "\n",
    "inputs = {'train': 's3://{0}/{1}/train/'.format(bucket_name, bucket_key_prefix),\n",
    " 'val': 's3://{0}/{1}/val/'.format(bucket_name, bucket_key_prefix)}\n",
    "\n",
    "m.fit(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h3><span style=\"color:red\">THE FOLLOWING STEPS ARE NOT MANDATORY IF YOU PLAN TO DEPLOY TO AWS LAMBDA AND ARE INCLUDED IN THIS NOTEBOOK FOR EDUCATIONAL PURPOSES.</span></h3>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Deploying the model</h2>\n",
    "\n",
    "Let's deploy the trained model to a real-time inference endpoint fully-managed by Amazon SageMaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mxnet_pred = m.deploy(initial_instance_count=1,\n",
    "                      instance_type='ml.m5.large')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Executing Inferences</h2>\n",
    "\n",
    "Now, we can invoke the Amazon SageMaker real-time endpoint to execute some inferences, by providing SMS messages and getting the predicted label (SPAM = 1, HAM = 0) and the related probability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.mxnet.model import MXNetPredictor\n",
    "from sms_spam_classifier_utilities import one_hot_encode\n",
    "from sms_spam_classifier_utilities import vectorize_sequences\n",
    "\n",
    "# Uncomment the following line to connect to an existing endpoint.\n",
    "# mxnet_pred = MXNetPredictor('<endpoint_name>')\n",
    "\n",
    "test_messages = [\"FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop\"]\n",
    "one_hot_test_messages = one_hot_encode(test_messages, vocabulary_length)\n",
    "encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_length)\n",
    "\n",
    "result = mxnet_pred.predict(encoded_test_messages)\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Cleaning-up</h2>\n",
    "\n",
    "When done, we can delete the Amazon SageMaker real-time inference endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mxnet_pred.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}