{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![MLU Logo](../data/MLU_Logo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a name=\"0\">Machine Learning Accelerator - Natural Language Processing - Lecture 2</a>\n",
    "\n",
    "## Sagemaker built-in Training and Deployment with LinearLearner\n",
    "\n",
    "In this notebook, we use Sagemaker's built-in machine learning model __LinearLearner__ to predict the __isPositive__ field of our review dataset.\n",
    "\n",
    "Overall dataset schema:\n",
    "* __reviewText:__ Text of the review\n",
    "* __summary:__ Summary of the review\n",
    "* __verified:__ Whether the purchase was verified (True or False)\n",
    "* __time:__ UNIX timestamp for the review\n",
    "* __log_votes:__ Logarithm-adjusted votes log(1+votes)\n",
    "* __isPositive:__ Whether the review is positive or negative (1 or 0)\n",
    "\n",
    "__Notes on AWS SageMaker__\n",
    "\n",
    "* Fully managed machine learning service, to quickly and easily get you started on building and training machine learning models - we have seen that already! Integrated Jupyter notebook instances, with easy access to data sources for exploration and analysis, abstract away many of the messy infrastructural details needed for hands-on ML - you don't have to manage servers, install libraries/dependencies, etc.!\n",
    "\n",
    "\n",
    "* Apart from easily building end-to-end machine learning models in SageMaker notebooks, like we did so far, SageMaker also provides a few __build-in common machine learning algorithms__ (check \"SageMaker Examples\" from your SageMaker instance top menu for a complete updated list) that are optimized to run efficiently against extremely large data in a distributed environment. __LinearLearner__ build-in algorithm in SageMaker is extremely fast at inference and can be trained at scale, in mini-batch fashion over GPU(s). The trained model can then be directly deployed into a production-ready hosted environment for easy access at inference. \n",
    "\n",
    "We will follow these steps:\n",
    "\n",
    "1. <a href=\"#1\">Read the dataset</a>\n",
    "2. <a href=\"#2\">Exploratory Data Analysis</a>\n",
    "3. <a href=\"#3\">Text Processing: Stop words removal and stemming</a>\n",
    "4. <a href=\"#4\">Training - Validation - Test Split</a>\n",
    "5. <a href=\"#5\">Data processing with Pipeline and ColumnTransform</a>\n",
    "6. <a href=\"#6\">Train a classifier with SageMaker build-in algorithm</a>\n",
    "7. <a href=\"#7\">Model evaluation</a>\n",
    "8. <a href=\"#8\">Deploy the model to an endpoint</a>\n",
    "9. <a href=\"#9\">Test the enpoint</a>\n",
    "10. <a href=\"#10\">Clean up model artifacts</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q -r ../requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. <a name=\"1\">Reading the dataset</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We will use the __pandas__ library to read our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of the dataset is: (70000, 6)\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')\n",
    "\n",
    "print('The shape of the dataset is:', df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the first five rows in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewText</th>\n",
       "      <th>summary</th>\n",
       "      <th>verified</th>\n",
       "      <th>time</th>\n",
       "      <th>log_votes</th>\n",
       "      <th>isPositive</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO...</td>\n",
       "      <td>IDEAL FOR BEGINNER!</td>\n",
       "      <td>True</td>\n",
       "      <td>1361836800</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>unable to open or use</td>\n",
       "      <td>Two Stars</td>\n",
       "      <td>True</td>\n",
       "      <td>1452643200</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Waste of money!!! It wouldn't load to my system.</td>\n",
       "      <td>Dont buy it!</td>\n",
       "      <td>True</td>\n",
       "      <td>1433289600</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>I attempted to install this OS on two differen...</td>\n",
       "      <td>I attempted to install this OS on two differen...</td>\n",
       "      <td>True</td>\n",
       "      <td>1518912000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>I've spent 14 fruitless hours over the past tw...</td>\n",
       "      <td>Do NOT Download.</td>\n",
       "      <td>True</td>\n",
       "      <td>1441929600</td>\n",
       "      <td>1.098612</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                          reviewText  \\\n",
       "0  PURCHASED FOR YOUNGSTER WHO\\nINHERITED MY \"TOO...   \n",
       "1                              unable to open or use   \n",
       "2   Waste of money!!! It wouldn't load to my system.   \n",
       "3  I attempted to install this OS on two differen...   \n",
       "4  I've spent 14 fruitless hours over the past tw...   \n",
       "\n",
       "                                             summary  verified        time  \\\n",
       "0                                IDEAL FOR BEGINNER!      True  1361836800   \n",
       "1                                          Two Stars      True  1452643200   \n",
       "2                                       Dont buy it!      True  1433289600   \n",
       "3  I attempted to install this OS on two differen...      True  1518912000   \n",
       "4                                   Do NOT Download.      True  1441929600   \n",
       "\n",
       "   log_votes  isPositive  \n",
       "0   0.000000         1.0  \n",
       "1   0.000000         0.0  \n",
       "2   0.000000         0.0  \n",
       "3   0.000000         0.0  \n",
       "4   1.098612         0.0  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. <a name=\"2\">Exploratory Data Analysis</a>\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the target distribution for our datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0    43692\n",
       "0.0    26308\n",
       "Name: isPositive, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"isPositive\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checking the number of missing values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "reviewText    11\n",
      "summary       14\n",
      "verified       0\n",
      "time           0\n",
      "log_votes      0\n",
      "isPositive     0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df.isna().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have missing values in our text fields."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. <a name=\"3\">Text Processing: Stop words removal and stemming</a>\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n",
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /home/ec2-user/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Install the library and functions\n",
    "import nltk\n",
    "\n",
    "nltk.download('punkt')\n",
    "nltk.download('stopwords')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will create the stop word removal and text cleaning processes below. NLTK library provides a list of common stop words. We will use the list, but remove some of the words from that list. It is because those words are actually useful to understand the sentiment in the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk, re\n",
    "from nltk.corpus import stopwords\n",
    "from nltk.stem import SnowballStemmer\n",
    "from nltk.tokenize import word_tokenize\n",
    "\n",
    "# Let's get a list of stop words from the NLTK library\n",
    "stop = stopwords.words('english')\n",
    "\n",
    "# These words are important for our problem. We don't want to remove them.\n",
    "excluding = ['against', 'not', 'don', \"don't\",'ain', 'aren', \"aren't\", 'couldn', \"couldn't\",\n",
    "             'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", \n",
    "             'haven', \"haven't\", 'isn', \"isn't\", 'mightn', \"mightn't\", 'mustn', \"mustn't\",\n",
    "             'needn', \"needn't\",'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \n",
    "             \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n",
    "\n",
    "# New stop word list\n",
    "stop_words = [word for word in stop if word not in excluding]\n",
    "\n",
    "snow = SnowballStemmer('english')\n",
    "\n",
    "def process_text(texts): \n",
    "    final_text_list=[]\n",
    "    for sent in texts:\n",
    "        \n",
    "        # Check if the sentence is a missing value\n",
    "        if isinstance(sent, str) == False:\n",
    "            sent = \"\"\n",
    "            \n",
    "        filtered_sentence=[]\n",
    "        \n",
    "        sent = sent.lower() # Lowercase \n",
    "        sent = sent.strip() # Remove leading/trailing whitespace\n",
    "        sent = re.sub('\\s+', ' ', sent) # Remove extra space and tabs\n",
    "        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markups:\n",
    "        \n",
    "        for w in word_tokenize(sent):\n",
    "            # We are applying some custom filtering here, feel free to try different things\n",
    "            # Check if it is not numeric and its length>2 and not in stop words\n",
    "            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  \n",
    "                # Stem and add to filtered list\n",
    "                filtered_sentence.append(snow.stem(w))\n",
    "        final_string = \" \".join(filtered_sentence) #final string of cleaned words\n",
    " \n",
    "        final_text_list.append(final_string)\n",
    "        \n",
    "    return final_text_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. <a name=\"4\">Training - Validation - Test Split</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Let's split our dataset into training (80%), validation (10%) and test (10%) using sklearn's [__train_test_split()__](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_train, X_val, y_train, y_val = train_test_split(df[[\"reviewText\", \"summary\", \"time\", \"log_votes\"]],\n",
    "                                                  df[\"isPositive\"],\n",
    "                                                  test_size=0.20,\n",
    "                                                  shuffle=True,\n",
    "                                                  random_state=324\n",
    "                                                 )\n",
    "\n",
    "X_val, X_test, y_val, y_test = train_test_split(X_val,\n",
    "                                                y_val,\n",
    "                                                test_size=0.5,\n",
    "                                                shuffle=True,\n",
    "                                                random_state=324)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing the reviewText fields\n",
      "Processing the summary fields\n"
     ]
    }
   ],
   "source": [
    "print(\"Processing the reviewText fields\")\n",
    "X_train[\"reviewText\"] = process_text(X_train[\"reviewText\"].tolist())\n",
    "X_val[\"reviewText\"] = process_text(X_val[\"reviewText\"].tolist())\n",
    "X_test[\"reviewText\"] = process_text(X_test[\"reviewText\"].tolist())\n",
    "\n",
    "print(\"Processing the summary fields\")\n",
    "X_train[\"summary\"] = process_text(X_train[\"summary\"].tolist())\n",
    "X_val[\"summary\"] = process_text(X_val[\"summary\"].tolist())\n",
    "X_test[\"summary\"] = process_text(X_test[\"summary\"].tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our process_text() method in section 3 uses empty string for missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. <a name=\"5\">Data processing with Pipeline and ColumnTransform</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "In the previous examples, we have seen how to use pipeline to prepare a data field for our machine learning model. This time, we will focus on multiple fields: numeric and text fields. \n",
    "\n",
    "   * For the numerical features pipeline, the __numerical_processor__ below, we use a MinMaxScaler (don't have to scale features when using Decision Trees, but it's a good idea to see how to use more data transforms). If different processing is desired for different numerical features, different pipelines should be built - just like shown below for the two text features.\n",
    "   * For the text features pipeline, the __text_processor__ below, we use CountVectorizer() for the text fields.\n",
    "   \n",
    "The selective preparations of the dataset features are then put together into a collective ColumnTransformer, to be finally used in a Pipeline along with an estimator. This ensures that the transforms are performed automatically on the raw data when fitting the model and when making predictions, such as when evaluating the model on a validation dataset via cross-validation or making predictions on a test dataset in the future."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Grab model features/inputs and target/output\n",
    "numerical_features = ['time',\n",
    "                      'log_votes']\n",
    "\n",
    "text_features = ['summary',\n",
    "                 'reviewText']\n",
    "\n",
    "model_features = numerical_features + text_features\n",
    "model_target = 'isPositive'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Datasets shapes before processing:  (56000, 4) (7000, 4) (7000, 4)\n",
      "Datasets shapes after processing:  (56000, 202) (7000, 202) (7000, 202)\n"
     ]
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.compose import ColumnTransformer\n",
    "\n",
    "### COLUMN_TRANSFORMER ###\n",
    "##########################\n",
    "\n",
    "# Preprocess the numerical features\n",
    "numerical_processor = Pipeline([\n",
    "    ('num_imputer', SimpleImputer(strategy='mean')),\n",
    "    ('num_scaler', MinMaxScaler()) \n",
    "                                ])\n",
    "# Preprocess 1st text feature\n",
    "text_processor_0 = Pipeline([\n",
    "    ('text_vect_0', CountVectorizer(binary=True, max_features=50))\n",
    "                                ])\n",
    "\n",
    "# Preprocess 2nd text feature (larger vocabulary)\n",
    "text_precessor_1 = Pipeline([\n",
    "    ('text_vect_1', CountVectorizer(binary=True, max_features=150))\n",
    "                                ])\n",
    "\n",
    "# Combine all data preprocessors from above (add more, if you choose to define more!)\n",
    "# For each processor/step specify: a name, the actual process, and finally the features to be processed\n",
    "data_preprocessor = ColumnTransformer([\n",
    "    ('numerical_pre', numerical_processor, numerical_features),\n",
    "    ('text_pre_0', text_processor_0, text_features[0]),\n",
    "    ('text_pre_1', text_precessor_1, text_features[1])\n",
    "                                    ]) \n",
    "\n",
    "### DATA PREPROCESSING ###\n",
    "##########################\n",
    "\n",
    "print('Datasets shapes before processing: ', X_train.shape, X_val.shape, X_test.shape)\n",
    "\n",
    "X_train = data_preprocessor.fit_transform(X_train).toarray()\n",
    "X_val = data_preprocessor.transform(X_val).toarray()\n",
    "X_test = data_preprocessor.transform(X_test).toarray()\n",
    "\n",
    "print('Datasets shapes after processing: ', X_train.shape, X_val.shape, X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. <a name=\"6\">Train a classifier with SageMaker build-in algorithm</a>\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will call the Sagemaker `LinearLearner()` below. \n",
    "* __Compute power:__ We will use `instance_count` and `instance_type` parameters. This example uses `ml.m4.xlarge` resource for training. We can change the instance type for our needs (For example GPUs for neural networks). \n",
    "* __Model type:__ `predictor_type` is set to __`binary_classifier`__, as we have a binary classification problem here; __`multiclass_classifier`__ could be used if there are 3 or more classes involved, or __'regressor'__ for a regression problem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "# Call the LinearLearner estimator object\n",
    "linear_classifier = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),\n",
    "                                            instance_count=1,\n",
    "                                            instance_type='ml.m4.xlarge',\n",
    "                                            predictor_type='binary_classifier')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are using the `record_set()` function of our binary_estimator to set the training, validation, test parts of the estimator. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_records = linear_classifier.record_set(X_train.astype(\"float32\"),\n",
    "                                            y_train.values.astype(\"float32\"),\n",
    "                                            channel='train')\n",
    "val_records = linear_classifier.record_set(X_val.astype(\"float32\"),\n",
    "                                          y_val.values.astype(\"float32\"),\n",
    "                                          channel='validation')\n",
    "test_records = linear_classifier.record_set(X_test.astype(\"float32\"),\n",
    "                                           y_test.values.astype(\"float32\"),\n",
    "                                           channel='test')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`fit()` function applies a distributed version of the Stochastic Gradient Descent (SGD) algorithm and we are sending the data to it. We disabled logs with `logs=False`. You can remove that parameter to see more details about the process. __This process takes about 3-4 minutes on a ml.m4.xlarge instance.__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.\n",
      "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n",
      "INFO:sagemaker:Creating training-job with name: linear-learner-2023-01-26-00-08-27-344\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "2023-01-26 00:08:27 Starting - Starting the training job....\n",
      "2023-01-26 00:08:52 Starting - Preparing the instances for training..................\n",
      "2023-01-26 00:10:25 Downloading - Downloading input data.....\n",
      "2023-01-26 00:10:55 Training - Downloading the training image............\n",
      "2023-01-26 00:12:01 Training - Training image download completed. Training in progress.............\n",
      "2023-01-26 00:13:06 Uploading - Uploading generated training model.\n",
      "2023-01-26 00:13:17 Completed - Training job completed\n",
      "CPU times: user 258 ms, sys: 12.2 ms, total: 270 ms\n",
      "Wall time: 4min 53s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "linear_classifier.fit([train_records,\n",
    "                       val_records,\n",
    "                       test_records],\n",
    "                      logs=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. <a name=\"7\">Model Evaluation</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "We can use Sagemaker analytics to get some performance metrics of our choice on the test set. This doesn't require us to deploy our model. Since this is a binary classfication problem, we can check the accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>timestamp</th>\n",
       "      <th>metric_name</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>test:binary_classification_accuracy</td>\n",
       "      <td>0.851</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   timestamp                          metric_name  value\n",
       "0        0.0  test:binary_classification_accuracy  0.851"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sagemaker.analytics.TrainingJobAnalytics(linear_classifier._current_job_name, \n",
    "                                         metric_names = ['test:binary_classification_accuracy']\n",
    "                                        ).dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. <a name=\"8\">Deploy the model to an endpoint</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "In the last part of this exercise, we will deploy our model to another instance of our choice. This will allow us to use this model in production environment. Deployed endpoints can be used with other AWS Services such as Lambda and API Gateway. A nice walkthrough is available here: https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/ if you are interested.\n",
    "\n",
    "Run the following cell to deploy the model. We can use different instance types such as: `ml.t2.medium`, `ml.c4.xlarge` etc. __This will take some time to complete (Approximately 7-8 minutes).__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.\n",
      "INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.\n",
      "INFO:sagemaker:Creating model with name: linear-learner-2023-01-26-00-13-35-287\n",
      "INFO:sagemaker:Creating endpoint-config with name NLPLinearLearnerEndpoint2023\n",
      "INFO:sagemaker:Creating endpoint with name NLPLinearLearnerEndpoint2023\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-----------------!CPU times: user 356 ms, sys: 12.4 ms, total: 369 ms\n",
      "Wall time: 8min 32s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "linear_classifier_predictor = linear_classifier.deploy(initial_instance_count = 1,\n",
    "                                                       instance_type = 'ml.t2.medium',\n",
    "                                                       endpoint_name = 'NLPLinearLearnerEndpoint2023'\n",
    "                                                      )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. <a name=\"9\">Test the endpoint</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Let's use the deployed endpoint. We will send our test data and get predictions of it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0.791047990322113, 0.19614160060882568, 0.9611192941665649, 0.6507992744445801, 0.6314922571182251, 0.13109198212623596, 0.7686936855316162, 0.03195519745349884, 0.9872755408287048, 0.9839196801185608, 0.7162235975265503, 0.5463312268257141, 0.8145678043365479, 0.708134651184082, 0.8952619433403015, 0.9754332304000854, 0.6256309747695923, 0.2471412867307663, 0.49016159772872925, 0.9344954490661621, 0.6512935757637024, 0.794103741645813, 0.1836925595998764, 0.7437755465507507, 0.280699759721756, 0.47037971019744873, 0.9491460919380188, 0.09314309805631638, 0.9856282472610474, 0.17901821434497833, 0.8227737545967102, 0.8395636677742004, 0.6778838038444519, 0.9942905902862549, 0.09969635307788849, 0.9922183752059937, 0.41878095269203186, 0.8897599577903748, 0.06615863740444183, 0.26962822675704956, 0.9754551649093628, 0.2753585875034332, 0.6342624425888062, 0.7751972675323486, 0.6481721997261047, 0.16822606325149536, 0.9967265129089355, 0.9503831267356873, 0.989985466003418, 0.781361997127533, 0.9898998141288757, 0.7649602890014648, 0.018655162304639816, 0.04944882169365883, 0.2901366949081421, 0.9671338796615601, 0.6100179553031921, 0.6880717873573303, 0.30723679065704346, 0.9377214908599854, 0.4265963137149811, 0.959131121635437, 0.850977897644043, 0.7168200016021729, 0.16721323132514954, 0.9635111093521118, 0.28084564208984375, 0.9946834444999695, 0.9714354276657104, 0.9908344745635986, 0.9964487552642822, 0.1287352442741394, 0.9932622313499451, 0.22601152956485748, 0.2694837152957916, 0.2592373192310333, 0.9884997010231018, 0.9934214949607849, 0.893867552280426, 0.17719347774982452, 0.049117811024188995, 0.02197575382888317, 0.9254707098007202, 0.9469350576400757, 0.18907281756401062, 0.973484992980957, 0.09400668740272522, 0.8299244046211243, 0.5332624912261963, 0.98997962474823, 0.9918038845062256, 0.37470757961273193, 0.033179402351379395, 0.7867936491966248, 0.6860332489013672, 0.9821308255195618, 0.503849446773529, 0.19375458359718323, 0.9776981472969055, 0.9844034910202026, 0.7079395651817322, 0.46319305896759033, 0.044853806495666504, 0.42751774191856384, 0.4677185118198395, 0.1499156653881073, 0.9472523331642151, 0.03206195682287216, 0.011945920065045357, 0.11832200735807419, 0.047576334327459335, 0.2012472152709961, 0.07877318561077118, 0.6118062138557434, 0.9520149230957031, 0.9426130056381226, 0.42952412366867065, 0.44052594900131226, 0.5875140428543091, 0.09546150267124176, 0.03731287270784378, 0.9985800981521606, 0.00016531682922504842, 0.9487006068229675, 0.048414502292871475, 0.7020835280418396, 0.5137953758239746, 0.9973631501197815, 0.7799039483070374, 0.0027354182675480843, 0.35174018144607544, 0.08463368564844131, 0.003371263388544321, 0.9716757535934448, 0.3322908580303192, 0.022698402404785156, 0.9472607970237732, 0.9801193475723267, 0.0404922254383564, 0.9814785718917847, 0.9817618131637573, 0.9446848630905151, 0.998234748840332, 0.30699726939201355, 0.9880892634391785, 0.9910349249839783, 0.9916506409645081, 0.9895434379577637, 0.9930940866470337, 0.6843202710151672, 0.9183700680732727, 0.9934170246124268, 0.8252106308937073, 0.8670921325683594, 0.03505592793226242, 0.6565587520599365, 0.0977732241153717, 0.10815082490444183, 0.6756683588027954, 0.8446029424667358, 0.8703914284706116, 0.9799814820289612, 0.9176944494247437, 0.8674671053886414, 0.6360636353492737, 0.33141395449638367, 0.8778736591339111, 0.3178897500038147, 0.5015454888343811, 0.8368360996246338, 0.38663458824157715, 0.6378363370895386, 0.8751917481422424, 0.49815431237220764, 0.25496000051498413, 0.005041236523538828, 0.7051519751548767, 0.9866764545440674, 0.9946999549865723, 0.6996867060661316, 0.10383560508489609, 0.48741498589515686, 0.765730619430542, 0.13150960206985474, 0.8710392713546753, 0.755275309085846, 0.8762190341949463, 0.7865755558013916, 0.03102555125951767, 0.22996045649051666, 0.0184261966496706, 0.9510031938552856, 0.1481030434370041, 0.0659005343914032, 0.9704026579856873, 0.42411676049232483, 0.9843926429748535, 0.04635646939277649, 0.9989562034606934, 0.7465599775314331, 0.2666005492210388, 0.5612906217575073, 0.8794891238212585, 0.9579881429672241, 0.9896005988121033, 0.9949610233306885, 0.9930757284164429, 0.9050391912460327, 0.1327841430902481, 0.5133925080299377, 0.044841594994068146, 0.9936574101448059, 0.015146369114518166, 0.00020965468138456345, 0.9903822541236877, 0.9591496586799622, 0.8376017808914185, 0.9503879547119141, 0.04183649271726608, 0.49570536613464355, 0.9600026607513428, 0.9080538749694824, 0.01910424418747425, 0.6207230687141418, 0.9915458559989929, 0.46133360266685486, 0.9839102029800415, 0.020350901409983635, 0.17816443741321564, 0.9795886874198914, 0.5955864191055298, 0.6223548650741577, 0.43253186345100403, 0.2774139940738678, 0.5183238387107849, 0.09153367578983307, 0.034838054329156876, 0.39631715416908264, 0.5904518365859985, 0.8735523819923401, 0.9260737895965576, 0.9424097537994385, 0.22686123847961426, 0.9775280356407166, 0.9880383014678955, 0.7442149519920349, 0.16782650351524353, 0.5452573895454407, 0.9585480690002441, 0.050845544785261154, 0.9088014364242554, 0.9901217818260193, 0.6867887377738953, 0.9416897296905518, 0.3816048502922058, 0.0401664599776268, 0.5717545747756958, 0.0021444722078740597, 0.08334892243146896, 0.848414957523346, 0.00564985629171133, 0.5032182931900024, 0.027054250240325928, 0.042989399284124374, 0.9951210618019104, 0.49910420179367065, 0.3622700273990631, 0.53741455078125, 0.19420109689235687, 0.4249868392944336, 0.37112993001937866, 0.9549214839935303, 0.00010590001329546794, 0.9723601341247559, 0.19039003551006317, 0.3618868589401245, 0.9813993573188782, 0.9408482909202576, 0.1574213206768036, 0.030843155458569527]\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Let's get test data in batch size of 25 and make predictions.\n",
    "prediction_batches = [linear_classifier_predictor.predict(batch)\n",
    "                      for batch in np.array_split(X_test.astype(\"float32\"), 25)\n",
    "                     ]\n",
    "\n",
    "# Let's get a list of predictions\n",
    "print([pred.label['score'].float32_tensor.values[0] for pred in prediction_batches[0]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. <a name=\"10\">Clean up model artifacts</a>\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "You can run the following to delete the endpoint after you are done using it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:sagemaker:Deleting endpoint configuration with name: NLPLinearLearnerEndpoint2023\n",
      "INFO:sagemaker:Deleting endpoint with name: NLPLinearLearnerEndpoint2023\n"
     ]
    }
   ],
   "source": [
    "linear_classifier_predictor.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p39",
   "language": "python",
   "name": "conda_pytorch_p39"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}