![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 2</a>

## SageMaker built-in Training and Deployment with LinearLearner

In this notebook, we use SageMaker's built-in machine learning model __LinearLearner__ to predict the __isPositive__ field of our review dataset.

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Whether the review is positive or negative (1 or 0)
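
As a quick illustration of the __log_votes__ field, here is a minimal sketch of the log(1+votes) transform. It assumes the natural logarithm is used, which is consistent with the value 1.098612 that appears in the data preview (that would correspond to 2 votes). The helper name `log_votes` is hypothetical and not part of the dataset tooling.

```python
import math

def log_votes(votes):
    """Return log(1 + votes), assuming the natural logarithm."""
    return math.log1p(votes)

print(round(log_votes(0), 6))  # 0.0
print(round(log_votes(2), 6))  # 1.098612
```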

__Notes on AWS SageMaker__

* SageMaker is a fully managed machine learning service that gets you started quickly on building and training machine learning models, as we have already seen. Its integrated Jupyter notebook instances give easy access to data sources for exploration and analysis and abstract away many of the infrastructure details of hands-on ML: you don't have to manage servers or install libraries and dependencies.

* Besides building end-to-end machine learning models in SageMaker notebooks, as we have done so far, SageMaker also provides several __built-in machine learning algorithms__ (check "SageMaker Examples" in the top menu of your SageMaker instance for a complete, up-to-date list) that are optimized to run efficiently against extremely large data in a distributed environment. The built-in __LinearLearner__ algorithm is extremely fast at inference and can be trained at scale, in mini-batch fashion over GPU(s). The trained model can then be deployed directly into a production-ready hosted environment for easy access at inference.

We will follow these steps:

1. <a href="#1">Read the dataset</a>
2. <a href="#2">Exploratory Data Analysis</a>
3. <a href="#3">Text Processing: Stop words removal and stemming</a>
4. <a href="#4">Training - Validation - Test Split</a>
5. <a href="#5">Data processing with Pipeline and ColumnTransformer</a>
6. <a href="#6">Train a classifier with SageMaker built-in algorithm</a>
7. <a href="#7">Model evaluation</a>
8. <a href="#8">Deploy the model to an endpoint</a>
9. <a href="#9">Test the endpoint</a>
10. <a href="#10">Clean up model artifacts</a>

In [1]:
%pip install -q -r ../requirements.txt

Note: you may need to restart the kernel to use updated packages.


## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

We will use the __pandas__ library to read our dataset.

In [2]:
import pandas as pd

df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

print('The shape of the dataset is:', df.shape)

The shape of the dataset is: (70000, 6)


Let's look at the first five rows in the dataset.

In [3]:
df.head()

| | reviewText | summary | verified | time | log_votes | isPositive |
|---|---|---|---|---|---|---|
| 0 | PURCHASED FOR YOUNGSTER WHO\nINHERITED MY "TOO... | IDEAL FOR BEGINNER! | True | 1361836800 | 0.0 | 1.0 |
| 1 | unable to open or use | Two Stars | True | 1452643200 | 0.0 | 0.0 |
| 2 | Waste of money!!! It wouldn't load to my system. | Dont buy it! | True | 1433289600 | 0.0 | 0.0 |
| 3 | I attempted to install this OS on two differen... | I attempted to install this OS on two differen... | True | 1518912000 | 0.0 | 0.0 |
| 4 | I've spent 14 fruitless hours over the past tw... | Do NOT Download. | True | 1441929600 | 1.098612 | 0.0 |


## 2. <a name="2">Exploratory Data Analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the target distribution for our dataset.

In [4]:
df["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64
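
The counts above can be turned into class proportions with simple arithmetic (in pandas, `value_counts(normalize=True)` would give the same ratios). The sketch below uses only the two counts printed above. It also shows that always predicting the majority class would be right about 62.4% of the time on the full dataset, which is a useful baseline to keep in mind when reading accuracy numbers later.

```python
# Counts copied from the value_counts() output above
positive_count = 43692
negative_count = 26308

total_count = positive_count + negative_count
positive_share = positive_count / total_count

print(total_count)               # 70000
print(round(positive_share, 4))  # 0.6242
```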

Checking the number of missing values:

In [5]:
print(df.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


We have missing values in our text fields.

## 3. <a name="3">Text Processing: Stop words removal and stemming</a>
(<a href="#0">Go to top</a>)

In [6]:
# Import NLTK and download the tokenizer and stop word resources
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Below we define the stop word removal and text cleaning steps. The NLTK library provides a list of common English stop words. We use that list but keep some of its words (mostly negations such as "not" and "don't"), because they are useful for understanding the sentiment of a sentence.

In [7]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        
        # Check if the sentence is a missing value
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence=[]

        sent = sent.lower() # Lowercase
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse repeated whitespace (spaces, tabs, newlines)
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
        
    return final_text_list
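
To see why the negation words are kept out of the stop list, here is a simplified, self-contained sketch. It is not the notebook's `process_text()`: it assumes a tiny hand-written stop list instead of NLTK's, tokenizes with a regular expression, and skips both stemming and the length filter. The names `toy_stop`, `toy_excluding`, and `toy_clean` are hypothetical.

```python
import re

# Hypothetical miniature stop list; the notebook uses NLTK's English list instead
toy_stop = ["the", "is", "this", "not", "and", "was"]
toy_excluding = ["not"]  # words we want to keep because they carry sentiment
toy_stop_words = [w for w in toy_stop if w not in toy_excluding]

def toy_clean(sentence):
    """Lowercase, strip HTML tags, and drop stop words (regex tokens, no stemming)."""
    if not isinstance(sentence, str):  # missing values (e.g. NaN) become empty strings
        sentence = ""
    sentence = re.sub(r"<.*?>", "", sentence.lower())
    tokens = re.findall(r"[a-z']+", sentence)
    return " ".join(t for t in tokens if t not in toy_stop_words)

print(toy_clean("This product is <b>not</b> good"))  # product not good
print(toy_clean("This product is good"))             # product good
```

If "not" had stayed in the stop list, both sentences would reduce to the same string even though their sentiment is opposite.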

## 4. <a name="4">Training - Validation - Test Split</a>
(<a href="#0">Go to top</a>)

Let's split our dataset into training (80%), validation (10%) and test (10%) using sklearn's [__train_test_split()__](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df[["reviewText", "summary", "time", "log_votes"]],
                                                  df["isPositive"],
                                                  test_size=0.20,
                                                  shuffle=True,
                                                  random_state=324
                                                 )

X_val, X_test, y_val, y_test = train_test_split(X_val,
                                                y_val,
                                                test_size=0.5,
                                                shuffle=True,
                                                random_state=324)
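
The two-step split can be checked on a small stand-in dataset. The sketch below applies the same `test_size` values to 100 hypothetical row indices: the first call holds out 20%, and the second call halves that hold-out into validation and test, giving 80/10/10.

```python
from sklearn.model_selection import train_test_split

# 100 hypothetical row indices standing in for the real dataset
toy_rows = list(range(100))

# Step 1: keep 80% for training, hold out 20%
toy_train, toy_holdout = train_test_split(toy_rows, test_size=0.20,
                                          shuffle=True, random_state=324)

# Step 2: split the hold-out evenly into validation and test
toy_val, toy_test = train_test_split(toy_holdout, test_size=0.5,
                                     shuffle=True, random_state=324)

print(len(toy_train), len(toy_val), len(toy_test))  # 80 10 10
```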

In [9]:
print("Processing the reviewText fields")
X_train["reviewText"] = process_text(X_train["reviewText"].tolist())
X_val["reviewText"] = process_text(X_val["reviewText"].tolist())
X_test["reviewText"] = process_text(X_test["reviewText"].tolist())

print("Processing the summary fields")
X_train["summary"] = process_text(X_train["summary"].tolist())
X_val["summary"] = process_text(X_val["summary"].tolist())
X_test["summary"] = process_text(X_test["summary"].tolist())

Processing the reviewText fields
Processing the summary fields


The `process_text()` function from section 3 replaces missing values with an empty string, so the missing text fields found earlier need no separate imputation.

## 5. <a name="5">Data processing with Pipeline and ColumnTransformer</a>
(<a href="#0">Go to top</a>)

In the previous examples, we saw how to use a pipeline to prepare a single data field for our machine learning model. This time, we will work with multiple fields: numeric and text fields.

   * For the numerical features pipeline, the __numerical_processor__ below, we use a MinMaxScaler (tree-based models such as Decision Trees do not need feature scaling, but it typically helps gradient-based linear models like the one trained here, and it is a good way to see more data transforms in use). If different processing is desired for different numerical features, separate pipelines should be built, as shown below for the two text features.
   * For the text features pipeline, the __text_processor__ below, we use CountVectorizer() for the text fields.
   
The per-feature preprocessors are then combined in a single ColumnTransformer. Such a transformer is typically placed in a Pipeline together with an estimator, so that the transforms are applied automatically to the raw data both when fitting the model and when making predictions. In this notebook the estimator runs in SageMaker, so we apply the ColumnTransformer on its own: it is fitted on the training data and then used to transform the validation and test data.

In [10]:
# Grab model features/inputs and target/output
numerical_features = ['time',
                      'log_votes']

text_features = ['summary',
                 'reviewText']

model_features = numerical_features + text_features
model_target = 'isPositive'

In [11]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

### COLUMN_TRANSFORMER ###
##########################

# Preprocess the numerical features
numerical_processor = Pipeline([
    ('num_imputer', SimpleImputer(strategy='mean')),
    ('num_scaler', MinMaxScaler()) 
                                ])
# Preprocess 1st text feature
text_processor_0 = Pipeline([
    ('text_vect_0', CountVectorizer(binary=True, max_features=50))
                                ])

# Preprocess 2nd text feature (larger vocabulary)
text_processor_1 = Pipeline([
    ('text_vect_1', CountVectorizer(binary=True, max_features=150))
                                ])

# Combine all data preprocessors from above (add more, if you choose to define more!)
# For each processor/step specify: a name, the actual process, and finally the features to be processed
data_preprocessor = ColumnTransformer([
    ('numerical_pre', numerical_processor, numerical_features),
    ('text_pre_0', text_processor_0, text_features[0]),
    ('text_pre_1', text_processor_1, text_features[1])
                                    ]) 

### DATA PREPROCESSING ###
##########################

print('Datasets shapes before processing: ', X_train.shape, X_val.shape, X_test.shape)

X_train = data_preprocessor.fit_transform(X_train).toarray()
X_val = data_preprocessor.transform(X_val).toarray()
X_test = data_preprocessor.transform(X_test).toarray()

print('Datasets shapes after processing: ', X_train.shape, X_val.shape, X_test.shape)

Datasets shapes before processing:  (56000, 4) (7000, 4) (7000, 4)
Datasets shapes after processing:  (56000, 202) (7000, 202) (7000, 202)
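
The 202 columns after processing are the 2 scaled numerical features plus 50 vocabulary columns for `summary` and 150 for `reviewText`. The same bookkeeping can be seen on a toy frame. This is a minimal sketch with hypothetical data and much smaller `max_features` values. It also assumes `sparse_threshold=0`, which makes the ColumnTransformer always return a dense array, so no `.toarray()` call is needed here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with one missing numerical value
toy_df = pd.DataFrame({
    "time": [10.0, 20.0, 30.0],
    "log_votes": [0.0, None, 2.0],
    "summary": ["great product", "not great", "bad"],
    "reviewText": ["work well", "not work", "stop work fast"],
})

toy_numerical = Pipeline([
    ("num_imputer", SimpleImputer(strategy="mean")),
    ("num_scaler", MinMaxScaler()),
])

toy_preprocessor = ColumnTransformer([
    ("numerical_pre", toy_numerical, ["time", "log_votes"]),
    # Text columns are passed as a single string, not a list
    ("text_pre_0", CountVectorizer(binary=True, max_features=2), "summary"),
    ("text_pre_1", CountVectorizer(binary=True, max_features=3), "reviewText"),
], sparse_threshold=0)  # assumption: force dense output for easy inspection

toy_matrix = toy_preprocessor.fit_transform(toy_df)
print(toy_matrix.shape)  # (3, 7): 2 numerical + 2 summary + 3 reviewText columns
```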


## 6. <a name="6">Train a classifier with SageMaker built-in algorithm</a>
(<a href="#0">Go to top</a>)

We will call the SageMaker `LinearLearner()` below. 
* __Compute power:__ We will use the `instance_count` and `instance_type` parameters. This example uses an `ml.m4.xlarge` instance for training. We can change the instance type to suit our needs (for example, GPU instances for neural networks). 
* __Model type:__ `predictor_type` is set to __`binary_classifier`__, as we have a binary classification problem here; __`multiclass_classifier`__ could be used if there are 3 or more classes involved, or __`regressor`__ for a regression problem.

In [12]:
import sagemaker

# Call the LinearLearner estimator object
linear_classifier = sagemaker.LinearLearner(role=sagemaker.get_execution_role(),
                                            instance_count=1,
                                            instance_type='ml.m4.xlarge',
                                            predictor_type='binary_classifier')

We use the `record_set()` function of our `linear_classifier` estimator to build the training, validation, and test record sets that the estimator will consume.

In [13]:
train_records = linear_classifier.record_set(X_train.astype("float32"),
                                            y_train.values.astype("float32"),
                                            channel='train')
val_records = linear_classifier.record_set(X_val.astype("float32"),
                                          y_val.values.astype("float32"),
                                          channel='validation')
test_records = linear_classifier.record_set(X_test.astype("float32"),
                                           y_test.values.astype("float32"),
                                           channel='test')

The `fit()` function launches the training job, which applies a distributed version of the Stochastic Gradient Descent (SGD) algorithm to the data we pass in. We disable logs with `logs=False`; remove that parameter to see more details about the process. __This process takes about 5 minutes on an ml.m4.xlarge instance (the run recorded below took 4min 53s).__

In [14]:
%%time
linear_classifier.fit([train_records,
                       val_records,
                       test_records],
                      logs=False)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: linear-learner-2023-01-26-00-08-27-344



2023-01-26 00:08:27 Starting - Starting the training job....
2023-01-26 00:08:52 Starting - Preparing the instances for training..................
2023-01-26 00:10:25 Downloading - Downloading input data.....
2023-01-26 00:10:55 Training - Downloading the training image............
2023-01-26 00:12:01 Training - Training image download completed. Training in progress.............
2023-01-26 00:13:06 Uploading - Uploading generated training model.
2023-01-26 00:13:17 Completed - Training job completed
CPU times: user 258 ms, sys: 12.2 ms, total: 270 ms
Wall time: 4min 53s
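
As a rough intuition for what mini-batch SGD does during training, here is a small NumPy sketch of logistic regression trained with mini-batch gradient steps on hypothetical toy data. It is not LinearLearner's implementation (which is distributed and tunes more than is shown here). The data, learning rate, and batch size are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: label is 1 when the two features sum to more than 1
X_toy = rng.random((200, 2))
y_toy = (X_toy.sum(axis=1) > 1.0).astype(float)

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.5
batch_size = 20

for epoch in range(200):
    order = rng.permutation(len(X_toy))  # reshuffle the rows every epoch
    for start in range(0, len(X_toy), batch_size):
        idx = order[start:start + batch_size]
        # Sigmoid of the linear score gives the predicted probability
        scores = 1.0 / (1.0 + np.exp(-(X_toy[idx] @ weights + bias)))
        error = scores - y_toy[idx]
        # Gradient of the mean logistic loss over the mini-batch
        weights -= learning_rate * X_toy[idx].T @ error / len(idx)
        bias -= learning_rate * error.mean()

toy_accuracy = ((X_toy @ weights + bias > 0) == (y_toy == 1)).mean()
print(toy_accuracy)
```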


## 7. <a name="7">Model Evaluation</a>
(<a href="#0">Go to top</a>)

We can use SageMaker analytics to retrieve performance metrics of our choice on the test set, without deploying the model. Since this is a binary classification problem, we check the accuracy.

In [15]:
sagemaker.analytics.TrainingJobAnalytics(linear_classifier._current_job_name, 
                                         metric_names = ['test:binary_classification_accuracy']
                                        ).dataframe()

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


| | timestamp | metric_name | value |
|---|---|---|---|
| 0 | 0.0 | test:binary_classification_accuracy | 0.851 |
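
Accuracy is the fraction of predictions that match the true labels. On the 7,000 test rows, an accuracy of 0.851 corresponds to roughly 5,957 correct predictions. A minimal sketch of the computation, using hypothetical labels and a hypothetical helper named `accuracy`:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels, for illustration only
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]
print(accuracy(toy_true, toy_pred))  # 0.8
```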


## 8. <a name="8">Deploy the model to an endpoint</a>
(<a href="#0">Go to top</a>)

In the last part of this exercise, we deploy our model to another instance of our choice. This allows us to use the model in a production environment. Deployed endpoints can be used with other AWS services such as Lambda and API Gateway. If you are interested, a walkthrough is available here: https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/

Run the following cell to deploy the model. We can use different instance types, such as `ml.t2.medium` or `ml.c4.xlarge`. __This will take some time to complete (approximately 7-8 minutes).__

In [18]:
%%time
linear_classifier_predictor = linear_classifier.deploy(initial_instance_count = 1,
                                                       instance_type = 'ml.t2.medium',
                                                       endpoint_name = 'NLPLinearLearnerEndpoint2023'
                                                      )

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating model with name: linear-learner-2023-01-26-00-13-35-287
INFO:sagemaker:Creating endpoint-config with name NLPLinearLearnerEndpoint2023
INFO:sagemaker:Creating endpoint with name NLPLinearLearnerEndpoint2023


-----------------!CPU times: user 356 ms, sys: 12.4 ms, total: 369 ms
Wall time: 8min 32s
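
Once the endpoint is up, other services (for example a Lambda function) can call it through the SageMaker runtime API in `boto3`. The sketch below is an assumption-laden illustration rather than part of this notebook: it assumes the endpoint accepts `text/csv` input and returns JSON, and that each row already contains the 202 preprocessed features. The helper names `to_csv_payload` and `invoke_linear_learner` are hypothetical, and the second function needs AWS credentials and a live endpoint to run.

```python
import json

def to_csv_payload(rows):
    """Serialize feature rows as CSV text, one row per line."""
    return "\n".join(",".join(str(float(v)) for v in row) for row in rows)

def invoke_linear_learner(endpoint_name, rows):
    """Send preprocessed rows to a deployed endpoint and return the parsed JSON reply."""
    import boto3  # imported lazily so the payload helper works without AWS access

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",      # assumption: CSV input
        Accept="application/json",   # assumption: JSON output
        Body=to_csv_payload(rows),
    )
    return json.loads(response["Body"].read())

print(to_csv_payload([[0.5, 1, 0], [0, 0, 1]]))
```

The raw review text cannot be sent directly: the same text cleaning and fitted `data_preprocessor` used for training must be applied first.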


## 9. <a name="9">Test the endpoint</a>
(<a href="#0">Go to top</a>)

Let's use the deployed endpoint. We will send our test data to it and get predictions back.

In [19]:
import numpy as np

# Split the test data into 25 batches and get predictions for each batch
prediction_batches = [linear_classifier_predictor.predict(batch)
                      for batch in np.array_split(X_test.astype("float32"), 25)
                     ]

# Print the predicted scores (between 0 and 1; higher means more likely positive) for the first batch
print([pred.label['score'].float32_tensor.values[0] for pred in prediction_batches[0]])

[0.791047990322113, 0.19614160060882568, 0.9611192941665649, 0.6507992744445801, 0.6314922571182251, 0.13109198212623596, 0.7686936855316162, 0.03195519745349884, 0.9872755408287048, 0.9839196801185608, 0.7162235975265503, 0.5463312268257141, 0.8145678043365479, 0.708134651184082, 0.8952619433403015, 0.9754332304000854, 0.6256309747695923, 0.2471412867307663, 0.49016159772872925, 0.9344954490661621, 0.6512935757637024, 0.794103741645813, 0.1836925595998764, 0.7437755465507507, 0.280699759721756, 0.47037971019744873, 0.9491460919380188, 0.09314309805631638, 0.9856282472610474, 0.17901821434497833, 0.8227737545967102, 0.8395636677742004, 0.6778838038444519, 0.9942905902862549, 0.09969635307788849, 0.9922183752059937, 0.41878095269203186, 0.8897599577903748, 0.06615863740444183, 0.26962822675704956, 0.9754551649093628, 0.2753585875034332, 0.6342624425888062, 0.7751972675323486, 0.6481721997261047, 0.16822606325149536, 0.9967265129089355, 0.9503831267356873, 0.989985466003418, 0.7813619971
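
Two details of the cell above are worth spelling out. First, `np.array_split(..., 25)` produces 25 batches, not batches of size 25; with 7,000 test rows, each batch holds 280 rows. Second, the printed values are scores, not class labels. The sketch below uses a zero-filled stand-in for the test matrix and the first five scores printed above. The 0.5 cutoff is an assumption for illustration; the endpoint response also includes a `predicted_label` field that can be read in place of `score`.

```python
import numpy as np

# Zero-filled stand-in with the same shape as the processed test set
toy_features = np.zeros((7000, 202), dtype="float32")
toy_batches = np.array_split(toy_features, 25)
print(len(toy_batches), toy_batches[0].shape)  # 25 (280, 202)

# First five scores copied from the output above
first_scores = np.array([0.791047990322113, 0.19614160060882568,
                         0.9611192941665649, 0.6507992744445801,
                         0.6314922571182251])

# Assumed 0.5 cutoff: scores above it are labeled positive (1)
toy_labels = (first_scores > 0.5).astype(int)
print(toy_labels.tolist())  # [1, 0, 1, 1, 1]
```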

## 10. <a name="10">Clean up model artifacts</a>
(<a href="#0">Go to top</a>)

You can run the following to delete the endpoint after you are done using it.

In [20]:
linear_classifier_predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: NLPLinearLearnerEndpoint2023
INFO:sagemaker:Deleting endpoint with name: NLPLinearLearnerEndpoint2023
