# A notebook to explore text classification using word embedders

In this notebook, I explore a public dataset of books with metadata such as description, title and category/genre. 
I'll use a word embedder to vectorize the description and title, and then use XGBoost to train a classifier on the category. 
I will use Gensim's FastText implementation as the word embedder. 
I will then repeat this process using the native FastText implementation and compare the results. 
Finally, I will host these models on Amazon SageMaker.

## Install libraries, initialise variables, download dataset

In [None]:
! pip install gensim==3.8.3

In [None]:
import gensim
from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences
from gensim.utils import simple_preprocess
print(common_texts[1])
print(len(common_texts))

Gensim expects the sentences to already be tokenized and pre-processed.
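
As a rough sketch of what "already tokenized" means here: each sentence is a list of string tokens, and the corpus is a list of such lists, like `common_texts`. The `to_tokens` helper and the sample sentences below are made up for illustration and are not part of Gensim.

```python
import re

def to_tokens(sentence):
    """Lower-case a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

# Hypothetical raw sentences; a real corpus would come from the book descriptions.
raw_sentences = ["Human machine interface", "A survey of user opinion!"]

# A list of token lists: the shape expected for the `sentences` argument.
corpus = [to_tokens(s) for s in raw_sentences]
print(corpus)
```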

In [None]:
import pandas as pd
import numpy as np
import json
import sagemaker
import time

In [None]:
# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket() # replace with your own bucket if you have one 
s3 = sagemaker_session.boto_session.resource('s3')


prefix_gensim = 'data_gensim_xgb'
prefix_fasttext = 'data_fasttext'

## Get the data into a working format with just the features we need

In [None]:
# Delete file if it already exists (-f avoids an error when it does not)
! rm -f meta_Books.json
# Downloading the book metadata
! wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Books.json.gz
# Uncompressing
!gzip -d meta_Books.json.gz -f

The file is a bit too big, so in the line below we reduce it by taking a subset of the dataset.

In [None]:
#Reducing the dataset 
! head -n 100000 meta_Books.json > books_train.json

In [None]:
#load data
data=pd.read_json('books_train.json', lines=True)
#shuffle the data in place
data = data.sample(frac=1).reset_index(drop=True)
# show first few rows
data.head()

We are only interested in a few columns from this dataset, so we will create a dataframe that only contains these.

In [None]:
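# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
data_subset = data[["category","description", "title" ]].copy()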

In [None]:
data_subset.head()

We will do some analysis of the data we have here to see how the data looks.

In [None]:
length = data_subset.category.apply(len)

In [None]:
length.unique()

In [None]:
data_subset["cnt_cats"] = data_subset.category.apply(len)

In [None]:
data_subset["cnt_desc"] = data_subset.description.apply(len)
data_subset.head()

In [None]:
# delete the rows that have no category or no description
data_subset = data_subset[data_subset.cnt_cats != 0]
data_subset = data_subset[data_subset.cnt_desc != 0]

In [None]:
data_subset.head()

In [None]:
data_subset["cat_x2"] = data_subset["category"].str[1]

In [None]:
data_subset.head(10)

We can see that the category column holds an array, which is a hierarchical classification of the book. We can train our classifier on just one of its levels. They are all books, so the first element is not interesting, but the second element looks more useful.

We also want to clean some of the data, as we can see there were some encoding issues, which we can fix with a "replace".
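
The two steps described above (take the second level of the hierarchy, then undo the HTML entity encoding) can be sketched in plain Python. This is an illustration only: `second_level_category` and the sample list are hypothetical, and `html.unescape` is a swapped-in alternative that decodes all HTML entities rather than only `&amp;` as the notebook's `replace` does.

```python
import html

def second_level_category(categories):
    """Return the second element of a category list with HTML entities decoded,
    or None when the list has fewer than two levels."""
    if len(categories) < 2:
        return None
    return html.unescape(categories[1])

# Invented example in the same shape as the dataset's category column.
example = ["Books", "Literature &amp; Fiction", "Classics"]
print(second_level_category(example))
```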

In [None]:
data_subset["cat_x2"] = data_subset["cat_x2"].replace("&amp;", "&", regex=True)

In [None]:
data_subset["cat_x2"].head()

In [None]:
len(data_subset["cat_x2"].unique())

In [None]:
data_subset['description_str'] = data_subset['description'].apply(lambda x: ' '.join(map(str, x)))

In [None]:
data_subset.head()

We want to convert the category column to a categorical type, so that each label gets an integer code we can train on.
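
To make the label encoding concrete, here is a minimal pure-Python sketch of the same idea: distinct labels are sorted and numbered from zero, which is also how pandas orders string categories by default. The helper name and the sample labels are invented for illustration. Keeping the inverse mapping around is handy for turning predicted codes back into names.

```python
def build_label_codes(labels):
    """Map each distinct label to an integer code, assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

# Invented labels standing in for the cat_x2 column.
labels = ["History", "Arts & Photography", "History", "Travel"]

label_to_code = build_label_codes(labels)
code_to_label = {code: label for label, code in label_to_code.items()}
codes = [label_to_code[label] for label in labels]
print(codes)
```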

In [None]:
data_subset["cat_x2"] = data_subset["cat_x2"].astype("category")

In [None]:
data_subset["cat_x2"].cat.codes

In [None]:
data_subset["cat_x2_code"] = data_subset["cat_x2"].cat.codes

In [None]:
data_subset.head()

## GenSim requires us to do some cleansing of the data and tokenize 

In [None]:
def remove_numbers(text): 
    '''  
    This function takes strings containing numbers and returns strings with numbers removed.
    '''
    return re.sub(r'\d+', '', text) 

In [None]:
def remove_mentions(text):
    '''  
    This function takes strings containing mentions and returns strings with 
    mentions (@ and the account name) removed.
    Input(string): one tweet, contains mentions
    Output(string): one tweet, mentions (@ and the account name mentioned) removed 
    '''
    mentions = re.compile(r'@\w+ ?')
    return mentions.sub(r'', text)

In [None]:
def extract_mentions(text):
    '''
    This function takes strings containing mentions and returns strings with 
    mentions (@ and the account name) extracted into a different element,
    and removes the mentions in the original sentence.
    Input(string): one sentence, contains mentions
    '''
    mentions = [i[1:] for i in text.split() if i.startswith("@")]
    sentence = re.compile(r'@\w+ ?').sub(r'', text)
    return sentence,mentions

In [None]:
! pip install spacy

In [None]:
! pip install textblob

In [None]:
import nltk
import spacy
from textblob import TextBlob
import re
import string
import glob
import sagemaker

In [None]:
punc_list = string.punctuation #you can self define list of punctuation to remove here
def remove_punctuation(text): 
    """
    This function takes strings containing self defined punctuations and returns
    strings with punctuations removed.
    """
    translator = str.maketrans('', '', punc_list) 
    return text.translate(translator) 

In [None]:
def remove_whitespace(text): 
    '''
    This function takes a string and returns it with leading and trailing 
    whitespace removed and runs of whitespace collapsed to a single space.
    '''
    return  " ".join(text.split())

In [None]:
def remove_html_tags(text):
    """Remove html tags from a string"""
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

In [None]:
data_subset.head()

In [None]:
data_subset["description_str"]=data_subset["description_str"].apply(remove_html_tags)
data_subset["title"]=data_subset["title"].apply(remove_html_tags)

In [None]:
data_subset["description_str"] = data_subset["description_str"].str.lower()
data_subset["title"] = data_subset["title"].str.lower()

In [None]:
data_subset["description_str"]=data_subset["description_str"].apply(remove_whitespace).apply(remove_punctuation).apply(remove_numbers)
data_subset["title"]=data_subset["title"].apply(remove_whitespace).apply(remove_punctuation).apply(remove_numbers)
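
For reference, the cleaning steps can also be combined into one function. The sketch below is not the notebook's code: `clean_text` is a hypothetical helper, and it deliberately collapses whitespace last so that the gaps left behind by removed punctuation and digits are collapsed too (the cells above collapse whitespace first and rely on the tokenizer to absorb the extra spaces).

```python
import re
import string

_TAG_RE = re.compile(r"<.*?>")
_DIGIT_RE = re.compile(r"\d+")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Strip HTML tags, lower-case, drop punctuation and digits, collapse whitespace."""
    text = _TAG_RE.sub("", text)
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    text = _DIGIT_RE.sub("", text)
    # Collapse whitespace last, after the other removals have left gaps.
    return " ".join(text.split())

sample = "<p>The 3 Musketeers -  A  Classic!</p>"
print(clean_text(sample))
```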


In [None]:
nltk.download('punkt')

In [None]:
from nltk.tokenize import word_tokenize 
def tokenize_sent(text): 
    ''' 
    This function takes strings and returns tokenized words.
    '''
    word_tokens = word_tokenize(text)  
    return word_tokens 

In [None]:
data_subset["description_str_token"] = data_subset["description_str"].apply(tokenize_sent)

In [None]:
data_subset["title_token"] = data_subset["title"].apply(tokenize_sent)

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

In [None]:
stopwords_list = set(stopwords.words('english'))

In [None]:
from collections import Counter
counter = Counter()
for word in  [w for sent in data_subset["description_str_token"] for w in sent]:
    counter[word] += 1        
counter.most_common(10)

In [None]:
# 10 least frequent words (the slice stops before index -11, so it yields 10 items)
counter.most_common()[:-11:-1]

In [None]:
top_n = 10
bottom_n = 10
stopwords_list |= set([word for (word, count) in counter.most_common(top_n)])
# [:-bottom_n - 1:-1] takes exactly bottom_n items from the tail of the list
stopwords_list |= set([word for (word, count) in counter.most_common()[:-bottom_n - 1:-1]])
stopwords_list |= {'thats'}
def remove_stopwords(tokenized_text): 
    '''
    This function takes a list of tokenized words from the description and title, removes self-defined stop words from the list,
    and returns the list of words with stop words removed
    '''
    filtered_text = [word for word in tokenized_text if word not in stopwords_list] 
    return filtered_text
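
Reverse slices over `most_common()` are easy to get off by one, so here is a small self-contained check on an invented token list. With a negative step, the slice stops *before* the given index, so `[:-n - 1:-1]` yields the `n` least frequent entries, while `[:-n:-1]` yields only `n - 1`.

```python
from collections import Counter

# Toy tokens with distinct counts: a=5, b=4, c=3, d=2, e=1.
tokens = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
counter = Counter(tokens)
n = 2

most_frequent = [w for w, _ in counter.most_common(n)]
least_frequent = [w for w, _ in counter.most_common()[:-n - 1:-1]]
short_tail = [w for w, _ in counter.most_common()[:-n:-1]]

print(most_frequent, least_frequent, short_tail)
```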

In [None]:
data_subset["description_str_token"] = data_subset["description_str_token"].apply(remove_stopwords)
data_subset["title_token"] = data_subset["title_token"].apply(remove_stopwords)

In [None]:
data_subset.head()

In [None]:
! pip install autocorrect

In [None]:
from autocorrect import Speller

In [None]:
spell = Speller(lang='en', fast = True)
def spelling_correct(tokenized_text):
    """
    This function takes a list of tokenized words from a sentence, spell-checks every word and returns the 
    corrected words where applicable. Note that not every misspelled word will be identified.
    """
    corrected = [spell(word) for word in tokenized_text] 
    return corrected

In [None]:
time_start = time.time()

data_subset["description_str_token"] = data_subset["description_str_token"].apply(spelling_correct)
data_subset["title_token"] = data_subset["title_token"].apply(spelling_correct)

print('Spelling corrected! Time elapsed: {} seconds'.format(time.time()-time_start))

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
def detokenize_sent(text): 
    ''' 
    This function takes a list of tokens and joins them back into a single string.
    '''
    word_detokens = TreebankWordDetokenizer().detokenize(text)
    return word_detokens

In [None]:
data_subset["description_str_detoken"] = data_subset["description_str_token"].apply(detokenize_sent)

In [None]:
data_subset["title_str_detoken"] = data_subset["title_token"].apply(detokenize_sent)

In [None]:
data_subset["desc_title_str_detoken"] = data_subset["description_str_detoken"] + ' ' + data_subset["title_str_detoken"]

In [None]:
# Assign the result back instead of using inplace=True on a column selection
data_subset['description_str'] = data_subset['description_str'].replace('', np.nan)

In [None]:
data_subset.head()

In [None]:
# remove the rows which don't have data
data_subset = data_subset.dropna()

In [None]:
data_subset = data_subset.reset_index()

In [None]:
data_subset.to_csv('data_subset.csv')

In [None]:
#data_subset = pd.read_csv('data_subset.csv')

In [None]:
data_subset.head()

### Now data has been cleansed, we are ready to train a model

When we return a sentence in its vectorized format, we get an array of 100 values, as that is the `size` we have chosen for the model below; concatenating the title and description vectors later gives 200 features per book. The vector captures the semantics of the sentence, which lets us compare two sentences to see how similar they are, for instance, and, for this use case, train a classifier.
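
Comparing two vectorized sentences is commonly done with cosine similarity. The sketch below uses short invented vectors in plain Python rather than real embeddings; `cosine_similarity` is a hypothetical helper, not a Gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented 3-dimensional "embeddings"; real ones would come from the trained model.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [-3.0, 0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b), cosine_similarity(vec_a, vec_c))
```

Values close to 1 indicate vectors pointing in the same direction, and values near 0 indicate unrelated directions.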

In [None]:
model_gensim = FastText(size=100, window=3, min_count=1) 

In [None]:
token_desc = data_subset["description_str_token"] + data_subset["title_token"]
token_desc.head()

In [None]:
time_start = time.time()
model_gensim.build_vocab(sentences=token_desc)

print('Build vocab done! Time elapsed: {} seconds'.format(time.time()-time_start))

In [None]:
time_start = time.time()
model_gensim.train(sentences=token_desc, total_examples=len(token_desc), epochs=50) 
print('Model trained! Time elapsed: {} seconds'.format(time.time()-time_start))

In [None]:
from gensim.test.utils import get_tmpfile
fname = get_tmpfile("fasttext.model")

model_gensim.save('books_gensim_model.bin')

In [None]:
description_str_detoken = data_subset["description_str_detoken"]

In [None]:
vector_description_str = model_gensim.wv[description_str_detoken]

In [None]:
len(vector_description_str)

In [None]:
description_str_detoken[1]

In [None]:
vector_description_str[1]

In [None]:
vector_description_str = np.split(vector_description_str,len(vector_description_str))

In [None]:
vector_description_str[1].shape

In [None]:
title_str_detoken = data_subset["title_str_detoken"]

In [None]:
vector_title_str = model_gensim.wv[title_str_detoken]

In [None]:
len(vector_title_str)

In [None]:
vector_title_str.shape

In [None]:
vector_title_str = np.split(vector_title_str,len(vector_title_str))

In [None]:
vector_desc_title = np.concatenate((vector_title_str, vector_description_str), axis=1)

We want to reshape the result into a 2D array with one row per book, with the title and description vectors concatenated side by side.
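
As a sanity check on this split, concatenate and reshape sequence, the sketch below runs the same steps on two tiny invented arrays and compares the result with `np.hstack`, which produces the same layout in one call. The array names and the 3-dimensional size are assumptions for illustration; the notebook's vectors have 100 dimensions each.

```python
import numpy as np

# Two toy "embedding" matrices: 2 rows (books) x 3 dimensions each.
title_vecs = np.arange(6, dtype=float).reshape(2, 3)
desc_vecs = title_vecs + 100.0

# The notebook's route: split into per-row arrays, concatenate, reshape.
split_title = np.split(title_vecs, len(title_vecs))
split_desc = np.split(desc_vecs, len(desc_vecs))
stacked = np.concatenate((split_title, split_desc), axis=1)  # shape (2, 2, 3)
via_reshape = stacked.reshape(len(split_title), 6)

# One-step alternative: place the two matrices side by side.
via_hstack = np.hstack((title_vecs, desc_vecs))

print(via_reshape)
```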

In [None]:
big_vector_title_descr = vector_desc_title.reshape(len(vector_title_str),200)

In [None]:
big_vector_title_descr.shape

In [None]:
data_subset.head()

In [None]:
len(data_subset)

In [None]:
df_big_vector_title_descr = pd.DataFrame(data=big_vector_title_descr)

In [None]:
df_big_vector_title_descr.head()

Both DataFrames need the same index for the concatenation to line up row by row, which is why we reset the index on `data_subset` earlier.

In [None]:
data_subset.head()

In [None]:
data_subset_2 = pd.concat([data_subset, df_big_vector_title_descr], axis=1)

In [None]:
data_subset_2.head()

### We want to check the count of each of the classes to check for class imbalance

With another version of XGBoost, we can supply the weights as a vector parameter for the training, which helps the model be less biased by the class imbalance.
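
One common way to build such a weight vector is the inverse-frequency heuristic, where each class gets `n_samples / (n_classes * class_count)`. The sketch below is an assumption about how the weights could be computed, not something the notebook does; `balanced_sample_weights` and the toy labels are hypothetical.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Return one weight per sample, inversely proportional to its class frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
    return [class_weight[c] for c in labels]

# Toy imbalanced labels: class 0 appears three times, class 1 once.
y_example = [0, 0, 0, 1]
weights = balanced_sample_weights(y_example)
print(weights)
```

The resulting list has the same length as the labels, so it could be passed alongside the features and labels wherever a per-sample weight is accepted.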

In [None]:
data_subset_2['cat_x2_code'].unique()

In [None]:
data_subset_2_cat_x2_agg = data_subset_2.groupby(by=['cat_x2_code']).count()['index']
print(data_subset_2_cat_x2_agg)

Get the data in the format ready for fasttext too

In [None]:
data_subset_2["fastText_label"] = '__label__' + data_subset_2["cat_x2_code"].astype(str) 

We have our data in a format that we like now, but for the training, we can select a few columns for this.

In [None]:
list(data_subset_2)

It might be better to pick the columns rather than drop so many. Let's look at the head.

In [None]:
# create new dataframes before saving the data as CSV
df_gensim_xgb_sampleweight = data_subset_2.drop(columns=['index','category','description','title','cnt_cats','cnt_desc','cat_x2','description_str','description_str_token','description_str_detoken','desc_title_str_detoken','title_str_detoken','title_token','fastText_label'])
# .copy() so that adding columns to df_fasttext does not trigger SettingWithCopyWarning
df_fasttext = data_subset_2[['fastText_label','description_str_detoken', 'title_str_detoken']].copy()

In [None]:
df_fasttext['token_sentence'] = df_fasttext['description_str_detoken'] + " " + df_fasttext['title_str_detoken']

In [None]:
#df_fasttext['untoken'] = [' '.join(map(str, l)) for l in df_fasttext['token_sentence']]

In [None]:
df_fasttext['full'] = df_fasttext['fastText_label'] + ' ' + df_fasttext['token_sentence'] 

In [None]:
df_fasttext.head()

In [None]:
df_gensim_xgb_sampleweight.head()

### For this version of XGBoost, we need to supply 3 arguments to the model which are the features, labels and optionally the sample weight which is going to help improve the performance of the model as we have an imbalanced dataset

In [None]:
X = df_gensim_xgb_sampleweight.drop(['cat_x2_code'], axis=1).values
y = df_gensim_xgb_sampleweight['cat_x2_code'].values


In [None]:
X

In [None]:
import time

from sklearn.manifold import TSNE

time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(X)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

In [None]:
tsne_results

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

In [None]:
df_tsne = pd.DataFrame(tsne_results[:,0], columns=['tsne-2d-one'])
df_tsne['tsne-2d-two'] = pd.DataFrame(tsne_results[:,1])
df_tsne['y'] = y
df_tsne

In [None]:
df_tsne = pd.DataFrame(tsne_results[:,0], columns=['tsne-2d-one'])
df_tsne['tsne-2d-two'] = tsne_results[:,1]
df_tsne['y'] = y
df_tsne

In [None]:
plt.figure(figsize=(16,10))
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="y",
    palette=sns.color_palette("hls", len(np.unique(y))),  # one colour per class
    data=df_tsne,
    legend="full",
    alpha=0.3
)

In [None]:
from sklearn.decomposition import PCA


In [None]:
pca_30 = PCA(n_components=50)
pca_result_30 = pca_30.fit_transform(X)
print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_30.explained_variance_ratio_)))

In [None]:
time_start = time.time()
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300)
tsne_pca_results = tsne.fit_transform(pca_result_30)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))


In [None]:
df_tsne_pca = pd.DataFrame(tsne_pca_results[:,0], columns=['tsne-2d-one'])
df_tsne_pca['tsne-2d-two'] = pd.DataFrame(tsne_pca_results[:,1])
df_tsne_pca['y'] = y
df_tsne_pca

In [None]:
plt.figure(figsize=(16,10))
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="y",
    palette=sns.color_palette("hls", len(np.unique(y))),  # one colour per class
    data=df_tsne_pca,
    legend="full",
    alpha=0.3
)

In [None]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
yX_train = np.column_stack((y_train, X_train))
yX_test = np.column_stack((y_test, X_test))
np.savetxt("book_gensim_train_v1.csv", yX_train, delimiter=",", fmt='%0.3f')
np.savetxt("book_gensim_test_v1.csv", yX_test, delimiter=",", fmt='%0.3f')

In [None]:
# Upload the dataset to an S3 bucket
input_train = sagemaker_session.upload_data(path='book_gensim_train_v1.csv', key_prefix='%s/data' % prefix_gensim)
input_validation = sagemaker_session.upload_data(path='book_gensim_test_v1.csv', key_prefix='%s/data' % prefix_gensim)

In [None]:
#from sagemaker.inputs import TrainingInput

train_data = sagemaker.inputs.TrainingInput(s3_data=input_train,content_type="csv")
validation_data = sagemaker.inputs.TrainingInput(s3_data=input_validation,content_type="csv")

In our training script, we have a parser that is expecting the hyper-parameters below.

In [None]:
hyperparams = {
        "n_estimators": "300", 
        "n_jobs":"4",
        "max_depth":"10",
#        "min_child_weight": "6",
        "learning_rate": "0.1", 
        "objective":'multi:softmax', 
#        "reg_alpha": "10",
        "gamma": "4"
}

instance_type = "ml.m5.2xlarge"
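
The training script `train.py` is not shown in this notebook. In script mode, SageMaker passes the hyperparameters to the script as command-line arguments, so a parser for the dictionary above might look roughly like the sketch below. The function name and all default values are assumptions; only the argument names come from the `hyperparams` dictionary.

```python
import argparse

def parse_hyperparameters(argv=None):
    """Parse the hyperparameters passed on the command line (hypothetical train.py helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--n_jobs", type=int, default=1)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--learning_rate", type=float, default=0.3)
    parser.add_argument("--objective", type=str, default="multi:softmax")
    parser.add_argument("--gamma", type=float, default=0.0)
    # parse_known_args ignores any extra arguments the script does not declare.
    args, _ = parser.parse_known_args(argv)
    return args

# Simulated command line matching some of the values in the dictionary above.
args = parse_hyperparameters(
    ["--n_estimators", "300", "--max_depth", "10", "--learning_rate", "0.1", "--gamma", "4"]
)
print(args)
```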

Below is our estimator, which uses the SageMaker XGBoost framework container with our own training script. That script uses the open-source XGBoost library, not the SageMaker built-in algorithm.

In [None]:
# updated XGBoost to XGBClassifier https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html#train-a-model-with-open-source-xgboost
from sagemaker import get_execution_role
from sagemaker.xgboost.estimator import XGBoost

role = get_execution_role()

xgb_estimator = XGBoost(
    entry_point="train.py",
    hyperparameters=hyperparams,
    role=role,
    instance_count=1,
    instance_type='ml.m5.4xlarge',
    framework_version="1.2-1",
    eval_metric="merror",
)

In [None]:
time_start = time.time()


xgb_estimator.fit({'train': train_data, 'validation': validation_data })

print('xgb_estimator model trained! Time elapsed: {} seconds'.format(time.time()-time_start))

In [None]:
xgb_predictor_gensim = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge"
)

In [None]:
print(xgb_predictor_gensim)

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import NumpyDeserializer
csv_serializer = CSVSerializer()
np_deserializer = NumpyDeserializer()

xgb_predictor_gensim.serializer = csv_serializer
xgb_predictor_gensim.deserializer = np_deserializer



In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

predictions_test_xgb_weighted = [ float(xgb_predictor_gensim.predict(x)) for x in X_test]  
score = f1_score(y_test,predictions_test_xgb_weighted,labels=np.unique(y),average='micro')

print('F1 Score(micro): %.1f' % (score * 100.0))

In [None]:
# xgb_predictor_gensim.delete_endpoint()

### In the next steps, we will use the built-in XGBoost, which doesn't allow you to set the weights for the classes, and see how the results differ.

If we use the XGBClassifier, then we are going to need to divide our training data into 3 files: X = features, y = labels, and W = weights, all the same length. 

We are going to need to create a map from class to weight. 

In [None]:
import boto3
container_uri = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, version='1.0-1')

# Create the estimator
xgb_bi = sagemaker.estimator.Estimator(container_uri,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix_gensim),
                                    sagemaker_session=sagemaker_session)
# Set the hyperparameters
xgb_bi.set_hyperparameters(eta=0.1,
                        max_depth=10,
                        gamma=4,
                        num_class=len(np.unique(y)),
                        alpha=10,
                        min_child_weight=6,
                        silent=0,
                        objective='multi:softmax',
                        num_round=300)

In [None]:
xgb_bi.fit({'train': train_data, 'validation': validation_data })

# We trained our model and now want to test out the predictions

In [None]:
xgb_predictor = xgb_bi.deploy(
    initial_instance_count=1, 
    instance_type='ml.m4.xlarge'
)

In [None]:
print(xgb_predictor)

In [None]:
xgb_predictor.serializer = csv_serializer

predictions_test = [ float(xgb_predictor.predict(x).decode('utf-8')) for x in X_test] 
score = f1_score(y_test,predictions_test,labels=np.unique(y),average='micro')

print('F1 Score(micro): %.1f' % (score * 100.0))

In [None]:
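# Build the 200-dimensional input the classifier was trained on:
# title vector first, then description vector.
title_vec = model_gensim.wv[data_subset_2.title_str_detoken[42]]
description_vec = model_gensim.wv[data_subset_2.description_str_detoken[42]]
sentence_word_embedding = np.concatenate((title_vec, description_vec))
class_prediction = xgb_predictor_gensim.predict(sentence_word_embedding)

print(class_prediction)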

All done, you can delete your endpoint

In [None]:
#xgb_predictor.delete_endpoint()

# Next we will test out the FastText native supervised Text classification 

In this step, we want to see whether the native FastText algorithm can do the same with less work.
With native FastText, you do not need to tokenize your sentences yourself, and you do not have to pick a vector size for the model training, since the library falls back to a default. 
This algorithm does the work for you behind the scenes. 
What we do need to do, though, is get the data into the required format: we prepend the string "__label__" to the label, concatenate that with the description and title into one field, and present that to the algorithm. 
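
The required line format can be sketched with a small helper. The function name and the sample text are invented for illustration; the only fixed part is the `__label__` prefix followed by the label and then the text on the same line.

```python
def to_fasttext_line(label_code, description, title):
    """Build one training line: '__label__<code> <description> <title>'."""
    # Collapse whitespace so the line stays on a single row with single spaces.
    text = " ".join(f"{description} {title}".split())
    return f"__label__{label_code} {text}"

# Invented example values.
line = to_fasttext_line(7, "epic tale of friendship", "three musketeers")
print(line)
```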



In [None]:
df_fasttext.head()

We take the same index as in our test example above to see whether the fastText algorithm makes the same prediction.

In [None]:
! pip install fasttext==0.9.1

In [None]:
import fasttext

In [None]:
fasttext_dataset = df_fasttext['full']

In [None]:
from sklearn.model_selection import train_test_split

train_fasttext_native, val_fasttext_native = train_test_split(fasttext_dataset, test_size=0.33, random_state=42)

train_file_name = 'train_books_fasttext_native.csv'
valid_file_name = 'valid_books_fasttext_native.csv'
train_fasttext_native.to_csv(train_file_name, index=False, header=False)
val_fasttext_native.to_csv(valid_file_name, index=False, header=False)

In [None]:
model_native = fasttext.train_supervised(input=train_file_name, lr=0.1, epoch=50)

In [None]:
modelwordGram = fasttext.train_supervised(input=train_file_name, lr=0.1, epoch=50, wordNgrams=2)

### We will run a simple test with the validation data, we are returned the precision and recall, and we can play with the hyperparameters to tune this 

In [None]:
FastText_Precision_Recall = model_native.test(valid_file_name, k=1)
print(FastText_Precision_Recall)

In [None]:
# Named fasttext_f1 so that it does not shadow sklearn's f1_score imported earlier
fasttext_f1 = 2*((FastText_Precision_Recall[1]*FastText_Precision_Recall[2])/(FastText_Precision_Recall[1]+FastText_Precision_Recall[2]))
print('F1 Score(micro): %.1f' % (fasttext_f1 * 100.0))
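
The F1 computation is the harmonic mean of precision and recall. A guarded version, which avoids dividing by zero when both are 0, could look like the sketch below. The helper name and the example tuple are invented; the tuple only mimics the `(number of examples, precision, recall)` shape returned by `test`.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up result in the (N, precision, recall) shape, not a real measurement.
n_examples, precision_at_1, recall_at_1 = (1000, 0.8, 0.8)
example_f1 = f1_from_precision_recall(precision_at_1, recall_at_1)
print('F1: %.1f' % (example_f1 * 100.0))
```

When every example has exactly one label and `k=1`, precision and recall coincide, so the F1 score equals both.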

In [None]:
# The file was written without a header row, so read it without one
df_valid_ft = pd.read_csv(valid_file_name, header=None, names=['full'])
df_valid_ft.head()

In [None]:
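# Join with a space, as in the training data, so the last description word and first title word do not merge
fasttext_sample_validation = data_subset_2['description_str_detoken'] + ' ' + data_subset_2['title_str_detoken']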
fasttext_sample_validation.head()

## Test the prediction versus what we got with the XGB classifier

In [None]:
model_native.predict(fasttext_sample_validation[1], k=1)

In [None]:
modelwordGram.predict(fasttext_sample_validation[1], k=1)

We can host our model on SageMaker. The BlazingText built-in algorithm is compatible with fastText models, so we can upload the fastText model to S3, point a SageMaker endpoint configuration to this model, and then deploy our endpoint.

In [None]:
model_filename = "books_fasttext_native.bin"
model_native.save_model(model_filename)

In [None]:
from time import gmtime, strftime


In [None]:
!tar -czvf model.tar.gz books_fasttext_native.bin
model_location = sagemaker_session.upload_data("model.tar.gz", bucket=bucket, key_prefix=f"fasttext/model-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}/output")
# Remove the local copies; the archive created above is named model.tar.gz
!rm -f model.tar.gz books_fasttext_native.bin

In [None]:
container = sagemaker.image_uris.retrieve("blazingtext",boto3.Session().region_name,  "1")
print('Using SageMaker BlazingText container: {} ({})'.format(container, boto3.Session().region_name))

# Deploy endpoint in SageMaker

BlazingText is compatible with fastText models, so you can train the fastText model wherever you want, push it to S3 in the required format (saved as a .tar.gz file), and then deploy it in SageMaker, which takes care of the heavy lifting.
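
The packaging step can also be done from Python with the standard `tarfile` module instead of the shell. The sketch below uses a placeholder file in a temporary directory, not a real fastText model, and `package_model` is a hypothetical helper.

```python
import os
import tarfile
import tempfile

def package_model(model_path, archive_path):
    """Write model_path into a gzip-compressed tar archive, stored under its base name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path

# Placeholder bytes standing in for a saved model file.
workdir = tempfile.mkdtemp()
dummy_model = os.path.join(workdir, "books_fasttext_native.bin")
with open(dummy_model, "wb") as f:
    f.write(b"placeholder bytes, not a real fastText model")

archive = package_model(dummy_model, os.path.join(workdir, "model.tar.gz"))

# List the archive contents to confirm what would be uploaded.
with tarfile.open(archive, "r:gz") as tar:
    member_names = tar.getnames()
print(member_names)
```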

In [None]:
#use blazing text container and the fasttext model
model_fastText_book = sagemaker.Model(
    model_data=model_location, 
    image_uri=container, 
    role=role, 
    sagemaker_session=sagemaker_session)

# Deploy the model to a real-time endpoint

model_fastText_book.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m4.xlarge')

from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

predictor = sagemaker.Predictor(
    endpoint_name=model_fastText_book.endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)


In [None]:
fasttext_sample_validation[1]

In [None]:
sentence = [ fasttext_sample_validation[1] ]
payload = {"instances": sentence }

In [None]:
predictions = predictor.predict(payload)
print(predictions)

# Clean up, delete endpoint

In [None]:
#predictor.delete_endpoint()