# Loading Word Embeddings in SageMaker for Text Classification with TensorFlow 2

This notebook demonstrates two aspects of Amazon SageMaker. First, we'll use SageMaker Script Mode with a prebuilt TensorFlow 2 framework container, which lets you use a training script similar to one you would use outside SageMaker. Second, we'll use SageMaker input channels to load word embeddings into the container for training. The word embeddings will be used with a Convolutional Neural Network (CNN) in TensorFlow 2 to perform text classification.

We'll begin with some necessary imports.

In [None]:
import os
import sys
import numpy as np
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Prepare Dataset and Embeddings

First, we download the 20 Newsgroups dataset.

In [None]:
!mkdir -p ./20_newsgroup
!wget -O ./20_newsgroup/news20.tar.gz http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz
!tar -xvzf ./20_newsgroup/news20.tar.gz

The next step is to download the GloVe word embeddings that we will load in the neural net.

In [None]:
!mkdir -p ./glove.6B
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip -d ./glove.6B

Next, we build an index that maps each word to its GloVe embedding vector.

In [None]:
BASE_DIR = ''
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')
TEXT_DATA_DIR = os.path.join(BASE_DIR, '20_newsgroup')
MAX_SEQUENCE_LENGTH = 1000
MAX_NUM_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        # each line holds a word followed by the components of its vector
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

The 20 Newsgroups text also must be preprocessed. For example, the labels for each sample must be extracted and mapped to a numeric index.

In [None]:
texts = [] # list of text samples
labels_index = {} # dictionary mapping label name to numeric id
labels = [] # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                args = {} if sys.version_info < (3,) else {'encoding': 'latin-1'}
                with open(fpath, **args) as f:
                    t = f.read()
                    i = t.find('\n\n')  # skip header
                    if 0 < i:
                        t = t[i:]
                    texts.append(t)
                labels.append(label_id)

print('Found %s texts.' % len(texts))

We can use Keras text preprocessing functions to tokenize the text, limit the sequence length of the samples, and pad shorter sequences as necessary. Additionally, the preprocessed dataset must be split into training and validation sets.

In [None]:
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]
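
To make the padding step concrete, the sketch below mimics in plain Python what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): shorter sequences are left-padded with zeros, and longer ones keep only their last `maxlen` tokens. The helper name `pad_pre` and the toy sequences are hypothetical and exist only for illustration; the notebook itself uses the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError('maxlen must be positive')
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# made-up token index sequences of different lengths
toy_sequences = [[5, 2], [7, 1, 9, 4, 3, 8]]
toy_padded = pad_pre(toy_sequences, maxlen=4)
print(toy_padded)
```

Because the `Tokenizer` assigns word indices starting at 1, index 0 is free to act as the padding value.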

With text preprocessing complete, we can map the 20 Newsgroups vocabulary words to their GloVe embedding vectors to build an embedding matrix. This matrix will be loaded into an Embedding layer of the neural net.

In [None]:
num_words = min(MAX_NUM_WORDS, len(word_index)) + 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

print('Number of words:', num_words)
print('Shape of embeddings:', embedding_matrix.shape)
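
As a rough illustration of what an Embedding layer initialized with this matrix computes, the lookup amounts to selecting rows of the matrix by token index. The toy matrix and sequence below are made up for the example and are not the notebook's data.

```python
import numpy as np

# toy embedding matrix: 4 token indices, 3 dimensions each
toy_matrix = np.array([
    [0.0, 0.0, 0.0],  # index 0: padding, all zeros
    [0.1, 0.2, 0.3],  # index 1
    [0.0, 0.0, 0.0],  # index 2: word absent from the embeddings, left as zeros
    [0.7, 0.8, 0.9],  # index 3
])

# a padded sequence of token indices
toy_sequence = np.array([0, 0, 3, 1])

# integer-array indexing returns one embedding row per token
toy_embedded = toy_matrix[toy_sequence]
print(toy_embedded.shape)
```

Each sample of length `MAX_SEQUENCE_LENGTH` is turned into a `MAX_SEQUENCE_LENGTH` by `EMBEDDING_DIM` array in the same way before it reaches the convolutional layers.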

Now both the data and the embeddings are saved to file in preparation for training.

Note that we will not load the original, unprocessed set of embeddings into the training container. Instead, to reduce loading time, we save only the embedding matrix, which at 16MB is much smaller than the original set of embeddings at 892MB. Depending on how large a set of embeddings you need for other use cases, you might save further space by saving the embeddings with joblib (more efficient than the original Python pickle), by saving them in half precision (fp16) and restoring them to full precision after they are loaded, or both.

In [None]:
data_dir = os.path.join(os.getcwd(), 'data')
os.makedirs(data_dir, exist_ok=True)

train_dir = os.path.join(os.getcwd(), 'data/train')
os.makedirs(train_dir, exist_ok=True)

val_dir = os.path.join(os.getcwd(), 'data/val')
os.makedirs(val_dir, exist_ok=True)

embedding_dir = os.path.join(os.getcwd(), 'data/embedding')
os.makedirs(embedding_dir, exist_ok=True)

np.save(os.path.join(train_dir, 'x_train.npy'), x_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)
np.save(os.path.join(val_dir, 'x_val.npy'), x_val)
np.save(os.path.join(val_dir, 'y_val.npy'), y_val)
np.save(os.path.join(embedding_dir, 'embedding.npy'), embedding_matrix)
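
As a sketch of the half-precision idea mentioned above, the snippet below saves a matrix as fp16 and restores it to fp32 after loading. It uses a small random stand-in rather than the real `embedding_matrix`, and the file names are assumptions chosen for the example. The cast is lossy, so it is worth checking that the rounding error is acceptable for your embeddings.

```python
import os
import tempfile

import numpy as np

# random stand-in for an embedding matrix (not the notebook's data)
rng = np.random.default_rng(0)
demo_matrix = rng.uniform(-1.0, 1.0, size=(1000, 100)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    fp32_path = os.path.join(tmp_dir, 'embedding_fp32.npy')
    fp16_path = os.path.join(tmp_dir, 'embedding_fp16.npy')

    # save once at full precision and once cast down to half precision
    np.save(fp32_path, demo_matrix)
    np.save(fp16_path, demo_matrix.astype(np.float16))

    fp32_bytes = os.path.getsize(fp32_path)
    fp16_bytes = os.path.getsize(fp16_path)

    # after loading, cast back up to full precision for training
    restored = np.load(fp16_path).astype(np.float32)

max_abs_error = float(np.abs(restored - demo_matrix).max())
print('fp32 file size in bytes:', fp32_bytes)
print('fp16 file size in bytes:', fp16_bytes)
print('max absolute rounding error:', max_abs_error)
```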

# SageMaker Hosted Training

Now that we've prepared our embedding matrix, we can move on to use SageMaker's hosted training functionality. SageMaker hosted training is preferred for doing actual training in place of local notebook prototyping, especially for large-scale, distributed training. Before starting hosted training, the data must be uploaded to S3. The word embedding matrix also will be uploaded. We'll do that now, and confirm the upload was successful.

In [None]:
import sagemaker

s3_prefix = 'tf-20-newsgroups'

traindata_s3_prefix = '{}/data/train'.format(s3_prefix)
valdata_s3_prefix = '{}/data/val'.format(s3_prefix)
embeddingdata_s3_prefix = '{}/data/embedding'.format(s3_prefix)

# reuse one session for all three uploads
sess = sagemaker.Session()
train_s3 = sess.upload_data(path='./data/train/', key_prefix=traindata_s3_prefix)
val_s3 = sess.upload_data(path='./data/val/', key_prefix=valdata_s3_prefix)
embedding_s3 = sess.upload_data(path='./data/embedding/', key_prefix=embeddingdata_s3_prefix)

# one input channel per S3 location
inputs = {'train': train_s3, 'val': val_s3, 'embedding': embedding_s3}
print(inputs)

We're now ready to set up an Estimator object for hosted training. Hyperparameters are passed in as a dictionary. For a model such as this one that takes word embeddings as an input, properties of the embeddings (for example, the vocabulary size and the embedding dimension) can be passed in with the dictionary, so the embedding layer is constructed from parameters rather than hardcoded values. This makes tuning easier because no code modifications are required.

In [None]:
import sagemaker
from sagemaker.tensorflow import TensorFlow

# path inside the training container where the script writes the model (assumed here)
model_dir = '/opt/ml/model'
train_instance_type = 'ml.p3.2xlarge'
hyperparameters = {'epochs': 20,
                   'batch_size': 128,
                   'num_words': num_words,
                   'word_index_len': len(word_index),
                   'labels_index_len': len(labels_index),
                   'embedding_dim': EMBEDDING_DIM,
                   'max_sequence_len': MAX_SEQUENCE_LENGTH
                   }

estimator = TensorFlow(entry_point='train.py',
                       source_dir='code',
                       model_dir=model_dir,
                       instance_type=train_instance_type,
                       instance_count=1,
                       hyperparameters=hyperparameters,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-20-newsgroups',
                       framework_version='2.1',
                       py_version='py3',
                       script_mode=True)

To start the training job, simply call the `fit` method of the `Estimator` object. The `inputs` parameter is the dictionary we created above, which defines three channels. Besides the usual channels for the training and validation datasets, there is a channel for the embedding matrix. This illustrates one aspect of the flexibility of SageMaker for setting up training jobs: in addition to data, you can pass in arbitrary files needed for training. 

In [None]:
estimator.fit(inputs)
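
The training script itself is not shown in this notebook. As a minimal sketch of how a Script Mode entry point could pick up the three channels and the hyperparameters, the parser below reads the `SM_CHANNEL_<NAME>` environment variables that SageMaker sets for each channel passed to `fit`, and accepts the hyperparameters as command-line arguments. The function name `parse_args` and the fallback defaults are assumptions for illustration; the actual `train.py` may be organized differently.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and channel directories for a training script."""
    parser = argparse.ArgumentParser()

    # hyperparameters arrive as command-line arguments, e.g. --epochs 20
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--batch_size', type=int, default=128)
    parser.add_argument('--num_words', type=int)
    parser.add_argument('--word_index_len', type=int)
    parser.add_argument('--labels_index_len', type=int)
    parser.add_argument('--embedding_dim', type=int)
    parser.add_argument('--max_sequence_len', type=int)
    parser.add_argument('--model_dir', type=str)

    # each channel name passed to fit() is exposed as SM_CHANNEL_<NAME>;
    # the fallbacks follow the container layout /opt/ml/input/data/<channel>
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    parser.add_argument('--val', type=str,
                        default=os.environ.get('SM_CHANNEL_VAL', '/opt/ml/input/data/val'))
    parser.add_argument('--embedding', type=str,
                        default=os.environ.get('SM_CHANNEL_EMBEDDING', '/opt/ml/input/data/embedding'))

    # tolerate any extra arguments that may be appended to the command line
    args, _ = parser.parse_known_args(argv)
    return args


demo_args = parse_args(['--epochs', '20', '--embedding_dim', '100'])
print(demo_args.epochs, demo_args.embedding_dim, demo_args.embedding)
```

Inside the script, the embedding matrix could then be loaded with `np.load(os.path.join(args.embedding, 'embedding.npy'))`, matching the file name saved earlier.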

# SageMaker Hosted Endpoint

If we wish to deploy the model to production, the next step is to create a SageMaker hosted endpoint. The endpoint will retrieve the TensorFlow SavedModel created during training and deploy it within a TensorFlow Serving container. All of this can be accomplished with one line of code: a call to the Estimator's `deploy` method.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

We can now compare the predictions generated by the endpoint with a sample of the validation data. The results are shown as integer labels from 0 to 19 corresponding to the 20 different newsgroups.

In [None]:
results = predictor.predict(x_val[:10])['predictions'] 

print('predictions: \t{}'.format(np.argmax(results, axis=1)))
print('target values: \t{}'.format(np.argmax(y_val[:10], axis=1)))
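
To read the predictions as newsgroup names rather than integers, the `labels_index` dictionary built during preprocessing can be inverted. The sketch below uses a small made-up label dictionary and made-up scores so that it runs on its own; with the notebook's variables you would substitute `labels_index` and `results`.

```python
import numpy as np

# stand-ins for the notebook's labels_index and the endpoint's results
toy_labels_index = {'alt.atheism': 0, 'comp.graphics': 1, 'sci.space': 2}
toy_results = [[0.1, 0.7, 0.2],
               [0.8, 0.1, 0.1],
               [0.2, 0.2, 0.6]]

# invert the mapping: numeric id -> label name
id_to_label = {label_id: name for name, label_id in toy_labels_index.items()}

# pick the highest-scoring class per sample, then look up its name
predicted_ids = np.argmax(toy_results, axis=1)
predicted_names = [id_to_label[int(i)] for i in predicted_ids]
print(predicted_names)
```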

When you're finished with your review of this notebook, you can delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
sagemaker.Session().delete_endpoint(predictor.endpoint_name)