# Build a Sentiment Analysis Model with scikit-learn

With Amazon SageMaker

source: https://www.twilio.com/blog/2017/12/sentiment-analysis-scikit-learn.html

First, import the necessary libraries:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import random
import boto3

Declare and initialize the S3 resource, then define constants for the file names and S3 object keys. Replace `your-bucket-name` with the name of the bucket you created for this workshop.

In [None]:
s3 = boto3.resource('s3')

BUCKET_NAME = 'your-bucket-name'
DATADIR='ServerlessAIWorkshop/SentimentAnalysis'
MODEL_FILE = 'sentiment_analysis_model.pkl'
VOCAB_FILE = 'vocabulary.pkl'

MODEL_FILE_KEY = DATADIR + '/' + MODEL_FILE
VOCAB_FILE_KEY = DATADIR + '/' + VOCAB_FILE

Declare two lists to hold the dataset. The tweets and their labels are stored separately, in matching order.

In [None]:
data = []
data_labels = []  # each label is either 'pos' or 'neg'

Read each dataset file line by line, appending every tweet and its label to the lists created above.

In [None]:
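# each line of the file is one positive tweet
with open("./pos_tweets.txt") as f:
    for line in f:
        data.append(line)
        data_labels.append('pos')

# each line of the file is one negative tweet
with open("./neg_tweets.txt") as f:
    for line in f:
        data.append(line)
        data_labels.append('neg')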
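
Lines read this way keep their trailing newline, and any blank line in a file becomes an empty sample. A minimal sketch of a loader that strips surrounding whitespace and skips blank lines is shown here. The helper name `load_labeled_lines` is hypothetical and not part of the original workshop, and the UTF-8 encoding is an assumption about the dataset files.

```python
def load_labeled_lines(path, label, encoding="utf-8"):
    """Return (texts, labels) for every non-blank line in the file at `path`."""
    texts = []
    with open(path, encoding=encoding) as f:
        for line in f:
            text = line.strip()  # drop the trailing newline and surrounding spaces
            if text:             # skip blank lines
                texts.append(text)
    # one copy of the label per tweet, so both lists stay aligned
    return texts, [label] * len(texts)
```

Calling `load_labeled_lines("./pos_tweets.txt", "pos")` would return the positive tweets and a list of `'pos'` labels of the same length, which could then be appended to `data` and `data_labels` with `extend`.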

Vectorize the tweets: count how often each word occurs in each tweet, then convert the sparse result to a dense two-dimensional array of word counts. This dense array is the input to the split step that follows.

In [None]:
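vectorizer = CountVectorizer(
    analyzer='word',   # build features from whole words (stop words are not removed)
    lowercase=False,   # keep the original casing of each token
)
features = vectorizer.fit_transform(data)  # sparse matrix: one row per tweet
features_nd = features.toarray()           # dense 2-D array of word counts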
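
To make the vectorizer's output concrete, here is a small sketch on a made-up two-tweet corpus. The toy data and the `toy_` names are illustrative assumptions, not part of the workshop dataset. Each row of the array is one tweet and each column is one vocabulary word.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_tweets = ["good good day", "bad day"]

toy_vectorizer = CountVectorizer(analyzer="word", lowercase=False)
toy_counts = toy_vectorizer.fit_transform(toy_tweets).toarray()

# vocabulary_ maps each word to its column index (columns are sorted alphabetically)
print(sorted(toy_vectorizer.vocabulary_.items()))
# one row per tweet, one column per word, in the order: bad, day, good
print(toy_counts)
```

The first tweet contains "good" twice, so its row holds a 2 in the "good" column and a 0 in the "bad" column.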

Split the data into a training set and an evaluation set. The variable names follow this convention:
* X = features (the word-count array)
* y = labels
* _train = training subset
* _test = evaluation (validation) subset

In [None]:
train_size = 0.8
test_size = 1.0 - train_size
seed = 7
X_train, X_test, y_train, y_test = train_test_split(
 features_nd, 
 data_labels,
 train_size=train_size,
 test_size=test_size,
 random_state=seed)
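
The effect of `train_size` and `random_state` can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not scikit-learn's actual implementation: it shuffles indices with a seeded generator and cuts them at the 80% mark. The function name `simple_split` and the toy data are hypothetical.

```python
import random

def simple_split(samples, labels, train_size=0.8, seed=7):
    """Shuffle samples and labels together, then cut at the train_size fraction."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle on every run
    cut = int(len(indices) * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]
    return ([samples[i] for i in train_idx], [samples[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

toy_samples = [f"tweet {n}" for n in range(10)]
toy_labels = ["pos" if n % 2 == 0 else "neg" for n in range(10)]
X_tr, X_te, y_tr, y_te = simple_split(toy_samples, toy_labels)
print(len(X_tr), len(X_te))
```

Because the generator is seeded, repeated runs produce the same partition, which is the role `random_state=seed` plays in the cell above.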

Build a classifier using scikit-learn's logistic regression.

In [None]:
log_model = LogisticRegression() # use the LogisticRegression class from scikit-learn
log_model = log_model.fit(X_train, y_train) # train the model on the training dataset

Run inference on the test (validation) dataset. `sklearn.metrics.accuracy_score` returns the fraction of tweets that are classified correctly, as a value between 0 and 1.

In [None]:
y_pred = log_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
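
As a sanity check on what the printed number means, accuracy can be computed by hand as the count of matching predictions divided by the total. The labels below are made up for illustration.

```python
toy_true = ["pos", "neg", "pos", "neg"]
toy_pred = ["pos", "neg", "neg", "neg"]

# count the positions where the prediction equals the true label
correct = sum(t == p for t, p in zip(toy_true, toy_pred))
toy_accuracy = correct / len(toy_true)
print(toy_accuracy)  # 3 of 4 correct -> 0.75
```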

Use the joblib module to write the in-memory trained model and the vectorizer's vocabulary to .pkl files on the notebook instance, then upload those artifacts to your S3 bucket.

In [None]:
joblib.dump(log_model, MODEL_FILE) # Save the trained logistic regression model for inference
joblib.dump(vectorizer.vocabulary_, VOCAB_FILE) # Save the vocabulary for inference

# upload the trained model pickle file and the vocabulary to the S3 bucket
s3.meta.client.upload_file(MODEL_FILE, BUCKET_NAME, MODEL_FILE_KEY)
s3.meta.client.upload_file(VOCAB_FILE, BUCKET_NAME, VOCAB_FILE_KEY)
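
At inference time the two saved artifacts work together: the vocabulary rebuilds a vectorizer with the same column layout, and the model classifies the resulting counts. The following is a minimal sketch under a few assumptions: it trains a throwaway model on made-up tweets so that it runs on its own, it uses a temporary directory in place of S3, and it assumes the standalone `joblib` package is installed. The `demo_` names are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Train a tiny stand-in model (made-up data, for illustration only).
demo_tweets = ["love this great day", "great happy love",
               "hate this awful day", "awful sad hate"]
demo_labels = ["pos", "pos", "neg", "neg"]
demo_vectorizer = CountVectorizer(analyzer="word", lowercase=False)
demo_model = LogisticRegression().fit(
    demo_vectorizer.fit_transform(demo_tweets), demo_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_analysis_model.pkl")
    vocab_path = os.path.join(tmp_dir, "vocabulary.pkl")
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_vectorizer.vocabulary_, vocab_path)

    # Inference side: load both artifacts back from disk.
    loaded_model = joblib.load(model_path)
    loaded_vocab = joblib.load(vocab_path)

# A vectorizer built with a fixed vocabulary can transform text without fitting.
inference_vectorizer = CountVectorizer(
    analyzer="word", lowercase=False, vocabulary=loaded_vocab)
new_counts = inference_vectorizer.transform(["what a great day"])
prediction = loaded_model.predict(new_counts)[0]
print(prediction)
```

Only the vocabulary is saved, not the vectorizer's settings, so the inference-side vectorizer has to repeat the same options used in training (here `analyzer` and `lowercase=False`). Words that are not in the saved vocabulary are ignored.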