# Sentiment Analysis

After the video was uploaded, Rekognition analyzed it and saved the output to a JSON file.

Using Python libraries common in data analysis, we can extract insights from this JSON file and see when each sentiment was detected during the video.

## Libraries setup and import

We need to ensure we have all the required libraries before getting started.

Let's update pip, install some useful libraries from PyPI, and import them:

In [None]:
!pip install -U pip simplejson seaborn > /dev/null

In [None]:
import pandas as pd
import json
import numpy as np
import matplotlib.pyplot as plt
import simplejson
import seaborn as sns

We need some metadata to find where the video and JSON files are located:

In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as fh:
    metadata = json.load(fh)
# The account ID is the fifth colon-separated field of the resource ARN
accountid = metadata['ResourceArn'].split(':')[4]
# Include the hyphen so the Python variable matches the name exported to the shell
bucket_name = 'sentiment-analysis-' + accountid
print(bucket_name)

%set_env accountid={accountid}
%set_env bucket_name={bucket_name}
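
The cell above relies on the layout of an ARN, whose fields are separated by colons. As a minimal sketch of that parsing step, the snippet below splits a made-up ARN; the account ID and resource name are placeholders, not values from this project.

```python
# Hypothetical ARN used only for illustration; the account ID is made up.
example_arn = 'arn:aws:sagemaker:us-east-1:123456789012:notebook-instance/my-notebook'

# ARN layout: arn:partition:service:region:account-id:resource
fields = example_arn.split(':')
example_account = fields[4]  # fifth field holds the account ID
example_bucket = f'sentiment-analysis-{example_account}'

print(example_account)
print(example_bucket)
```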

In [None]:
%%bash
mkdir -p analyzed
cd analyzed
aws s3 cp s3://$bucket_name/analyzed_videos/ . --recursive
ls -lah

Set the variable `video_file` to match the file you've uploaded **without the file extension.**

#### Example:
 
If you uploaded a file named `My_Video.mp4`, set the variable to `My_Video`.


In [None]:
video_file = 'CHANGE-HERE' # CHANGE THIS!!!

with open(f'analyzed/{video_file}.json', 'r') as myfile:
    data = myfile.read()  # raw JSON text
content = json.loads(data)  # parsed Python object

## Creating the dataset

If everything worked, the variable `content` now contains the sentiment data for the video. This data is compatible with **Pandas**, so we are going to load it into a DataFrame, a structure that is easy to analyze.

In [None]:
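from io import StringIO

# Wrap the JSON text in StringIO: recent pandas versions deprecate passing a literal JSON string
df = pd.read_json(StringIO(data))
df.head()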
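
To make the loading step concrete, here is a minimal sketch that reads a tiny, made-up JSON document into a DataFrame. The record layout (a list of objects with a `Timestamp` plus one score per emotion) is an assumption for illustration; the real file may be organized differently and contain more fields.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with two records; all values are made up.
sample_json = '''[
  {"Timestamp": 0, "HAPPY": 80.5, "CALM": 10.0, "SAD": 1.5},
  {"Timestamp": 1, "HAPPY": 20.0, "CALM": 70.0, "SAD": 5.0}
]'''

# Each JSON object becomes one row; each key becomes one column.
sample_df = pd.read_json(StringIO(sample_json))
print(sample_df.shape)
print(sorted(sample_df.columns))
```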

In [None]:
df[['SURPRISED', 'HAPPY', 'CALM', 'CONFUSED','SAD', 'FEAR', 'ANGRY', 'DISGUSTED']].max()

In [None]:
df.columns

## Data Visualization

Having the dataset is only the first step toward insights. Tables are not human-friendly when they have many rows and columns. Data visualization (dataviz) is a more efficient way to spot patterns and behaviors.

First, let's define some default configurations for new figures:

In [None]:
plt.rcParams['figure.figsize'] = (20, 8)
sns.set()

#### Timeline

As our first graph, we will investigate how the sentiments behaved in the timeline.

In [None]:
plt.figure(figsize=[20, 8])
selected = ['SURPRISED', 'HAPPY', 'CALM', 'CONFUSED', 'SAD', 'FEAR', 'ANGRY', 'DISGUSTED']
for sentiment in selected:
    plt.plot(df[sentiment], label=sentiment, linewidth=5, alpha=0.8)
plt.legend()
plt.title('Emotions Timeline')
plt.xlabel('Seconds')
plt.ylabel('Points')
plt.show()

It is possible to see the sentiments rising and falling during the video, but the graph is a little confusing. Our first change will be to remove the minor sentiments and show only the predominant sentiment at each moment.

We create a dataframe that uses the Timestamp as its index and, as its value, the name of the column with the highest score in each row.

In [None]:
tmp_df = df[['Timestamp'] + selected].set_index('Timestamp').idxmax(axis=1)
tmp_df = pd.DataFrame(tmp_df)
tmp_df.columns = ['Sentiment']
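# Sort the labels so each sentiment gets the same index on every run (set order is not stable)
labels = sorted(set(tmp_df['Sentiment']))
tmp_df['IDXSentiment'] = tmp_df['Sentiment'].apply(labels.index)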
tmp_df.sample(10)

Now, for each timestamp, we have the name of the predominant sentiment and an index for this sentiment.
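
The same transformation can be followed on a toy table. The sketch below uses made-up scores for three timestamps and three emotions; it sorts the labels before numbering them, which is a choice of this sketch so that the indices are reproducible.

```python
import pandas as pd

# Toy scores (made up) for three timestamps and three emotions.
toy = pd.DataFrame({
    'Timestamp': [0, 1, 2],
    'HAPPY': [80.0, 20.0, 5.0],
    'CALM': [10.0, 70.0, 15.0],
    'SAD': [1.0, 5.0, 60.0],
})

# idxmax(axis=1) returns, for each row, the column label holding the largest value.
toy_top = toy.set_index('Timestamp').idxmax(axis=1)

# Map each label to its position in the sorted list of distinct labels.
toy_labels = sorted(toy_top.unique())
toy_idx = toy_top.map(toy_labels.index)

print(toy_top.tolist())  # ['HAPPY', 'CALM', 'SAD']
print(toy_idx.tolist())  # [1, 0, 2]
```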

In [None]:
plt.scatter(tmp_df.index, tmp_df['IDXSentiment'], c=tmp_df['IDXSentiment'], s=250)
plt.yticks(tmp_df['IDXSentiment'], tmp_df['Sentiment'])
plt.title('Emotions Timeline - Predominant Sentiment')
plt.xlabel('Seconds')
plt.ylabel('Sentiment')
plt.show()

There are other ways of using color to show how behavior changes over time.

Let's use a one-hot encoding approach to get numeric data from the classes:

In [None]:
# Request integer columns explicitly; newer pandas versions return booleans by default
onehot_df = pd.get_dummies(tmp_df['Sentiment'], dtype=int)
onehot_df = onehot_df.reset_index()
onehot_df['Sentiment'] = tmp_df['Sentiment'].values
onehot_df['IDXSentiment'] = tmp_df['IDXSentiment'].values
onehot_df.head()
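
As a small illustration of one-hot encoding, the sketch below encodes a made-up sequence of labels. Each distinct label becomes a column, and every row has exactly one 1 marking the label that was active.

```python
import pandas as pd

# Made-up sequence of predominant sentiments, one per timestamp.
toy_sentiment = pd.Series(['HAPPY', 'CALM', 'CALM', 'SAD'], name='Sentiment')

# One column per distinct label, in alphabetical order; dtype=int yields 0/1 values.
toy_onehot = pd.get_dummies(toy_sentiment, dtype=int)
print(toy_onehot)
```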

And now we plot the stacked areas using the numeric data from the one-hot encoded dataset:

In [None]:
y = onehot_df.drop(['Timestamp', 'IDXSentiment', 'Sentiment'], axis=1)
# Recent pandas versions no longer accept sort_columns, so sort the columns explicitly
y.sort_index(axis=1).plot(kind='area', stacked=True)
plt.title('Emotions Timeline - Predominant Sentiment II')
plt.show()

**That is it!** Now we are able to see and track the sentiment changes during the video.

With these datasets, it is possible to get even more statistics about the videos.

Let's check the distributions:

In [None]:
sns.boxplot(x='Sentiment', y='Timestamp', data=tmp_df.reset_index())
plt.show()
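
As a numeric complement to the plots, one possible statistic is the share of sampled timestamps in which each sentiment was predominant. The sketch below computes it on made-up labels that stand in for `tmp_df['Sentiment']`.

```python
import pandas as pd

# Made-up predominant sentiments, standing in for tmp_df['Sentiment'].
toy_predominant = pd.Series(['HAPPY', 'CALM', 'CALM', 'SAD', 'CALM'])

# normalize=True converts counts into fractions of all sampled timestamps.
share = toy_predominant.value_counts(normalize=True)
print(share)
```

The same call on `tmp_df['Sentiment']` would give the fraction of samples, not of elapsed time, unless the timestamps are evenly spaced.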

We now have enough visual support to make data-driven decisions.

## Final Thoughts

We've seen a little bit about exploring and visualizing our data.

To learn more about this journey, check the documentation of the projects Pandas (https://pandas.pydata.org/pandas-docs/stable/index.html), Seaborn (https://seaborn.pydata.org/examples/index.html), and Matplotlib (https://matplotlib.org/contents.html).

Bye!