{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sentiment Analysis\n", "\n", "After the video was uploaded, it was analyzed by Rekognition and the output has been saved to a json file.\n", "\n", "Using Python libraries common to data analysis, it is possible to get insights from this json file and understand when the sentiments were detected during the video." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Libraries setup and import\n", "\n", "We need to ensure we have all the required libraries before getting started.\n", "\n", "Let's update pip, install some useful libs from pypy, and import them:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U pip simplejson seaborn > /dev/null" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import simplejson\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to obtain some metadata to find where the video and json files are located:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('/opt/ml/metadata/resource-metadata.json') as fh:\n", " metadata = json.loads(fh.read())\n", "accountid = metadata['ResourceArn'].split(':')[4]\n", "bucket_name = 'sentiment-analysis' + accountid\n", "print(bucket_name)\n", "\n", "%set_env accountid={accountid}\n", "%set_env bucket_name=sentiment-analysis-{accountid}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%bash\n", "mkdir -p analyzed\n", "cd analyzed\n", "aws s3 cp s3://$bucket_name/analyzed_videos/ . --recursive\n", "ls -lah" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the variable `video_file` to match the file you've uploaded **without the file extension.**\n", "\n", "#### Example:\n", " \n", "if you uploaded a file named `My_Video.mp4`, set the variable as `My_Video`\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "video_file = 'CHANGE-HERE' # CHANGE THIS!!!\n", "\n", "with open(f'analyzed/{video_file}.json', 'r') as myfile:\n", " data = myfile.read()\n", "content = json.loads(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the dataset\n", "\n", "If everything worked, now the variable `content` contains data about the video sentiments. This is compatible with **Pandas**, and we are going to create a dataframe, which is an easy to analyze structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "df = pd.read_json(data)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[['SURPRISED', 'HAPPY', 'CALM', 'CONFUSED','SAD', 'FEAR', 'ANGRY', 'DISGUSTED']].max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Visualization\n", "\n", "Having the datasets is only the first part of the insights process. Tables are not human-friendly when you have a lot of different lines and columns. Using data visualization techniques (dataviz) is a more efficient way of understanding patterns, behaviors, etc.\n", "\n", "First, let's define some default configurations for new figures:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.rcParams['figure.figsize'] = (20, 8)\n", "sns.set()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Timeline\n", "\n", "As our first graph, we will investigate how the sentiments behaved in the timeline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=[20, 8])\n", "selected = ['SURPRISED', 'HAPPY', 'CALM', 'CONFUSED','SAD', 'FEAR', 'ANGRY', 'DISGUSTED']\n", "for sentiment in selected:\n", " plt.plot(df[sentiment], label=sentiment, linewidth=5, alpha=0.8)\n", "plt.legend()\n", "plt.title('Emotions Timeline')\n", "plt.xlabel('Seconds')\n", "plt.ylabel('Points')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is possible to see the sentiments rising and lowering during the video, but the graph is a little bit confusing. Our first change will be remove the minor sentiments, visualizing only the most important sentiment at that very moment.\n", "\n", "We create a dataframe using the Timestamp as our index and the column name with the max value of each line as the value to be printed" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmp_df = df[['Timestamp'] + selected].set_index('Timestamp').idxmax(axis=1)\n", "tmp_df = pd.DataFrame(tmp_df)\n", "tmp_df.columns = ['Sentiment']\n", "tmp_df['IDXSentiment'] = tmp_df['Sentiment'].apply(list(set(tmp_df['Sentiment'])).index)\n", "tmp_df.sample(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, for each timestamp, we have the name of the predominant sentiment and an index for this sentiment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(tmp_df.index, tmp_df['IDXSentiment'], c=tmp_df['IDXSentiment'], s=250)\n", "plt.yticks(tmp_df['IDXSentiment'], tmp_df['Sentiment'])\n", "plt.title('Emotions Timeline - Predominant Sentiment')\n", "plt.xlabel('Seconds')\n", "plt.ylabel('Sentiment')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are other ways of using colors to match behaviors during time.\n", "\n", "Let's use an One Hot Encoding approach to get numeric data from classes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "onehot_df = pd.get_dummies(tmp_df['Sentiment'])\n", "onehot_df = onehot_df.reset_index()\n", "onehot_df['Sentiment'] = tmp_df['Sentiment'].values\n", "onehot_df['IDXSentiment'] = tmp_df['IDXSentiment'].values\n", "onehot_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now, we plot the stacked areas using the numerical data from the one hot encoded dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = onehot_df.drop(['Timestamp', 'IDXSentiment', 'Sentiment'], axis=1)\n", "y.plot(kind='area', stacked=True, sort_columns=True)\n", "plt.title('Emotions Timeline - Predominant Sentiment II')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**That is it!** Now we are able to see and track the sentiment changes during the video.\n", "\n", "With these datasets, it is possible to get even more statistics about the videos.\n", "\n", "Let's check the distributions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.boxplot(x='Sentiment', y='Timestamp', data=tmp_df.reset_index())\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have enough visual support to take our data driven decisions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Final Thoughts\n", "\n", "We've seena little bit about exploring and visualizing our data.\n", "\n", "To learn more about this journey, check the documentation of the projects Pandas (https://pandas.pydata.org/pandas-docs/stable/index.html), Seaborn (https://seaborn.pydata.org/examples/index.html), and Matplotlib (https://matplotlib.org/contents.html).\n", "\n", "Bye!" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }