{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using AWS Lambda and PyWren for Web Scraping and Sentiment Analysis\n",
"\n",
"This notebook is a demonstration of how to perform a web scrape, determine frequent words and build tag clouds massively parallel with AWS Lambda and PyWren. We will be using the [GDELT](https://aws.amazon.com/public-datasets/gdelt/) public dataset to determine which news articles to scrape, we then use NLTK to stem and tokenize words, perform a sentiment analysis and count frequent words. Lastly we will create tag clouds of frequent words and make them available in an Amazon S3 bucket.\n",
"\n",
"### Credits\n",
"- [NLTK](http://www.nltk.org/) - NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.\n",
"- [PyWren](https://github.com/pywren/pywren) - Project by BCCI and riselab. Makes it easy to executive massive parallel map queries across [AWS Lambda](https://aws.amazon.com/lambda/)\n",
"- [Wordcloud](https://github.com/amueller/word_cloud) - A little word cloud generator in Python created by [Andres Mueller](https://github.com/amueller)\n",
"- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) - Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step by Step instructions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup Logging (optional)\n",
"Only activate the below lines if you want to see all debug messages from PyWren. _Note: The output will be rather chatty and lengthy._"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import logging\n",
"logger = logging.getLogger()\n",
"logger.setLevel(logging.INFO)\n",
"%env PYWREN_LOGLEVEL=INFO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's setup all the necessary libraries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import boto3, botocore\n",
"from IPython.display import HTML, display, Image, IFrame\n",
"import pywren\n",
"import GDELT_scrape\n",
"\n",
"wrenexec = pywren.default_executor()\n",
"\n",
"def split_list(alist, wanted_parts=1):\n",
" length = len(alist)\n",
" return [ alist[i*length // wanted_parts: (i+1)*length // wanted_parts] \n",
" for i in range(wanted_parts) ]"
]
},
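{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `split_list` helper above simply divides a list into a number of roughly equal chunks, so that several links can later be handed to a single Lambda invocation. A quick illustration of what it returns:\n",
"```python\n",
"split_list(list(range(10)), wanted_parts=3)\n",
"# [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]\n",
"```"
]
},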
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**IMPORTANT** - We need to update the S3 Bucket variable with the Amazon S3 bucket that has been created with the AWS Cloudformation template earlier. Please update the following variable with your bucketname that you copied out of the Output tab.\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"S3BUCKET = 'pywren-workshop-s3bucket-12apl9us09h1d'"
]
},
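{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the original lab steps), you can verify that the bucket name is correct before continuing. boto3's `head_bucket` call raises a `ClientError` if the bucket does not exist or is not accessible with your credentials:\n",
"```python\n",
"# Optional check: fails fast on a typo in S3BUCKET\n",
"boto3.client('s3').head_bucket(Bucket=S3BUCKET)\n",
"```"
]
},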
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's use PyWren to collect a number of GDELT listed articles by polling the GDELT data from the GDELT AWS Dataset from Amazon S3. Whilst PyWren is processing you can watch the Cloudwatch Logs for output from the function - navigate to the [Cloudwatch Logs Console](https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/lambda/pywren_1;start=PT5M).\n",
"\n",
"If you want to understand the exact details, explore [GDELT_scrape.py](/edit/Lab-3-Scrape-Sentiment/GDELT_scrape.py). Here are the relevant code snippets. We basically open up the requested CSV file in Amazon S3 in the AWS Lambda function, so we can benefit from the fast connectivity between AWS Lambda and Amazon S3. We retrieve the relevant Source URLs and return them to our Notebook. We limit this to a 1000 links for this workshop.\n",
"```python\n",
"s3_object = s3.get_object(Bucket='gdelt-open-data', Key='events/' + file)\n",
"f = StringIO.StringIO(s3_object['Body'].read().decode('utf-8','replace').encode('ascii','replace'))\n",
"...\n",
"items = csv.DictReader(f, fieldnames, delimiter='\\t')\n",
"...\n",
"for i, item in enumerate(items):\n",
" links.append(item['SOURCEURL'])\n",
"...\n",
"return links_without_duplicates[:1000]\n",
"```\n",
"\n",
"\n"
]
},
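{
"cell_type": "markdown",
"metadata": {},
"source": [
"The de-duplication step is elided above; purely as an illustration, an order-preserving version could look like the sketch below (the actual implementation is in [GDELT_scrape.py](/edit/Lab-3-Scrape-Sentiment/GDELT_scrape.py)):\n",
"```python\n",
"# Illustrative sketch only - keep the first occurrence of each link\n",
"seen = set()\n",
"links_without_duplicates = []\n",
"for link in links:\n",
"    if link not in seen:\n",
"        seen.add(link)\n",
"        links_without_duplicates.append(link)\n",
"```"
]
},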
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"futures = wrenexec.map(GDELT_scrape.get_urls_from_gdelt_data, ['20171121.export.csv'], exclude_modules=[])\n",
"links = pywren.get_all_results(futures)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a look at some of the collected links:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(links[0][:10])\n",
"display(HTML('Total links: ' + str(len(links[0])) + ''))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a look at what some of those links look like:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"display(IFrame(links[0][6],1000,500))\n",
"display(IFrame(links[0][4],1000,500))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great, now let's web scrape all those links and perform a sentiment analysis over it using NLTK and [VADER](http://www.nltk.org/_modules/nltk/sentiment/vader.html). Let's send all the links to Pywren. Here's the relevant code snippet that we will send to Pywren, explore [GDELT_scrape.py](/edit/Lab-3-Scrape-Sentiment/GDELT_scrape.py) for more details.\n",
"\n",
"```python\n",
"html = urllib.urlopen(link).read()\n",
"soup = BeautifulSoup(html, \"html.parser\")\n",
"text = soup.get_text()\n",
"...\n",
"dynamo_tbl = boto3.resource('dynamodb', 'us-west-2').Table('pywren-workshop-gdelt-table')\n",
"sentiment = SentimentIntensityAnalyzer().polarity_scores(text)\n",
"record = {}\n",
"record['link'] = link\n",
"record['sentiment'] = str(sentiment['compound'])\n",
"words = []\n",
"for word, frequency in get_frequent_words(text):\n",
" words.append(word + ':' + str(frequency))\n",
"record['words'] = words\n",
"response = dynamo_tbl.put_item(Item=record)\n",
"```"
]
},
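{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a feel for what VADER produces before running it at scale, you can try it locally. `polarity_scores` returns a dict with `neg`, `neu`, `pos` and `compound` scores; the `compound` value (between -1 and 1) is what the Lambda function stores in DynamoDB. A minimal local sketch, assuming NLTK and its `vader_lexicon` data are installed in this notebook environment:\n",
"```python\n",
"# Minimal local sketch - requires the vader_lexicon corpus,\n",
"# e.g. via: import nltk; nltk.download('vader_lexicon')\n",
"from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
"SentimentIntensityAnalyzer().polarity_scores('PyWren makes massively parallel scraping remarkably easy!')\n",
"```"
]
},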
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# to optimize our function we divide all our links into 50 parts, so we send multiple links to the same Lambda function\n",
"futures = wrenexec.map(GDELT_scrape.news_analyzer, split_list(links[0], wanted_parts=50))\n",
"pywren.get_all_results(futures)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can see the result slowly landing in the DynamoDB table - have a look at the [DynamoDB Console](https://us-west-2.console.aws.amazon.com/dynamodb/home?region=us-west-2#tables:selected=pywren-workshop-gdelt-table) to see progress. Now let's load the DynamoDB table data into our local Jupyter notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"table = boto3.resource('dynamodb', 'us-west-2').Table('pywren-workshop-gdelt-table')\n",
"db_table = table.scan()"
]
},
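{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that a single `scan` call returns at most 1 MB of data per request. For the roughly one thousand small items in this lab that is usually sufficient, but if you scraped more links you would have to follow `LastEvaluatedKey` to page through the whole table. A minimal sketch of that pattern:\n",
"```python\n",
"# Minimal pagination sketch: keep scanning while DynamoDB reports more data\n",
"db_table = table.scan()\n",
"items = db_table['Items']\n",
"while 'LastEvaluatedKey' in db_table:\n",
"    db_table = table.scan(ExclusiveStartKey=db_table['LastEvaluatedKey'])\n",
"    items.extend(db_table['Items'])\n",
"db_table['Items'] = items\n",
"```"
]
},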
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to understand the sentiments across all the articles that we fetched. Let's plot the overall sentiment histogram across all articles:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"sentiments = []\n",
"for item in db_table['Items']:\n",
" sentiments.append(float(item['sentiment']))\n",
"\n",
"plt.hist(sentiments)\n",
"plt.title(\"Sentiment Histogram\")\n",
"plt.xlabel(\"Sentiment\")\n",
"plt.ylabel(\"Frequency\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unsurprisingly most news articles are generally either more negative or positive with very few neutral articles. Next let's try to build a graph around frequently mentioned words. This will be across all articles:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"all_words = {};\n",
"for item in db_table['Items']:\n",
" for word in item['words']:\n",
" try:\n",
" if word.split(':')[0] in all_words:\n",
" all_words[word.split(':')[0]] = all_words[word.split(':')[0]] + int(word.split(':')[1])\n",
" else:\n",
" all_words[word.split(':')[0]] = int(word.split(':')[1])\n",
" except ValueError:\n",
" pass\n",
"\n",
"top20 = dict(sorted(all_words.iteritems(), key=lambda (k, v): (-v, k))[:20])\n",
"\n",
"x = top20.keys()\n",
"frequency = top20.values()\n",
"x_pos = [i for i, _ in enumerate(x)]\n",
"plt.figure(figsize=(10, 5))\n",
"plt.barh(x_pos, frequency, color='green')\n",
"plt.ylabel(\"Words used\")\n",
"plt.xlabel(\"Frequency\")\n",
"plt.title(\"Word frequency across articles\")\n",
"plt.yticks(x_pos, x)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have some more fun with frequently used words. How about creating a [Tag Cloud](https://en.wikipedia.org/wiki/Tag_cloud) to identify frequently mentioned words in an article? Let's try this first locally with one article:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"from wordcloud import WordCloud\n",
"link = db_table['Items'][0]['link']\n",
"wordcloud = WordCloud(width = 500, height = 300).generate(GDELT_scrape.scrape_content(link))\n",
"cloud = wordcloud.to_array()\n",
"plt.figure(figsize=(10, 5), dpi=100)\n",
"plt.imshow(cloud)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great, how about creating such a tag cloud across all the articles? That's a perfect example for PyWren where we have a parallel mapping exercise and we can store the results straight in Amazon S3.\n",
"\n",
"To accomplish this we need to change our PyWren AWS Lambda function to include the necessary libraries such as wordcloud and matplotlib. Since some of those libraries are compiled C code, PyWren will not be able to pickle it up and send it to the Lambda function. Hence we will update the entire PyWren function to include the necessary binaries that have been compiled on an Amazon EC2 instance with Amazon Linux. We pre-packaged this and made it available via [https://s3-us-west-2.amazonaws.com/pywren-workshop/wordcloud.zip](https://s3-us-west-2.amazonaws.com/pywren-workshop/wordcloud.zip)\n",
"\n",
"You can simple push this code to your PyWren AWS Lambda function with below command, assuming you named the PyWren worker function with the default name pywren_1 and region us-west-2:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lambdaclient = boto3.client('lambda', 'us-west-2')\n",
"\n",
"response = lambdaclient.update_function_code(\n",
" FunctionName='pywren_1',\n",
" Publish=True,\n",
" S3Bucket='pywren-workshop',\n",
" S3Key='wordcloud.zip'\n",
")"
]
},
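{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to confirm that the function code was actually replaced (an optional check, not required for the lab), you can inspect the function's configuration - the `CodeSize` and `LastModified` fields should reflect the freshly deployed wordcloud.zip package:\n",
"```python\n",
"# Optional verification of the code update\n",
"config = lambdaclient.get_function_configuration(FunctionName='pywren_1')\n",
"display(config['CodeSize'], config['LastModified'])\n",
"```"
]
},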
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alright, now let's use PyWren to create tagclouds of all those links in parallel and store the output into Amazon S3. Here's the relevant code snippet:\n",
"```python\n",
"from wordcloud import WordCloud\n",
"s3 = boto3.resource('s3')\n",
"for link in links:\n",
" text = scrape_content(link)\n",
" wordcloud = WordCloud(width = 500, height = 300).generate(text)\n",
" cloud = wordcloud.to_file('/tmp/cloud.jpg')\n",
" s3.Object(S3BUCKET, 'tagclouds/' + hashlib.md5(link).hexdigest() + '.jpg').put(ACL='public-read',ContentType='image/jpeg',Body=open('/tmp/cloud.jpg', 'rb'))\n",
" return 'Ok'\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import cPickle as pickle\n",
"links = []\n",
"for item in db_table['Items']:\n",
" links.append(item['link'])\n",
"pickle.dump(links, open('links.pickle', 'wb')) \n",
"\n",
"!python wordcloud_generator.py write --bucket_name={S3BUCKET}"
]
},
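{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since every tag cloud is written with a public-read ACL under a key derived from the MD5 hash of its link, you can already preview a single image once it has been generated. A small sketch using the same hashing scheme as the snippet above (and assuming the bucket lives in us-west-2, as in this workshop):\n",
"```python\n",
"# Preview one tag cloud; the key mirrors the hashlib.md5(link).hexdigest() naming used by the Lambda function\n",
"import hashlib\n",
"link = links[0]\n",
"url = 'https://s3-us-west-2.amazonaws.com/' + S3BUCKET + '/tagclouds/' + hashlib.md5(link).hexdigest() + '.jpg'\n",
"display(Image(url=url))\n",
"```"
]
},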
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's try to create a small website with an overview of all these tag clouds:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import hashlib\n",
"\n",
"output = '