# Using AWS Lambda and PyWren to find keywords in the Common Crawl dataset

The [Common Crawl](https://aws.amazon.com/public-datasets/common-crawl/) corpus includes web crawl data collected over 8 years. Common Crawl offers the largest, most comprehensive, open repository of web crawl data on the cloud. In this notebook, we use AWS Lambda and PyWren to search this data and compare how often selected keywords appear on the web.

### Credits
- [PyWren](https://github.com/pywren/pywren) - Project by BCCI and RISELab. Makes it easy to execute massively parallel map queries across [AWS Lambda](https://aws.amazon.com/lambda/)
- [Warcio](https://github.com/webrecorder/warcio) - Streaming WARC/ARC library for fast web archive IO 
- [Common Crawl Foundation](http://commoncrawl.org/) - Builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

# Step by Step instructions

### Setup Logging (optional)
Only run the lines below if you want to see all debug messages from PyWren. _Note: The output will be rather chatty and lengthy._

In [None]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
%env PYWREN_LOGLEVEL=INFO

Let's set up all the necessary libraries:

In [None]:
import boto3, botocore, time
import numpy as np
from IPython.display import HTML, display, Image, IFrame
import matplotlib.pyplot as plt
import pywren
import warc_search

Next, we want to identify a few files from a recent crawl that we can send to PyWren for further analysis. The Common Crawl dataset is stored in an Amazon S3 bucket under several key naming schemes. More information can be found on the [Getting Started](http://commoncrawl.org/the-data/get-started/) page of Common Crawl. Let's first explore the folder structure by using the AWS CLI to list some folders in the Amazon S3 bucket:

In [None]:
!aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017
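
To make the key naming scheme a little more concrete, here is a minimal sketch that splits a key into its parts. It assumes the layout `crawl-data/<crawl>/segments/<segment>/<type>/<file>`, which is inferred from the prefix used later in this notebook. The helper `parse_crawl_key` and the file name in the example are hypothetical.

```python
from collections import namedtuple

CrawlKey = namedtuple("CrawlKey", ["crawl_id", "segment", "file_type", "file_name"])

def parse_crawl_key(key):
    """Split a key of the assumed form crawl-data/<crawl>/segments/<segment>/<type>/<file>."""
    parts = key.split("/")
    if len(parts) != 6 or parts[0] != "crawl-data" or parts[2] != "segments":
        raise ValueError("unexpected key layout: " + key)
    return CrawlKey(parts[1], parts[3], parts[4], parts[5])

# The file name below is a made-up placeholder, not a real object in the bucket.
example = parse_crawl_key(
    "crawl-data/CC-MAIN-2017-39/segments/1505818685129.23/wet/example.warc.wet.gz"
)
print(example.crawl_id, example.segment, example.file_type)
```

Keys that do not follow this layout raise a `ValueError`, so an unexpected prefix is noticed early rather than silently misparsed.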

Ok, let's drill into some more specific crawls now:

In [None]:
s3 = boto3.client('s3','us-east-1')
items = s3.list_objects(Bucket = 'commoncrawl', Prefix = 'crawl-data/CC-MAIN-2017-39/segments/1505818685129.23/wet/')
keys = items['Contents']
display(HTML('Number of WARC files available: <b>' + str(len(keys)) + '</b>'))

html = 'Sample links:'
html += '<table>'
for i in keys[:10]:
    html += '<tr>'
    html += '<td><a href="https://commoncrawl.s3.amazonaws.com/' + i['Key'] + '" target="_blank">' + i['Key'] + '</a></td>'
    # divide by floats so the size is not truncated to whole megabytes on Python 2
    html += '<td><b>' + str(round(i['Size'] / 1024.0 / 1024.0, 2)) + ' MB</b></td>'
    html += '</tr>'
html += '</table>'
display(HTML(html))
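
Note that a single `list_objects` call returns at most 1,000 keys. If a prefix holds more objects than that, one option is to use a boto3 paginator. The following is a minimal sketch, not part of the workshop code; the helper name `list_all_keys` is hypothetical, and the S3 client is passed in by the caller.

```python
def list_all_keys(s3_client, bucket, prefix):
    """Collect every object under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page without matches carries no "Contents" entry at all.
        found.extend(page.get("Contents", []))
    return found

# Possible usage with the client created above (requires AWS access):
# keys = list_all_keys(s3, 'commoncrawl',
#                      'crawl-data/CC-MAIN-2017-39/segments/1505818685129.23/wet/')
```

The returned entries have the same shape as `items['Contents']`, so the rest of the notebook can use them unchanged.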

Now let's use PyWren to run through various Web Archive (WARC) format files of recent crawls and look for specific keywords. Because each archive file is large (more than 100 MB), we can benefit from the proximity of AWS Lambda and Amazon S3 to achieve a faster processing speed.

If you want to understand the exact details, explore [warc_search.py](/edit/Lab-2-Common-Crawl/warc_search.py). Here are the relevant code snippets.
```python
dynamo_tbl = boto3.resource('dynamodb').Table('pywren-workshop-common-crawl')
resp = requests.get('https://commoncrawl.s3.amazonaws.com/' + key, stream = True)
for record in ArchiveIterator(resp.raw, arc2warc=True):
    if record.content_type == 'text/plain':
        webpage_text = record.content_stream().read()
        date = record.rec_headers.get_header('WARC-Date')
        for search_str in search_array:
            if re.search(search_str,webpage_text):
                result[search_str]['count'] += 1
for search_str in search_array:
    if result[search_str]['count'] > 0:
        record={}
        record['warc_file']=key
        record['search_str']=search_str
        record['occurrence']=result[search_str]['count']
        response=dynamo_tbl.put_item(Item=record)
```
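
To see the counting logic in isolation, here is a minimal sketch that runs on in-memory strings instead of WARC records. It assumes that the `KEYWORDS` value is a comma-separated list, which is a guess based on the keyword string used in this notebook. The helpers `parse_keywords` and `count_matching_pages` are hypothetical. Unlike the snippet above, the sketch escapes each keyword with `re.escape`, so keywords are matched literally rather than as regular expressions.

```python
import re

def parse_keywords(raw):
    """Turn 'Amazon, AWS' into ['Amazon', 'AWS'] (assumed format of KEYWORDS)."""
    return [word.strip() for word in raw.split(",") if word.strip()]

def count_matching_pages(pages, keywords):
    """Count, per keyword, how many pages contain it at least once."""
    patterns = {word: re.compile(re.escape(word)) for word in keywords}
    counts = dict.fromkeys(keywords, 0)
    for text in pages:
        for word, pattern in patterns.items():
            if pattern.search(text):
                counts[word] += 1
    return counts

# Made-up sample pages standing in for the text of WARC records.
sample_pages = [
    "Python on AWS Lambda",
    "Java and Python tutorials",
    "Nothing relevant here",
]
counts = count_matching_pages(sample_pages, parse_keywords("Amazon, AWS, Python, Java"))
print(counts)
```

As in the snippet, a page counts once per keyword no matter how often the keyword appears on it.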

Let's first run the search on a single WARC file (feel free to change the keywords to your liking):

In [None]:
keywords = 'Amazon, AWS, Python, Java'
wrenexec = pywren.default_executor()
future = wrenexec.call_async(warc_search.keyword_search, keys[0]['Key'], extra_env = {'KEYWORDS' : keywords})
display(HTML('Time to complete: <b>' + str(round(future.result(),2)) + '</b> seconds'))

Let's have a look at the DynamoDB console for the [pywren-workshop-common-crawl](https://us-west-2.console.aws.amazon.com/dynamodb/home?region=us-west-2#tables:selected=pywren-workshop-common-crawl) table and click on Items to find our results.

Now let's load this up into our local Jupyter notebook:

In [None]:
table = boto3.resource('dynamodb', 'us-west-2').Table('pywren-workshop-common-crawl')
db_table = table.scan(ProjectionExpression='search_str, occurrence')
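
A single `scan` call returns at most 1 MB of data and reports a `LastEvaluatedKey` when more items remain. For the small result table in this lab one call should be enough, but for larger tables a loop along the following lines could be used. This is a minimal sketch; the helper name `scan_all` is hypothetical.

```python
def scan_all(table, **scan_kwargs):
    """Scan a DynamoDB table resource and follow LastEvaluatedKey until done."""
    items = []
    while True:
        page = table.scan(**scan_kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Continue the scan where the previous page stopped.
        scan_kwargs["ExclusiveStartKey"] = last_key

# Possible usage with the table defined above (requires AWS access):
# all_items = scan_all(table, ProjectionExpression='search_str, occurrence')
```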

Time to plot this information out:

In [None]:
occurrences = {}
for item in db_table['Items']:
    # sum the occurrences per keyword across all scanned items
    term = item['search_str']
    occurrences[term] = occurrences.get(term, 0) + item['occurrence']
    
plt.figure(figsize=(10, 5))
plt.title("Word frequency across crawl data")
plt.xlabel("Words")
plt.ylabel("Frequency")
# wrap the dict views in lists so the calls also work on Python 3
plt.bar(range(len(occurrences)), list(occurrences.values()), align='center')
plt.xticks(range(len(occurrences)), list(occurrences.keys()))
plt.show()
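
The aggregation step can also be tried without a DynamoDB table. The sketch below uses made-up items in the shape that `table.scan()` returns. boto3 returns DynamoDB numbers as `Decimal`, so the sketch converts them to `int` before summing. The helper name `total_occurrences` is hypothetical.

```python
from collections import defaultdict
from decimal import Decimal

def total_occurrences(items):
    """Sum the 'occurrence' values per 'search_str' across all items."""
    totals = defaultdict(int)
    for item in items:
        totals[item["search_str"]] += int(item["occurrence"])
    return dict(totals)

# Hypothetical items for illustration, not real scan results.
sample_items = [
    {"search_str": "Python", "occurrence": Decimal(12)},
    {"search_str": "Java", "occurrence": Decimal(7)},
    {"search_str": "Python", "occurrence": Decimal(3)},
]
totals = total_occurrences(sample_items)
print(totals)
```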

Let's now run the PyWren function over the first 50 Common Crawl files in parallel across multiple AWS Lambda functions:

In [None]:
iterdata = []
for key in keys[:50]:
    iterdata.append(key['Key'])
    
keywords = 'Amazon, AWS, Python, Java'
wrenexec = pywren.default_executor()
# start the timer before submitting so the job time includes the invocation
t1 = time.time()
future = wrenexec.map(warc_search.keyword_search, iterdata, extra_env = {'KEYWORDS' : keywords})
pywren_results = pywren.get_all_results(future)
duration = time.time() - t1

Let's analyze how long it took us to process all these large files in Amazon S3. As you will see, the total duration is around a minute or less, while the aggregate computation time across all AWS Lambda functions easily exceeds 15 minutes. This is the power of parallel processing!

In [None]:
display(HTML('Total time for the job: <b>' + str(round(duration,2)) + '</b> seconds'))
display(HTML('Average time per Common Crawl file: <b>' + str(round(np.mean(pywren_results),2)) + '</b> seconds'))
display(HTML('Total aggregate process time across all AWS Lambda functions: <b>' + str(round(np.sum(pywren_results),2)) + '</b> seconds'))
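
To put the two numbers side by side, we can divide the aggregate compute time by the wall-clock time of the job. Here is a small sketch of that arithmetic. The timings are invented for illustration and are not measured results, and the helper `parallel_speedup` is hypothetical.

```python
def parallel_speedup(per_task_seconds, wall_clock_seconds):
    """Return the aggregate compute time and its ratio to the wall-clock time."""
    aggregate = sum(per_task_seconds)
    return aggregate, aggregate / wall_clock_seconds

# Hypothetical timings: 50 files at 20 seconds each, finished in 50 seconds overall.
task_times = [20.0] * 50
aggregate, speedup = parallel_speedup(task_times, 50.0)
print(aggregate, speedup)
```

With these made-up values, the aggregate is 1,000 seconds (almost 17 minutes) of compute time delivered in 50 seconds of waiting, a ratio of 20. In the notebook, the same calculation would use `pywren_results` and `duration`.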

Let's replot our information with the newly analyzed data points:

In [None]:
table = boto3.resource('dynamodb', 'us-west-2').Table('pywren-workshop-common-crawl')
db_table = table.scan(ProjectionExpression='warc_file, search_str, occurrence')

occurrences = {}
for item in db_table['Items']:
    # sum the occurrences per keyword across all scanned items
    term = item['search_str']
    occurrences[term] = occurrences.get(term, 0) + item['occurrence']
    
plt.figure(figsize=(10, 5))
plt.title("Word frequency across crawl data")
plt.xlabel("Words")
plt.ylabel("Frequency")
# wrap the dict views in lists so the calls also work on Python 3
plt.bar(range(len(occurrences)), list(occurrences.values()), align='center')
plt.xticks(range(len(occurrences)), list(occurrences.keys()))
plt.show()


Next we will use a different function with PyWren that returns the URL of each page where a keyword was found and sends that information straight back to this Jupyter notebook:

In [None]:
iterdata = []
for key in keys[:10]:
    iterdata.append(key['Key'])
    
keywords = 'Amazon, AWS, Python, Java'
wrenexec = pywren.default_executor()
future = wrenexec.map(warc_search.keyword_search_with_URL, iterdata, extra_env = {'KEYWORDS' : keywords})
url_results = pywren.get_all_results(future)

Let's now try to analyze the different top-level domains that we found and the corresponding keywords:

In [None]:
from datetime import datetime
import tldextract

data = dict()
for result in url_results:
    for key in result.keys():
        if key in data.keys():
            data[key].extend(result[key])
        else:
            data[key] = result[key]
    
# count how often each top-level domain appears per keyword
url_data = {}
for key in data.keys():
    url_data[key] = {}
    for item in data[key]:
        tld = tldextract.extract(item).suffix
        if tld in url_data[key].keys():
            url_data[key][tld] += 1
        else:
            url_data[key][tld] = 1

# render bar charts
for keyword in url_data.keys():
    # sort by descending count, then by name, and keep the 20 most frequent TLDs
    top20 = sorted(url_data[keyword].items(), key=lambda kv: (-kv[1], kv[0]))[:20]
    x = [tld for tld, _ in top20]
    frequency = [count for _, count in top20]
    x_pos = list(range(len(x)))
    plt.figure(figsize=(10, 5))
    plt.barh(x_pos, frequency, color='green')
    plt.ylabel("TLD")
    plt.xlabel("Frequency")
    plt.title("TLD keyword frequency for " + keyword)
    plt.yticks(x_pos, x)
    plt.show()
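
The counting and ranking steps can be tried on their own with a few made-up URLs. The sketch below is a simplified stand-in that uses only the Python 3 standard library. The helpers `naive_suffix`, `tld_counts`, and `top_n` are hypothetical. Note that `naive_suffix` only takes the last label of the host name, so for a host such as `example.co.uk` it yields `uk`, whereas `tldextract` is aware of the public suffix list and would report `co.uk`.

```python
from collections import Counter
from urllib.parse import urlparse

def naive_suffix(url):
    """Last dot-separated label of the host name (not public-suffix aware)."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]

def tld_counts(urls_by_keyword, suffix_of=naive_suffix):
    """Count suffixes per keyword for a dict of keyword -> list of URLs."""
    return {
        keyword: Counter(suffix_of(url) for url in urls)
        for keyword, urls in urls_by_keyword.items()
    }

def top_n(counter, n=20):
    """Most frequent entries first; ties are broken alphabetically."""
    return sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Made-up URLs for illustration, not results from the crawl.
sample = {
    "Python": [
        "https://example.com/a",
        "https://example.org/b",
        "https://docs.example.com/c",
    ]
}
ranked = top_n(tld_counts(sample)["Python"], n=2)
print(ranked)
```

Passing `suffix_of=lambda url: tldextract.extract(url).suffix` would swap in the extraction used in the notebook cell above.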

That's it. We managed to perform a keyword analysis across a large amount of web crawl data in a massively distributed manner and plotted the results back on our local machine.