{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using AWS Lambda and PyWren to find keywords in Common Crawl dataset\n", "\n", "The [Common Crawl](https://aws.amazon.com/public-datasets/common-crawl/) corpus includes web crawl data collected over 8 years. Common Crawl offers the largest, most comprehensive, open repository of web crawl data on the cloud. In this notebook, we are going to use the power of AWS Lambda and pywren to search and compare the popularity of items on the internet.\n", "\n", "### Credits\n", "- [PyWren](https://github.com/pywren/pywren) - Project by BCCI and riselab. Makes it easy to executive massive parallel map queries across [AWS Lambda](https://aws.amazon.com/lambda/)\n", "- [Warcio](https://github.com/webrecorder/warcio) - Streaming WARC/ARC library for fast web archive IO \n", "- [Common Crawl Foundation](http://commoncrawl.org/) - Builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step by Step instructions\n", "\n", "### Setup Logging (optional)\n", "Only activate the below lines if you want to see all debug messages from PyWren. _Note: The output will be rather chatty and lengthy._" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import logging\n", "logger = logging.getLogger()\n", "logger.setLevel(logging.INFO)\n", "%env PYWREN_LOGLEVEL=INFO" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's setup all the necessary libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3, botocore, time\n", "import numpy as np\n", "from IPython.display import HTML, display, Image, IFrame\n", "import matplotlib.pyplot as plt\n", "import pywren\n", "import warc_search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we want to identify certain recent crawls datapoints which we want to send to PyWren for further analysis. The Common Crawl dataset is split up into different key naming schemes in an Amazon S3 bucket. More information can be found on the [Getting Started](http://commoncrawl.org/the-data/get-started/) page of Common Crawl. Let's identify some of the folder structure first by using the AWS CLI to list some folders in the Amazon S3 bucket:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, let's drill into some more specific crawls now:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client('s3','us-east-1')\n", "items = s3.list_objects(Bucket = 'commoncrawl', Prefix = 'crawl-data/CC-MAIN-2017-39/segments/1505818685129.23/wet/')\n", "keys = items['Contents']\n", "display(HTML('Amount of WARC files available: ' + str(len(keys)) + ''))\n", "\n", "html = 'Sample links:'\n", "html += '
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next we want to identify some recent crawl data points which we want to send to PyWren for further analysis. The Common Crawl dataset is organized under different key naming schemes (prefixes) in an Amazon S3 bucket. More information can be found on the [Getting Started](http://commoncrawl.org/the-data/get-started/) page of Common Crawl. Let's explore the folder structure first by using the AWS CLI to list some of the prefixes in the Amazon S3 bucket:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2017" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's drill into one specific crawl now:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client('s3','us-east-1')\n", "items = s3.list_objects(Bucket = 'commoncrawl', Prefix = 'crawl-data/CC-MAIN-2017-39/segments/1505818685129.23/wet/')\n", "keys = items['Contents']\n", "display(HTML('<b>Number of WARC files available: ' + str(len(keys)) + '</b>'))\n", "\n", "html = 'Sample links:'\n", "html += '<br/>'\n", "for i in keys[0:5]:\n", "  html += '<br/>' + i['Key']\n", "  html += ' | ' + str(round(i['Size']/1024/1024,2)) + ' MB | '\n", "  html += '