{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Using S3 and Analytics\n", "\n", "In this notebook, you will use S3 to download an portion of Amazon product reviews from our [Registry of Open Data](https://registry.opendata.aws/), use [Pandas](https://pandas.pydata.org/) to clean, transform, and export the reviews to prepare the data to train a sentiment analysis algorithm. Finally, you will be introduced to AWS Glue and Athena to catalog the metadata of the dataset and query it serverlessly using Athena. There is an optional section at the end to bake the ability to query with Athena into Jupyter using [custom magics](https://ipython.readthedocs.io/en/stable/config/custommagics.html) capabilities with Jupyter.\n", "\n", "Update the variable **bucket** below with your initials in replace of the **{{FIXME}}** place holder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import botocore\n", "import json\n", "import time\n", "import os\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from IPython.display import display, Markdown\n", "\n", "glue = boto3.client('glue')\n", "s3 = boto3.client('s3')\n", "s3_resource = boto3.resource('s3')\n", "\n", "session = boto3.session.Session()\n", "region = session.region_name\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "\n", "bucket = 'escience-workshop-{{FIXME}}'\n", "amazon_review_bucket = 'amazon-reviews-pds'\n", "\n", "pd.set_option('display.max_colwidth', -1)\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)\n", "\n", "We will create an S3 bucket that will be used throughout the workshop for storing data.\n", "\n", "[s3.create_bucket](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_bucket(bucket):\n", " import logging\n", "\n", " try:\n", " s3.create_bucket(Bucket=bucket, CreateBucketConfiguration={'LocationConstraint': region})\n", " except botocore.exceptions.ClientError as e:\n", " logging.error(e)\n", " return 'Bucket {0} could not be created.'.format(bucket)\n", " return 'Created {0} bucket.'.format(bucket)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_bucket(bucket)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Download Amazon Product Reviews](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-download-file.html) \n", "\n", "We will download the reviews from the Amazon Product Review dataset available on the [AWS Open Data Registry](https://registry.opendata.aws/amazon-reviews/)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " s3_resource.Bucket(amazon_review_bucket).download_file('tsv/amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz', 'amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz')\n", "except botocore.exceptions.ClientError as e:\n", " if e.response['Error']['Code'] == \"404\":\n", " print(\"The object does not exist.\")\n", " else:\n", " raise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unzip Amazon Video Reviews\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!gzip -d amazon_reviews_us_Mobile_Electronics_v1_00.tsv.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -la" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View raw tsv file\n", "\n", "We will use [Pandas](https://pandas.pydata.org/) [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method to load the tsv into a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). The dataset contains a number of columns associated with each review. For this exercise you will only be concerned with the `review_body` and `star_rating` columns. The `star_rating` is between 1-5 and the `review` is a free form text field. You will then use the DataFrame's [`head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method to see the first few rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('amazon_reviews_us_Mobile_Electronics_v1_00.tsv', sep='\\t', error_bad_lines=False)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View Data types of the columns in the data set\n", "\n", "To see the data types of each column you can use the [`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) function. Click [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-dtypes) for more information on the basic dtypes supported." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create new DataFrame for only columns needed\n", "\n", "Below we will will keep only the `review_body` and `star_rating` fields for our dataset. YOu can use the [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) method to access just the columns and rows you need. You will also filter out the neutral reviews which we are signifying as reviews with a `3` star rating to make more definitive the positive vs. negative reviews." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.loc[:, ['review_body', 'star_rating']]\n", "df = df.query('star_rating != 3.0')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add new column `sentiment` based on `star_rating`\n", "\n", "Next, we will use the [apply]() method to apply a function to each row in the DataFrame adding a new column `sentiment`. This will make it easier to visualize the data." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_sentiment(value):\n", " if value > 3:\n", " return 1\n", " else:\n", " return 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['sentiment'] = df['star_rating'].apply(get_sentiment)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Cleaning\n", "There are a number of ways to clean the daatset you are working with. You will go through a few exercises looking at the data and perfoming actions based on the results to cleanse the data. The first thing you want to do is look at the [`unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) `star_rating` values. You will notice there are reviews with a `nan` rating so we will want to remove those using the [`dropna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method.\n", "\n", "We can visualize the counts to quickly see the breakout by `star_rating`. For that you will be using [seaborn](https://seaborn.pydata.org/).\n", "\n", "Next we want to **de-duplicate** the rows. It is necessary to remove duplicates in order to get unbaised results. Checking duplicates based on `review_body` and `star_rating`. If all values are equal then we will remove those records.\n", "\n", "At this time we can also transform the 'review_body' from it's regular form, with capital letters to the pure lowercase text. We can also see below we have ratings with a `nan` value meaning no `star_rating` was giving for the review. We will drop the `nan` reviews as well to clean up the data set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.star_rating.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.dropna(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize the review counts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df['star_rating'].value_counts())\n", "sns.countplot(x='star_rating', data=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Drop duplicates and set proper data types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import string\n", "df = df.drop_duplicates(subset={'review_body', 'star_rating'})\n", "\n", "df.dropna(inplace=True)\n", "df.loc[:, 'star_rating'] = df['star_rating'].astype(int)\n", "df.loc[:, 'review_body'] = df['review_body'].astype(str).str.lower()\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df['sentiment'].value_counts())\n", "sns.countplot(x='sentiment', data=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python [Wordcloud](http://amueller.github.io/word_cloud/) Visualization\n", "\n", "Another visualization technique you can use to see what words stand out in the data set is to build a Wordcloud visualization. First, we will use pip to install the library nad then build a simple function to plot the words." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python -m pip install wordcloud" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from wordcloud import WordCloud, STOPWORDS\n", "\n", "def plot_wordcloud(text, mask=None, max_words=200, max_font_size=150, figure_size=(25.0,15.0), \n", " title = None, title_size=40, image_color=False):\n", " stopwords = set(STOPWORDS)\n", "\n", " wordcloud = WordCloud(background_color='gray',\n", " stopwords = stopwords,\n", " max_words = max_words,\n", " max_font_size = max_font_size, \n", " random_state = 50,\n", " width=800, \n", " height=400,\n", " mask = mask)\n", " wordcloud.generate(str(text))\n", " \n", " plt.figure(figsize=figure_size)\n", " if image_color:\n", " image_colors = ImageColorGenerator(mask);\n", " plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation=\"bilinear\");\n", " plt.title(title, fontdict={'size': title_size, \n", " 'verticalalignment': 'bottom'})\n", " else:\n", " plt.imshow(wordcloud);\n", " plt.title(title, fontdict={'size': title_size, 'color': 'black', \n", " 'verticalalignment': 'bottom'})\n", " plt.axis('off');\n", " plt.tight_layout() \n", " \n", "plot_wordcloud(df['review_body'], title=\"Word Cloud of Amazon Reviews\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see in the wordcloud `br` which typically means there is HTML markeup in the `review_body`. We want to clean the reviews by removing the markup as well. We will use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to remove the markup. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import bs4\n", "df['review_body'] = df['review_body'].apply(lambda x: bs4.BeautifulSoup(x, 'lxml').get_text())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Discover " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import string\n", "\n", "## Number of words in the text ##\n", "df[\"num_words\"] = df[\"review_body\"].apply(lambda x: len(str(x).split()))\n", "\n", "## Number of unique words in the text ##\n", "df[\"num_unique_words\"] = df[\"review_body\"].apply(lambda x: len(set(str(x).split())))\n", "\n", "## Number of characters in the text ##\n", "df[\"num_chars\"] = df[\"review_body\"].apply(lambda x: len(str(x)))\n", "\n", "## Number of stopwords in the text ##\n", "df[\"num_stopwords\"] = df[\"review_body\"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))\n", "\n", "## Number of punctuations in the text ##\n", "df[\"num_punctuations\"] = df['review_body'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )\n", "\n", "## Average length of the words in the text ##\n", "df[\"mean_word_len\"] = df[\"review_body\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Truncate some extreme values for better visuals, think about balancing positive vs negative reviews ##\n", "df = df.query('num_words <= 500 and num_punctuations < 500')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(df['sentiment'].value_counts())\n", "sns.countplot(x='sentiment', data=df)" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "f, axes = plt.subplots(3, 1, figsize=(10,20))\n", "sns.violinplot(x='sentiment', y='num_words', data=df, ax=axes[0])\n", "axes[0].set_xlabel('Sentiment', fontsize=12)\n", "axes[0].set_title(\"Number of words in each class\", fontsize=15)\n", "\n", "sns.violinplot(x='sentiment', y='num_chars', data=df, ax=axes[1])\n", "axes[1].set_xlabel('Sentiment', fontsize=12)\n", "axes[1].set_title(\"Number of characters in each class\", fontsize=15)\n", "\n", "sns.violinplot(x='sentiment', y='num_punctuations', data=df, ax=axes[2])\n", "axes[2].set_xlabel('Sentiment', fontsize=12)\n", "axes[2].set_title(\"Number of punctuations in each class\", fontsize=15)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling imbalanced datasets\n", "\n", "Here you can see we have a larger number of `positive` samples vs. `negative` ones. There are a number of techniques to blance this dataset out and the two most popular approaches are to either under-sample or over-sample. With under sampling you remove rows to balance the dataset out and in over sampling you can duplicate entries in the daatset which could lead to overfitting. This discussion is beyond the scope of this lab. You will under sample the data to balance the dataset but you can find more information [here]()." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.utils import resample\n", "\n", "positive = df[df['sentiment']==1]\n", "negative = df[df['sentiment']==0]\n", "\n", "positive_downsampled = resample(positive,\n", " replace = False, # sample without replacement\n", " n_samples = len(negative), # match minority n\n", " random_state = 27) # reproducible results\n", "\n", "# combine minority and downsampled majority\n", "downsampled = pd.concat([positive_downsampled, negative])\n", "\n", "# checking counts\n", "print(downsampled['sentiment'].value_counts())\n", "sns.countplot(x='sentiment', data=downsampled)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Export datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import csv \n", "\n", "file_name = 'reviews.csv'\n", "final_df = downsampled.loc[:, ['review_body', 'star_rating', 'sentiment']]\n", "final_df['review_body'] = downsampled['review_body'].astype('str')\n", "final_df.to_csv(\"reviews.csv\", index=False, header=False, quoting=csv.QUOTE_ALL, escapechar='\\\\', doublequote=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unlabeled Data\n", "\n", "We will also create a dataset here with the reviews only. This will simulate not having labelled data for training and in a follow up lab you will be able to use [SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) to build up a labeled dataset for training." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gt_df = downsampled.loc[:, ['review_body']]\n", "gt_df.to_csv(\"groundtruth.csv\", index=False, header=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "unlabeled_df = pd.read_csv('groundtruth.csv', error_bad_lines=False)\n", "unlabeled_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Upload to S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)\n", "\n", "Next, we will upload the files you created above to S3 to be used in a later workshop.\n", "\n", "[s3.upload_file](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_resource.Bucket(bucket).Object(os.path.join('raw', file_name)).upload_file(file_name)\n", "s3_resource.Bucket(bucket).Object(os.path.join('unlabeled', 'groundtruth.csv')).upload_file('groundtruth.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List S3 Bucket Contents\n", "\n", "Now that we have uploaded objects to our S3 bucket we can view the contents of the bucket as well through the SDK." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "paginator = s3.get_paginator('list_objects_v2')\n", "for result in paginator.paginate(Bucket=bucket):\n", " if result.get('Contents') is not None:\n", " for files in result.get('Contents'):\n", " if not files.get('Key') == \"\":\n", " print(files.get('Key'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create test, train, and validate datasets for training\n", "\n", "Depending on the framework you are leveraging in your AI/ML workloads you may decide to split the data into test, train, and validate splits before uploading to S3. You can leverage some built in functions in the sklearn package to do the split. To learn more about the sklearn framework click [here](https://scikit-learn.org/stable/)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "train, test = train_test_split(final_df, test_size=0.2, random_state=0)\n", "train, validate = train_test_split(train, test_size=0.2, random_state=0)\n", "\n", "print(f'Number of training examples: {len(train.index)}')\n", "print(f'Number of validation examples: {len(validate.index)}')\n", "print(f'Number of testing examples: {len(test.index)}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train.to_csv(\"train.csv\", index=False, header=False, quoting=csv.QUOTE_NONE, escapechar='\\\\')\n", "test.to_csv(\"test.csv\", index=False, header=False, quoting=csv.QUOTE_NONE, escapechar='\\\\')\n", "validate.to_csv(\"validate.csv\", index=False, header=False, quoting=csv.QUOTE_NONE, escapechar='\\\\')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Upload to S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)\n", "\n", "Now let's upload the splits created above to S3.\n", "\n", "[s3.upload_file](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_file) boto3 documentation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_resource.Bucket(bucket).Object(os.path.join('data', 'train.csv')).upload_file('train.csv')\n", "s3_resource.Bucket(bucket).Object(os.path.join('data', 'test.csv')).upload_file('test.csv')\n", "s3_resource.Bucket(bucket).Object(os.path.join('data', 'validate.csv')).upload_file('validate.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Democratizing Data\n", "\n", "You may determine that you want to make this dataset available to your lab or team to give them the ability to perform their own research on this same dataset. You can utilize a number of services in the [AWS Data Lakes & Analytics](https://aws.amazon.com/big-data/datalakes-and-analytics/) group to do so. This is the start of building out a [Data Lake on AWS](https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/?nc=sn&loc=2). We will start by cataloging the raw data in the Glue Data Catalog which is a Hive compliant metadata repository.\n", "\n", "\n", "### Create the [AWS Glue Catalog Database](https://docs.aws.amazon.com/glue/latest/dg/define-database.html)\n", "\n", "When you define a table in the AWS Glue Data Catalog, you add it to a database. A database is used to organize tables in AWS Glue. You can organize your tables using a crawler or using the AWS Glue console. A table can be in only one database at a time.\n", "\n", "There is a central Glue Catalog for each AWS account. 
When creating the database, you will use your account id, declared above as `account_id`.\n", "\n", "[glue.create_database](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_database)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_database(glue_client, account_id, database_name, description):\n", " \"\"\"Create the specified Glue database if it does not exist\"\"\"\n", " try:\n", " glue_client.get_database(\n", " CatalogId=account_id,\n", " Name=database_name\n", " )\n", " except glue_client.exceptions.EntityNotFoundException:\n", " print(\"Creating database: %s\" % database_name)\n", " glue_client.create_database(\n", " CatalogId=account_id,\n", " DatabaseInput={\n", " 'Name': database_name,\n", " 'Description': description\n", " }\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "database_name = 'reviews'\n", "create_database(glue, account_id, database_name, 'Database for Amazon Reviews')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Create the Raw table in Glue](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html)\n", "\n", "When you define a table in AWS Glue, you also specify the value of a classification field that indicates the type and format of the data that's stored in that table. If a crawler creates the table, these classifications are determined by either a built-in classifier or a custom classifier. If you create a table manually in the console or by using an API, you specify the classification when you define the table. For more information about creating a table using the AWS Glue console, see [Working with Tables on the AWS Glue Console](https://docs.aws.amazon.com/glue/latest/dg/console-tables.html).\n", "\n", "[glue.create_table](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.create_table)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "location = 's3://{0}/raw/'.format(bucket)\n", "raw_table_name = 'raw'\n", "\n", "response = glue.create_table(\n", " CatalogId=account_id,\n", " DatabaseName=database_name,\n", " TableInput={\n", " 'Name': raw_table_name,\n", " 'Description': 'Raw Amazon Mobile Products reviews',\n", " 'StorageDescriptor': {\n", " 'Columns': [ \n", " {\n", " 'Name': 'review_body',\n", " 'Type': 'string',\n", " 'Comment': 'Review text from the review'\n", " }, \n", " {\n", " 'Name': 'star_rating',\n", " 'Type': 'float',\n", " 'Comment': 'Star rating from the review'\n", " },\n", " {\n", " 'Name': 'sentiment',\n", " 'Type': 'tinyint',\n", " 'Comment': 'Sentiment derived from the star rating of the review'\n", " }\n", " ],\n", " 'Location': location,\n", " 'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',\n", " 'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',\n", " 'SerdeInfo': {\n", " 'SerializationLibrary': 'org.apache.hadoop.hive.serde2.OpenCSVSerde',\n", " 'Parameters': {\n", " 'escapeChar': '\\\\',\n", " 'separatorChar': ',',\n", " 'serialization.format': '1'\n", " }\n", " },\n", " },\n", " 'TableType': 'EXTERNAL_TABLE',\n", " 'Parameters': {\n", " 'classification': 'csv'\n", " }\n", " }\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### [Query Raw Amazon Reviews with Athena ](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html)\n", "\n", "To see the raw reviews we will be 
installing a Python library for querying the data in the Glue Data Catalog with Athena. More information is available on [PyAthena](https://pypi.org/project/PyAthena/) and the [ANSI SQL and DDL commands supported by Athena](https://docs.aws.amazon.com/athena/latest/ug/ddl-sql-reference.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install PyAthena" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyathena import connect\n", "from pyathena.util import as_pandas\n", "\n", "cursor = connect(region_name=region, s3_staging_dir='s3://{0}/temp'.format(bucket)).cursor()\n", "cursor.execute('select * from {0}.{1} limit 10'.format(database_name, raw_table_name))\n", "\n", "df = as_pandas(cursor)\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cursor.execute('select count(1) as count, star_rating from {0}.{1} group by star_rating'.format(database_name, raw_table_name))\n", "\n", "df = as_pandas(cursor)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Optional - SQL Magic in Jupyter Notebooks\n", "\n", "You can use the built-in functionality in Jupyter to create shortcut magic commands to fit your needs. Here we will use the pyathena library as above to query the review data, but this time we will encapsulate the call in a Jupyter magic command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pyathena\n", "from pyathena.util import as_pandas\n", "\n", "from IPython.core import magic_arguments\n", "from IPython.core.magic import cell_magic, Magics, magics_class\n", "\n", "\n", "def query_athena(sql, region_name, s3_staging_dir):\n", " cursor = pyathena.connect(\n", " region_name=region_name,\n", " s3_staging_dir=\"{}\".format(s3_staging_dir)).cursor()\n", " cursor.execute(sql)\n", " return cursor\n", "\n", "@magics_class\n", "class AthenaMagics(Magics):\n", " s3_staging_dir = None\n", " region_name = None\n", "\n", " \n", " def parse_args(self, line):\n", " args = magic_arguments.parse_argstring(self.athena, line)\n", "\n", " # s3 staging directory\n", " if args.s3_staging_dir is None and self.s3_staging_dir is None:\n", " raise ValueError(\"s3_staging_dir for Athena should be set\")\n", " if args.s3_staging_dir is not None:\n", " self.s3_staging_dir = args.s3_staging_dir\n", " \n", " # region name\n", " if args.region_name is None and self.region_name is None:\n", " raise ValueError(\"region_name for Athena should be set\")\n", " if args.region_name is not None:\n", " self.region_name = args.region_name\n", " \n", " @cell_magic\n", " @magic_arguments.magic_arguments()\n", " @magic_arguments.argument('--s3_staging_dir', '-s',\n", " help='s3 path required by athena for writing query results (e.g. s3://your/staging/dir)'\n", " )\n", " @magic_arguments.argument('--region_name', '-r',\n", " help='aws region name (e.g. 
us-west-2)'\n", " )\n", " def athena(self, line='', cell=None):\n", " self.parse_args(line)\n", " cursor = query_athena(cell, self.region_name, self.s3_staging_dir)\n", " return as_pandas(cursor)\n", "\n", " \n", "ip = get_ipython()\n", "ip.register_magics(AthenaMagics)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(region)\n", "print(\"s3://{0}/temp\".format(bucket))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the magic is registered, you can use the `%%athena` magic to query data registered in your Glue Data Catalog.\n", "\n", "**Replace {{region}} and {{bucket/temp}} in the cell below with the region and bucket values printed above, e.g.:**\n", "\n", "%%athena -r us-west-2 -s s3://escience-workshop/temp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%athena -r {{region}} -s {{bucket/temp}}\n", "select * \n", "from reviews.raw limit 10;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have successfully completed this lab! \n", "\n", "In this lab you learned to store and retrieve objects with S3, manipulate data with a number of Python tools such as pandas, seaborn, and sklearn, and register and query data in AWS to start building out a data lake to democratize access to datasets for machine learning.\n", "\n", "### Cleanup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }