{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "8ea891f6", "metadata": {}, "source": [ "# Amazon Personalize: using text as unstructured item metadata\n", "\n", "The relevance of the recommendations you deliver with Amazon Personalize depends on the data available when the recommendations are generated. Amazon Personalize uses your users’ historical interactions, the attributes of your items, and your users’ metadata to learn what items are most relevant for each user. The primary data required by Amazon Personalize is user-item interactions. The interactions users have with items in your catalog, such as clicking on a product, reading an article, watching a video, or purchasing a product, are an important signal of what they have found relevant in the past. Including item and user attributes, also known as metadata, can enhance the relevance of recommendations, especially for new items that are similar to what your users have found relevant. However, structured metadata such as an item’s category, style, or genre may not always be readily available, or may not capture all the information in your narrative descriptions. Amazon Personalize now allows you to add unstructured metadata such as product descriptions, video transcripts, or article text alongside your other item attributes. 
Amazon Personalize hosts, manages, and automatically uses natural language processing (NLP) models to process your text and use it to improve the performance of your Amazon Personalize solutions.\n", "\n", "This notebook will demonstrate how text in the form of product descriptions can be included as unstructured item metadata to improve the relevance of recommendations.\n", "\n", "Amazon Reviews data from the Amazon Prime Pantry category are used for the interactions and items datasets.\n", "\n", "When deciding whether to include text in your items dataset, keep the following best practices in mind.\n", "- Text that is editorially validated to be concise, relevant, and informative to each item, where the most pertinent details are mentioned earlier in the text, is preferred over user generated content that may be less relevant or consistent\n", "- A sparsely populated text column will diminish the positive impact of including text in the items dataset\n", "- Cleanse all text of markup and superfluous whitespace formatting before adding it to the items dataset\n", "- English is currently the only supported language for the text field\n", "- Text fields are currently only considered for the User-Personalization and Personalized-Ranking recipes\n", "\n", "Two dataset groups will be created, one with item descriptions and one without, so we can train separate models and compare their offline and online results." ] }, { "cell_type": "code", "execution_count": 1, "id": "89197fc8", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import json\n", "import numpy as np\n", "from datetime import datetime\n", "import boto3\n", "import time\n", "from time import sleep\n", "from lxml import html" ] }, { "cell_type": "markdown", "id": "4dda155e", "metadata": {}, "source": [ "## Load and inspect datasets\n", "\n", "We'll start by loading the Prime Pantry reviews dataset. 
You will need to fill out the form for access to the data files:\n", "\n", "http://deepyeti.ucsd.edu/jianmo/amazon/index.html\n", "\n", "Citation:\n", "> Justifying recommendations using distantly-labeled reviews and fine-grained aspects \n", "> Jianmo Ni, Jiacheng Li, Julian McAuley \n", "> Empirical Methods in Natural Language Processing (EMNLP), 2019 [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf)" ] }, { "cell_type": "code", "execution_count": 2, "id": "2d4ca5e3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mkdir: cannot create directory ‘raw_data’: File exists\n", "--2021-07-13 22:06:52-- http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Prime_Pantry.json.gz\n", "Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50\n", "Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 45435146 (43M) [application/octet-stream]\n", "Saving to: ‘Prime_Pantry.json.gz’\n", "\n", "Prime_Pantry.json.g 100%[===================>] 43.33M 12.3MB/s in 4.8s \n", "\n", "2021-07-13 22:06:57 (9.01 MB/s) - ‘Prime_Pantry.json.gz’ saved [45435146/45435146]\n", "\n", "--2021-07-13 22:06:57-- http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Prime_Pantry.json.gz\n", "Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50\n", "Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 5281662 (5.0M) [application/octet-stream]\n", "Saving to: ‘meta_Prime_Pantry.json.gz’\n", "\n", "meta_Prime_Pantry.j 100%[===================>] 5.04M 6.49MB/s in 0.8s \n", "\n", "2021-07-13 22:06:58 (6.49 MB/s) - ‘meta_Prime_Pantry.json.gz’ saved [5281662/5281662]\n", "\n" ] } ], "source": [ "data_dir = 'raw_data'\n", "!mkdir $data_dir\n", "\n", "!cd $data_dir && \\\n", " wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Prime_Pantry.json.gz && \\\n", " wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Prime_Pantry.json.gz" ] }, { "cell_type": "markdown", "id": "e66d003e", "metadata": {}, "source": [ "### Load and inspect reviews data\n", "\n", "We'll start by loading the reviews dataset for the Prime Pantry products and running some commands to see what we have to work with." ] }, { "cell_type": "code", "execution_count": 3, "id": "8289e424", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " overall verified reviewTime reviewerID asin \\\n", "0 5 True 12 14, 2014 A1NKJW0TNRVS7O B0000DIWNZ \n", "1 4 True 11 20, 2014 A2L6X37E8TFTCC B0000DIWNZ \n", "2 4 True 10 11, 2014 A2WPR4W6V48121 B0000DIWNZ \n", "3 3 False 09 1, 2014 A27EE7X7L29UMU B0000DIWNZ \n", "4 4 True 08 10, 2014 A1OWT4YZGB5GV9 B0000DIWNZ \n", "\n", " reviewerName reviewText \\\n", "0 Tamara M. Good clinging \n", "1 Amazon Customer Fantastic buy and a good plastic wrap. Even t... \n", "2 noname ok \n", "3 ZapNZs Saran Cling Plus is kind of like most of the C... \n", "4 Amy Rogers This is my go to plastic wrap so there isn't m... \n", "\n", " summary unixReviewTime vote \\\n", "0 Clings well 1418515200 NaN \n", "1 Saran could use more Plus to Cling better. 1416441600 NaN \n", "2 Four Stars 1412985600 NaN \n", "3 The wrap is fantastic, but the dispensing, cut... 1409529600 4 \n", "4 has been doing it's job for years 1407628800 NaN \n", "\n", " image style \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pantry_df = pd.read_json(data_dir + '/Prime_Pantry.json.gz', lines=True, compression='infer')\n", "pantry_df.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "365cbe16", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(471614, 12)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pantry_df.shape" ] }, { "cell_type": "markdown", "id": "c65ae637", "metadata": {}, "source": [ "What can we learn from this output? There are over 471K reviews and 12 columns of data. The `asin` column is our unique item identifier, `reviewerID` is our unique user identifier, `unixReviewTime` is our timestamp for the review, and `overall` indicates the positivity of the review on a scale of 1-5. We will use this file as the basis for our interactions dataset for Personalize. 
" ] }, { "cell_type": "markdown", "id": "891c6405", "metadata": {}, "source": [ "### Build and save interactions dataset\n", "\n", "Let's start building our interactions dataset by narrowing down the rows we want to include. The first step is to isolate only the positive reviews. For this we will assume that any review with an overall rating of 4 or higher is positive. Any review with a rating of 3 or below is treated as mediocre or negative." ] }, { "cell_type": "code", "execution_count": 5, "id": "ecfd527b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(387692, 12)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "positive_reviews_df = pantry_df[pantry_df['overall'] > 3]\n", "positive_reviews_df.shape" ] }, { "cell_type": "markdown", "id": "13abaf9d", "metadata": {}, "source": [ "We're down to 387K positive reviews, which is still plenty for training a model in Personalize.\n", "\n", "Next let's narrow down the dataset to just the columns we need and add an `EVENT_TYPE` column to indicate the type of events we're capturing. Adding an `EVENT_TYPE` column now will make it easier to explore testing real-time events later if you choose to do so (since `eventType` is a required field for the [PutEvents](https://docs.aws.amazon.com/personalize/latest/dg/API_UBS_PutEvents.html) API)." ] }, { "cell_type": "code", "execution_count": 6, "id": "3ab7d273", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " reviewerID asin unixReviewTime overall EVENT_TYPE\n", "0 A1NKJW0TNRVS7O B0000DIWNZ 1418515200 5 reviewed\n", "1 A2L6X37E8TFTCC B0000DIWNZ 1416441600 4 reviewed\n", "2 A2WPR4W6V48121 B0000DIWNZ 1412985600 4 reviewed\n", "4 A1OWT4YZGB5GV9 B0000DIWNZ 1407628800 4 reviewed\n", "5 A1GN2ADKF1IE7K B0000DIWNZ 1405296000 5 reviewed" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "positive_reviews_df = positive_reviews_df[['reviewerID', 'asin', 'unixReviewTime', 'overall']]\n", "positive_reviews_df['EVENT_TYPE']='reviewed'\n", "\n", "positive_reviews_df.head()" ] }, { "cell_type": "markdown", "id": "f05c33a8", "metadata": {}, "source": [ "One last check we should do is sanity check a `unixReviewTime` column value. Since Personalize builds sequence models based on the date and time of each interaction, it's important that the timestamp of each interaction is represented in the expected format so that it is interpreted correctly.\n", "\n", "Let's pick a value for the `unixReviewTime` column and parse it into a human-readable date so we can verify that it's reasonable." ] }, { "cell_type": "code", "execution_count": 7, "id": "0b51f569", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1321488000\n", "2011-11-17 00:00:00\n" ] } ], "source": [ "time_stamp = positive_reviews_df.iloc[50]['unixReviewTime']\n", "print(time_stamp)\n", "print(datetime.utcfromtimestamp(time_stamp).strftime('%Y-%m-%d %H:%M:%S'))" ] }, { "cell_type": "markdown", "id": "29fe21fd", "metadata": {}, "source": [ "The timestamp value looks good. Let's get some final summary information for our dataset." ] }, { "cell_type": "code", "execution_count": 8, "id": "31739017", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " reviewerID asin unixReviewTime overall EVENT_TYPE\n", "count 387692 387692 3.876920e+05 387692.000000 387692\n", "unique 202254 10584 NaN NaN 1\n", "top A35Q0RBM3YNQNF B00XA9DADC NaN NaN reviewed\n", "freq 176 5288 NaN NaN 387692\n", "mean NaN NaN 1.468847e+09 4.847227 NaN\n", "std NaN NaN 4.314975e+07 0.359769 NaN\n", "min NaN NaN 1.073693e+09 4.000000 NaN\n", "25% NaN NaN 1.447200e+09 5.000000 NaN\n", "50% NaN NaN 1.474718e+09 5.000000 NaN\n", "75% NaN NaN 1.498435e+09 5.000000 NaN\n", "max NaN NaN 1.538611e+09 5.000000 NaN" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "positive_reviews_df.describe(include='all')" ] }, { "cell_type": "markdown", "id": "9e54247f", "metadata": {}, "source": [ "We have 387K reviews from 202K distinct reviewers/users across 10K unique products. This is the basis of our interactions dataset.\n", "\n", "Before we can use this as our interactions dataset, though, we need to rename the columns to match those expected by Personalize." ] }, { "cell_type": "code", "execution_count": 9, "id": "174de8be", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " USER_ID ITEM_ID TIMESTAMP EVENT_VALUE EVENT_TYPE\n", "0 A1NKJW0TNRVS7O B0000DIWNZ 1418515200 5 reviewed\n", "1 A2L6X37E8TFTCC B0000DIWNZ 1416441600 4 reviewed\n", "2 A2WPR4W6V48121 B0000DIWNZ 1412985600 4 reviewed\n", "4 A1OWT4YZGB5GV9 B0000DIWNZ 1407628800 4 reviewed\n", "5 A1GN2ADKF1IE7K B0000DIWNZ 1405296000 5 reviewed\n", "... ... ... ... ... ...\n", "471609 A19GSVHXVT5NNF B01HI8JVI8 1494892800 5 reviewed\n", "471610 ABSCTKLX9F9IU B01HI8JVI8 1493769600 5 reviewed\n", "471611 A2R33RCWKDHZ3L B01HI8JVI8 1492646400 5 reviewed\n", "471612 A2INGHYEXZDHMC B01HI8JVI8 1492560000 5 reviewed\n", "471613 AJ8M6GVFGIRLS B01HI8JVI8 1487808000 5 reviewed\n", "\n", "[387692 rows x 5 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "positive_reviews_df.rename(columns = {'reviewerID':'USER_ID', 'asin':'ITEM_ID', \n", " 'unixReviewTime':'TIMESTAMP', 'overall': 'EVENT_VALUE'}, inplace = True)\n", "positive_reviews_df" ] }, { "cell_type": "markdown", "id": "b6fa5b5c", "metadata": {}, "source": [ "Finally, let's save our positive reviews dataframe as a CSV. We'll upload this CSV to Personalize later in this notebook." ] }, { "cell_type": "code", "execution_count": 10, "id": "299f803c", "metadata": {}, "outputs": [], "source": [ "interactions_filename = \"interactions.csv\"\n", "positive_reviews_df.to_csv(interactions_filename, index=False, float_format='%.0f')" ] }, { "cell_type": "markdown", "id": "f816d13d", "metadata": {}, "source": [ "### Load and inspect item metadata\n", "\n", "Now that we have the interactions dataset established, let's turn to the items dataset. This is where we will find the unstructured text value that we will include in the model.\n", "\n", "Like the reviews dataset, the Prime Pantry item metadata file is also represented in JSON. 
The nested structure of this file presents some challenges in getting our data formatted the way we need it.\n", "\n", "Let's start by loading the metadata file into a dataframe and taking a look at the data." ] }, { "cell_type": "code", "execution_count": 11, "id": "6821a075", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
categorytech1descriptionfittitlealso_buytech2brandfeaturerankalso_viewdetailsmain_catsimilar_itemdatepriceasinimageURLimageURLHighRes
0[][Sink your sweet tooth into MILK DUDS Candya d...HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...[B019KE37WO, B007NQSWEU]Milk Duds[][][]{'ASIN: ': 'B00005BPJO', 'Item model number:':...<img src=\"https://m.media-amazon.com/images/G/...NaT$5.00B00005BPJO[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
1[][Sink your sweet tooth into MILK DUDS Candya d...HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...[B019KE37WO, B007NQSWEU]Milk Duds[][][]{'ASIN: ': 'B00005BPJO', 'Item model number:':...<img src=\"https://m.media-amazon.com/images/G/...NaT$5.00B00005BPJO[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
2[][A perfect Lentil soup starts with Goya Lentil...Goya Dry Lentils, 16 oz[B003SI144W, B000VDRKEK]Goya[][][B074MFVZG7, B079PTH69L, B000VDRKEK, B074M9T81...{'ASIN: ': 'B0000DIF38'}<img src=\"https://images-na.ssl-images-amazon....NaTB0000DIF38[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
3[][Saran Premium Wrap is an extra tough yet easy...Saran Premium Plastic Wrap, 100 Sq Ft[B01MY5FHT6, B000PYF8VM, B000SRMDFA, B07CX6LN8...Saran[][][B077QLSLRQ, B00JPKW1RQ, B000FE2IK6, B00XUJHJ9...{'Domestic Shipping: ': 'This item can only be...<img src=\"https://images-na.ssl-images-amazon....NaTB0000DIWNI[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
4[][200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...Saran Cling Plus Plastic Wrap, 200 Sq Ft[]Saran[][][B0014CZ0TE]{'Domestic Shipping: ': 'This item can only be...<img src=\"https://images-na.ssl-images-amazon....NaTB0000DIWNZ[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
............................................................
10808[][These bars are where our journey started and ...KIND Bars, Caramel Almond &amp; Sea Salt, Glut...[]KIND[]26,259 in Grocery & Gourmet Food ([B00JQQAN60, B00JQQAWSY, B0111K7V54, B0111K8L9...{'ASIN: ': 'B01HI76312', 'Item model number:':...<img src=\"https://images-na.ssl-images-amazon....NaT$3.98B01HI76312[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
10809[][These bars are where our journey started and ...KIND Bars, Maple Glazed Pecan &amp; Sea Salt, ...[]KIND[]16,822 in Grocery & Gourmet Food ([B0111K97JC, B00JQQAN60, B0111K8L9Y, B01HI7631...{'ASIN: ': 'B01HI76790'}<img src=\"https://images-na.ssl-images-amazon....NaT$5.81B01HI76790[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
10810[][These bars are where our journey started and ...KIND Bars, Dark Chocolate Almond &amp; Coconut...[]KIND[]107,057 in Grocery & Gourmet Food ([B0111K7V54, B01HI76312, B00JQQAL0S, B0111K97J...{'ASIN: ': 'B01HI76SA8'}<img src=\"https://images-na.ssl-images-amazon....NaT$4.98B01HI76SA8[][]
10811[][These bars are where our journey started and ...KIND Bars, Honey Roasted Nuts &amp; Sea Salt, ...[]KIND[]24,648 in Grocery & Gourmet Food ([B00JQQAN60, B0111K7V54, B01HI76312, B0111K97J...{'ASIN: ': 'B01HI76XS0'}<img src=\"https://images-na.ssl-images-amazon....NaT$5.81B01HI76XS0[https://images-na.ssl-images-amazon.com/image...[https://images-na.ssl-images-amazon.com/image...
10812[][Delicious, crowd pleasing and convenient it c...Tostitos Salsa, Mild, 69 Ounce[B00ID0EBP4, B00ID0EB1S, B00N0WZSOS, B00IAE65L...Tostitos[]259,965 in Grocery & Gourmet Food ([]{'ASIN: ': 'B01HI8JVI8'}<img src=\"https://images-na.ssl-images-amazon....NaT$6.82B01HI8JVI8[][]
\n", "

10813 rows × 19 columns

\n", "
" ], "text/plain": [ " category tech1 description fit \\\n", "0 [] [Sink your sweet tooth into MILK DUDS Candya d... \n", "1 [] [Sink your sweet tooth into MILK DUDS Candya d... \n", "2 [] [A perfect Lentil soup starts with Goya Lentil... \n", "3 [] [Saran Premium Wrap is an extra tough yet easy... \n", "4 [] [200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas... \n", "... ... ... ... .. \n", "10808 [] [These bars are where our journey started and ... \n", "10809 [] [These bars are where our journey started and ... \n", "10810 [] [These bars are where our journey started and ... \n", "10811 [] [These bars are where our journey started and ... \n", "10812 [] [Delicious, crowd pleasing and convenient it c... \n", "\n", " title \\\n", "0 HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C... \n", "1 HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C... \n", "2 Goya Dry Lentils, 16 oz \n", "3 Saran Premium Plastic Wrap, 100 Sq Ft \n", "4 Saran Cling Plus Plastic Wrap, 200 Sq Ft \n", "... ... \n", "10808 KIND Bars, Caramel Almond & Sea Salt, Glut... \n", "10809 KIND Bars, Maple Glazed Pecan & Sea Salt, ... \n", "10810 KIND Bars, Dark Chocolate Almond & Coconut... \n", "10811 KIND Bars, Honey Roasted Nuts & Sea Salt, ... \n", "10812 Tostitos Salsa, Mild, 69 Ounce \n", "\n", " also_buy tech2 brand \\\n", "0 [B019KE37WO, B007NQSWEU] Milk Duds \n", "1 [B019KE37WO, B007NQSWEU] Milk Duds \n", "2 [B003SI144W, B000VDRKEK] Goya \n", "3 [B01MY5FHT6, B000PYF8VM, B000SRMDFA, B07CX6LN8... Saran \n", "4 [] Saran \n", "... ... ... ... \n", "10808 [] KIND \n", "10809 [] KIND \n", "10810 [] KIND \n", "10811 [] KIND \n", "10812 [B00ID0EBP4, B00ID0EB1S, B00N0WZSOS, B00IAE65L... Tostitos \n", "\n", " feature rank \\\n", "0 [] [] \n", "1 [] [] \n", "2 [] [] \n", "3 [] [] \n", "4 [] [] \n", "... ... ... 
\n", "10808 [] 26,259 in Grocery & Gourmet Food ( \n", "10809 [] 16,822 in Grocery & Gourmet Food ( \n", "10810 [] 107,057 in Grocery & Gourmet Food ( \n", "10811 [] 24,648 in Grocery & Gourmet Food ( \n", "10812 [] 259,965 in Grocery & Gourmet Food ( \n", "\n", " also_view \\\n", "0 [] \n", "1 [] \n", "2 [B074MFVZG7, B079PTH69L, B000VDRKEK, B074M9T81... \n", "3 [B077QLSLRQ, B00JPKW1RQ, B000FE2IK6, B00XUJHJ9... \n", "4 [B0014CZ0TE] \n", "... ... \n", "10808 [B00JQQAN60, B00JQQAWSY, B0111K7V54, B0111K8L9... \n", "10809 [B0111K97JC, B00JQQAN60, B0111K8L9Y, B01HI7631... \n", "10810 [B0111K7V54, B01HI76312, B00JQQAL0S, B0111K97J... \n", "10811 [B00JQQAN60, B0111K7V54, B01HI76312, B0111K97J... \n", "10812 [] \n", "\n", " details \\\n", "0 {'ASIN: ': 'B00005BPJO', 'Item model number:':... \n", "1 {'ASIN: ': 'B00005BPJO', 'Item model number:':... \n", "2 {'ASIN: ': 'B0000DIF38'} \n", "3 {'Domestic Shipping: ': 'This item can only be... \n", "4 {'Domestic Shipping: ': 'This item can only be... \n", "... ... \n", "10808 {'ASIN: ': 'B01HI76312', 'Item model number:':... 
\n", "10809 {'ASIN: ': 'B01HI76790'} \n", "10810 {'ASIN: ': 'B01HI76SA8'} \n", "10811 {'ASIN: ': 'B01HI76XS0'} \n", "10812 {'ASIN: ': 'B01HI8JVI8'} \n", "\n", " main_cat similar_item date \\\n", "0 \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
categorytech1descriptionfittitlealso_buytech2brandfeaturerankalso_viewdetailsmain_catsimilar_itemdatepriceasinimageURLimageURLHighRes
count1081310813108131081310813108131081310813108131081310813108131081310813010813108131081310813
unique119409110782395711960763482859401078641014821081289408940
top[][]Infants' Motrin Concentrated Drops, Fever Redu...[]L'Oreal Paris[][][]{}<img src=\"https://images-na.ssl-images-amazon....NaNB00005BPJO[][]
freq108131081398108132675410813171977759374835241062110813NaN4063217811781
\n", "" ], "text/plain": [ " category tech1 description fit \\\n", "count 10813 10813 10813 10813 \n", "unique 1 1 9409 1 \n", "top [] [] \n", "freq 10813 10813 98 10813 \n", "\n", " title also_buy tech2 \\\n", "count 10813 10813 10813 \n", "unique 10782 3957 1 \n", "top Infants' Motrin Concentrated Drops, Fever Redu... [] \n", "freq 2 6754 10813 \n", "\n", " brand feature rank also_view details \\\n", "count 10813 10813 10813 10813 10813 \n", "unique 1960 763 4828 5940 10786 \n", "top L'Oreal Paris [] [] [] {} \n", "freq 171 9777 5937 4835 24 \n", "\n", " main_cat similar_item date \\\n", "count 10813 10813 0 \n", "unique 4 1 0 \n", "top \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
asinbrandpricedescription
0B00005BPJOMilk Duds$5.00[Sink your sweet tooth into MILK DUDS Candya d...
1B00005BPJOMilk Duds$5.00[Sink your sweet tooth into MILK DUDS Candya d...
2B0000DIF38Goya[A perfect Lentil soup starts with Goya Lentil...
3B0000DIWNISaran[Saran Premium Wrap is an extra tough yet easy...
4B0000DIWNZSaran[200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...
5B0000GH6UGIbarra[Ibarra Chocolate, 19 Oz, , ]
6B0000KC2BKKnorr$3.09[Knorr Granulated Chicken Flavor Bouillon is a...
7B0001E1IN8Castillo[Red chili habanero sauces. They are present t...
8B00032E8XKChicken of the Sea$1.48[Chicken of the Sea Solid White Albacore Tuna ...
9B0005XMTHESmucker's$2.29[Helps build muscles with bcaa's amino acids, ...
\n", "" ], "text/plain": [ " asin brand price \\\n", "0 B00005BPJO Milk Duds $5.00 \n", "1 B00005BPJO Milk Duds $5.00 \n", "2 B0000DIF38 Goya \n", "3 B0000DIWNI Saran \n", "4 B0000DIWNZ Saran \n", "5 B0000GH6UG Ibarra \n", "6 B0000KC2BK Knorr $3.09 \n", "7 B0001E1IN8 Castillo \n", "8 B00032E8XK Chicken of the Sea $1.48 \n", "9 B0005XMTHE Smucker's $2.29 \n", "\n", " description \n", "0 [Sink your sweet tooth into MILK DUDS Candya d... \n", "1 [Sink your sweet tooth into MILK DUDS Candya d... \n", "2 [A perfect Lentil soup starts with Goya Lentil... \n", "3 [Saran Premium Wrap is an extra tough yet easy... \n", "4 [200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas... \n", "5 [Ibarra Chocolate, 19 Oz, , ] \n", "6 [Knorr Granulated Chicken Flavor Bouillon is a... \n", "7 [Red chili habanero sauces. They are present t... \n", "8 [Chicken of the Sea Solid White Albacore Tuna ... \n", "9 [Helps build muscles with bcaa's amino acids, ... " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_df = pantry_meta_df.copy()\n", "items_df = items_df[['asin', 'brand', 'price', 'description']]\n", "items_df.head(10)" ] }, { "cell_type": "markdown", "id": "b0bf2443", "metadata": {}, "source": [ "Next let's drop duplicate rows based on the `asin` column value. There should only be one duplicate based on the `describe()` output above." ] }, { "cell_type": "code", "execution_count": 14, "id": "c53122a5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(10812, 4)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_df = items_df.drop_duplicates(subset=['asin'], keep='last')\n", "items_df.shape" ] }, { "cell_type": "markdown", "id": "090316bb", "metadata": {}, "source": [ "Next let's focus on reformatting and cleaning up the `description` column values. 
As you can see above, the `description` is currently represented as an array of strings (because that's how it is represented in the JSON file). We need to flatten this array into a single string and strip all HTML markup from each fragment.\n", "\n", "We'll start by creating two utility functions that will be used to clean the `description` (and later the `title` column in the original dataset when we want to display titles for recommended products)." ] }, { "cell_type": "code", "execution_count": 15, "id": "32c14d03", "metadata": {}, "outputs": [], "source": [ "# Strips and cleans a value of HTML markup and whitespace.\n", "def clean_markup(value):\n", "    s = str(value).strip()\n", "    if s != '':\n", "        s = str(html.fromstring(s).text_content())\n", "        s = ' '.join(s.split())\n", "\n", "    return s.strip()\n", "\n", "# Cleans and reformats the description column value for a dataframe row.\n", "def clean_and_reformat_description(row):\n", "    s = ''\n", "    for el in row['description']:\n", "        el = clean_markup(el)\n", "        if el != '':\n", "            s += ' ' + el\n", "\n", "    return s.strip()" ] }, { "cell_type": "code", "execution_count": 16, "id": "aa51af53", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " asin brand price \\\n", "1 B00005BPJO Milk Duds $5.00 \n", "2 B0000DIF38 Goya \n", "3 B0000DIWNI Saran \n", "4 B0000DIWNZ Saran \n", "5 B0000GH6UG Ibarra \n", "... ... ... ... \n", "10808 B01HI76312 KIND $3.98 \n", "10809 B01HI76790 KIND $5.81 \n", "10810 B01HI76SA8 KIND $4.98 \n", "10811 B01HI76XS0 KIND $5.81 \n", "10812 B01HI8JVI8 Tostitos $6.82 \n", "\n", " description \n", "1 Sink your sweet tooth into MILK DUDS Candya de... \n", "2 A perfect Lentil soup starts with Goya Lentils... \n", "3 Saran Premium Wrap is an extra tough yet easy ... \n", "4 200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy... \n", "5 Ibarra Chocolate, 19 Oz \n", "... ... \n", "10808 These bars are where our journey started and i... \n", "10809 These bars are where our journey started and i... \n", "10810 These bars are where our journey started and i... \n", "10811 These bars are where our journey started and i... \n", "10812 Delicious, crowd pleasing and convenient it ca... \n", "\n", "[10812 rows x 4 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_df['description'] = items_df.apply(clean_and_reformat_description, axis=1)\n", "items_df" ] }, { "cell_type": "markdown", "id": "f9089ade", "metadata": {}, "source": [ "Next let's take a look at the `price` column and change its type from a string to a float." ] }, { "cell_type": "code", "execution_count": 17, "id": "06d8448c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ " 4063\n", "$2.99 114\n", "$3.99 113\n", "$4.99 103\n", "$5.99 87\n", " ... 
\n", "$20.42 1\n", "$32.32 1\n", "$1.52 1\n", "$27.89 1\n", "$39.10 1\n", "Name: price, Length: 1482, dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_df['price'].value_counts()" ] }, { "cell_type": "markdown", "id": "8f1ae9a2", "metadata": {}, "source": [ "The following cell will convert empty/non-numeric prices to `np.nan` and remove the `$` currency symbol from all others. This will allow us to coerce the type to a float." ] }, { "cell_type": "code", "execution_count": 18, "id": "c233e0bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " asin brand price \\\n", "1 B00005BPJO Milk Duds 5.00 \n", "2 B0000DIF38 Goya NaN \n", "3 B0000DIWNI Saran NaN \n", "4 B0000DIWNZ Saran NaN \n", "5 B0000GH6UG Ibarra NaN \n", "... ... ... ... \n", "10808 B01HI76312 KIND 3.98 \n", "10809 B01HI76790 KIND 5.81 \n", "10810 B01HI76SA8 KIND 4.98 \n", "10811 B01HI76XS0 KIND 5.81 \n", "10812 B01HI8JVI8 Tostitos 6.82 \n", "\n", " description \n", "1 Sink your sweet tooth into MILK DUDS Candya de... \n", "2 A perfect Lentil soup starts with Goya Lentils... \n", "3 Saran Premium Wrap is an extra tough yet easy ... \n", "4 200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy... \n", "5 Ibarra Chocolate, 19 Oz \n", "... ... \n", "10808 These bars are where our journey started and i... \n", "10809 These bars are where our journey started and i... \n", "10810 These bars are where our journey started and i... \n", "10811 These bars are where our journey started and i... \n", "10812 Delicious, crowd pleasing and convenient it ca... \n", "\n", "[10812 rows x 4 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def convert_price(row):\n", " v = str(row['price']).strip().replace('$', '')\n", " if v == '' or not v.lstrip('-').replace('.', '').isdigit():\n", " return np.nan\n", " return v\n", "\n", "items_df['price'] = items_df.apply(convert_price, axis=1)\n", "items_df" ] }, { "cell_type": "code", "execution_count": 19, "id": "cc688b06", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.99 114\n", "3.99 113\n", "4.99 103\n", "5.99 87\n", "2.98 76\n", " ... 
\n", "39.10 1\n", "1.84 1\n", "22.95 1\n", "12.17 1\n", "11.09 1\n", "Name: price, Length: 1480, dtype: int64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_df['price'].value_counts()" ] }, { "cell_type": "code", "execution_count": 20, "id": "00b61ccc", "metadata": {}, "outputs": [], "source": [ "items_df['price'] = items_df['price'].astype(float)" ] }, { "cell_type": "markdown", "id": "4e124320", "metadata": {}, "source": [ "Next we'll rename the columns to match the names and uppercase name format expected by Personalize." ] }, { "cell_type": "code", "execution_count": 21, "id": "191f558a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " ITEM_ID BRAND PRICE \\\n", "1 B00005BPJO Milk Duds 5.00 \n", "2 B0000DIF38 Goya NaN \n", "3 B0000DIWNI Saran NaN \n", "4 B0000DIWNZ Saran NaN \n", "5 B0000GH6UG Ibarra NaN \n", "6 B0000KC2BK Knorr 3.09 \n", "7 B0001E1IN8 Castillo NaN \n", "8 B00032E8XK Chicken of the Sea 1.48 \n", "9 B0005XMTHE Smucker's 2.29 \n", "10 B0005XNE6E Snapple 1.99 \n", "\n", " DESCRIPTION \n", "1 Sink your sweet tooth into MILK DUDS Candya de... \n", "2 A perfect Lentil soup starts with Goya Lentils... \n", "3 Saran Premium Wrap is an extra tough yet easy ... \n", "4 200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy... \n", "5 Ibarra Chocolate, 19 Oz \n", "6 Knorr Granulated Chicken Flavor Bouillon is a ... \n", "7 Red chili habanero sauces. They are present to... \n", "8 Chicken of the Sea Solid White Albacore Tuna i... \n", "9 Helps build muscles with bcaa's amino acids, i... \n", "10 At Snapple, we believe lifes a peach. Weve bee... " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_df.rename(columns = {'asin':'ITEM_ID', 'brand':'BRAND', \n", " 'price':'PRICE', 'description': 'DESCRIPTION'}, inplace = True)\n", "items_df.head(10)" ] }, { "cell_type": "markdown", "id": "517f9ad9", "metadata": {}, "source": [ "We'll be creating two items CSVs. One will have the description column and the other will not. We'll use each of these to train separate models with the same recipe so we can compare the offline metrics and do some online inspection of recommendations." ] }, { "cell_type": "code", "execution_count": 22, "id": "d4e21ac2", "metadata": {}, "outputs": [], "source": [ "items_with_desc_filename = \"items-with-desc.csv\"\n", "items_df.to_csv(items_with_desc_filename, index=False, float_format='%.2f')" ] }, { "cell_type": "markdown", "id": "15dee22c", "metadata": {}, "source": [ "Another items CSV with the description column removed." 
] }, { "cell_type": "code", "execution_count": 23, "id": "2a01eb5f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
" ], "text/plain": [ " ITEM_ID BRAND PRICE\n", "1 B00005BPJO Milk Duds 5.0\n", "2 B0000DIF38 Goya NaN\n", "3 B0000DIWNI Saran NaN\n", "4 B0000DIWNZ Saran NaN\n", "5 B0000GH6UG Ibarra NaN" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_without_desc_df = items_df[['ITEM_ID', 'BRAND', 'PRICE']]\n", "items_without_desc_df.head()" ] }, { "cell_type": "code", "execution_count": 24, "id": "a73cf6ee", "metadata": {}, "outputs": [], "source": [ "items_without_desc_filename = \"items-without-desc.csv\"\n", "items_without_desc_df.to_csv(items_without_desc_filename, index=False, float_format='%.2f')" ] }, { "cell_type": "markdown", "id": "1d67baf5", "metadata": {}, "source": [ "## Create dataset groups and upload datasets\n", "\n", "With the datasets that we need built, now it's time to upload them to Personalize using dataset import jobs. Before we can upload the CSVs, we need to create dataset groups to hold our two dataset approaches (without and with descriptions), create schemas for our datasets, and create datasets.\n", "\n", "We'll start by creating SDK client that we'll need to interact with Personalize." ] }, { "cell_type": "code", "execution_count": 25, "id": "11a54883", "metadata": {}, "outputs": [], "source": [ "personalize = boto3.client('personalize')" ] }, { "cell_type": "markdown", "id": "cbc9f544", "metadata": {}, "source": [ "### Create dataset groups\n", "\n", "Let's create our two dataset groups." 
] }, { "cell_type": "code", "execution_count": 26, "id": "7d433f5a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetGroupArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-without-desc\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"20bd153c-ebd0-432d-9ef3-522a0d2fd8d4\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:08:18 GMT\",\n", " \"x-amzn-requestid\": \"20bd153c-ebd0-432d-9ef3-522a0d2fd8d4\",\n", " \"content-length\": \"105\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_group_response = personalize.create_dataset_group(\n", " name = \"amazon-pantry-without-desc\"\n", ")\n", "\n", "dataset_group_without_desc_arn = create_dataset_group_response['datasetGroupArn']\n", "print(json.dumps(create_dataset_group_response, indent=2))" ] }, { "cell_type": "code", "execution_count": 27, "id": "1a071f53", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetGroupArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-with-desc\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"9cec53c8-28a3-40e5-bf0e-6f87a03113f7\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:08:18 GMT\",\n", " \"x-amzn-requestid\": \"9cec53c8-28a3-40e5-bf0e-6f87a03113f7\",\n", " \"content-length\": \"102\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_group_response = personalize.create_dataset_group(\n", " name = \"amazon-pantry-with-desc\"\n", ")\n", "\n", "dataset_group_with_desc_arn = create_dataset_group_response['datasetGroupArn']\n", 
"print(json.dumps(create_dataset_group_response, indent=2))" ] }, { "cell_type": "markdown", "id": "20481190", "metadata": {}, "source": [ "Since dataset groups can take a few seconds to be fully created, let's wait until they both have a status of ACTIVE." ] }, { "cell_type": "code", "execution_count": 28, "id": "8dfac385", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "At least one dataset group create is still in progress\n", "Dataset group create succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-without-desc\n", "At least one dataset group create is still in progress\n", "Dataset group create succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-with-desc\n" ] } ], "source": [ "in_progress_dataset_group_arns = [ dataset_group_without_desc_arn, dataset_group_with_desc_arn ]\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " for dataset_group_arn in in_progress_dataset_group_arns:\n", " describe_dataset_group_response = personalize.describe_dataset_group(\n", " datasetGroupArn = dataset_group_arn\n", " )\n", " status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n", " if status == \"ACTIVE\":\n", " print(\"Dataset group create succeeded for {}\".format(dataset_group_arn))\n", " in_progress_dataset_group_arns.remove(dataset_group_arn)\n", " elif status == \"CREATE FAILED\":\n", " print(\"Create failed for {}\".format(dataset_group_arn))\n", " in_progress_dataset_group_arns.remove(dataset_group_arn)\n", "\n", " if len(in_progress_dataset_group_arns) <= 0:\n", " break\n", " else:\n", " print(\"At least one dataset group create is still in progress\")\n", " \n", " time.sleep(10)" ] }, { "cell_type": "markdown", "id": "b16fdbdb", "metadata": {}, "source": [ "### Create Interactions dataset schema and datasets\n", "\n", "Since the interactions dataset will be the same for both dataset groups, we will be 
creating a single schema for the interactions dataset type and sharing it across both dataset groups. This is possible since schemas are global to your AWS account and not specific to a dataset group." ] }, { "cell_type": "code", "execution_count": 29, "id": "38851dc2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"schemaArn\": \"arn:aws:personalize:us-east-1:224124347618:schema/amazon-pantry-interactions\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"6be5e019-0c8e-487b-a158-6f089f9e79ca\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:08:40 GMT\",\n", " \"x-amzn-requestid\": \"6be5e019-0c8e-487b-a158-6f089f9e79ca\",\n", " \"content-length\": \"92\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "interactions_schema = schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Interactions\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"USER_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"TIMESTAMP\",\n", " \"type\": \"long\"\n", " },\n", " {\n", " \"name\": \"EVENT_VALUE\",\n", " \"type\": \"float\"\n", " },\n", " {\n", " \"name\": \"EVENT_TYPE\",\n", " \"type\": \"string\"\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", " \n", "create_schema_response = personalize.create_schema(\n", " name = \"amazon-pantry-interactions\",\n", " schema = json.dumps(interactions_schema)\n", ")\n", "\n", "interaction_schema_arn = create_schema_response['schemaArn']\n", "print(json.dumps(create_schema_response, indent=2))" ] }, { "cell_type": "markdown", "id": "7e32abcd", "metadata": {}, "source": [ "Next we'll create an Interactions dataset in both dataset groups specifying the schema we just created." 
] }, { "cell_type": "code", "execution_count": 30, "id": "b8641620", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-without-desc/INTERACTIONS\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"ea1020bb-a7fa-4a64-a29e-f07e8e0f9ae9\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:08:41 GMT\",\n", " \"x-amzn-requestid\": \"ea1020bb-a7fa-4a64-a29e-f07e8e0f9ae9\",\n", " \"content-length\": \"107\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "dataset_type = \"INTERACTIONS\"\n", "create_dataset_response = personalize.create_dataset(\n", " name = \"amazon-pantry-without-desc-ints\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_without_desc_arn,\n", " schemaArn = interaction_schema_arn\n", ")\n", "\n", "interactions_dataset_without_desc_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "code", "execution_count": 31, "id": "fe84262a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-with-desc/INTERACTIONS\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"614257cc-6344-408e-8e15-89d4f81ae927\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:08:41 GMT\",\n", " \"x-amzn-requestid\": \"614257cc-6344-408e-8e15-89d4f81ae927\",\n", " \"content-length\": \"104\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_response = personalize.create_dataset(\n", " name = 
\"amazon-pantry-with-desc-ints\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_with_desc_arn,\n", " schemaArn = interaction_schema_arn\n", ")\n", "\n", "interactions_dataset_with_desc_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "id": "208ac794", "metadata": {}, "source": [ "### Stage Interactions CSV in S3\n", "\n", "Before we can upload the interactions CSV we created earlier into the Personalize datasets that we just created, we need to stage the CSV in an S3 bucket.\n", "\n", "Let's create an S3 bucket and copy the interactions CSV file to the bucket." ] }, { "cell_type": "code", "execution_count": 32, "id": "af016b5a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "us-east-1\n" ] } ], "source": [ "# Determine the current S3 region where this notebook is being hosted in SageMaker.\n", "with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n", " data = json.load(notebook_info)\n", " resource_arn = data['ResourceArn']\n", " region = resource_arn.split(':')[3]\n", "print(region)" ] }, { "cell_type": "code", "execution_count": 33, "id": "dcd7662a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "224124347618-us-east-1-amazon-pantry-personalize-text\n" ] } ], "source": [ "s3 = boto3.client('s3')\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "bucket_name = account_id + \"-\" + region + \"-\" + \"amazon-pantry-personalize-text\"\n", "print(bucket_name)\n", "if region == \"us-east-1\":\n", " s3.create_bucket(Bucket=bucket_name)\n", "else:\n", " s3.create_bucket(\n", " Bucket=bucket_name,\n", " CreateBucketConfiguration={'LocationConstraint': region}\n", " )" ] }, { "cell_type": "markdown", "id": "1129d901", "metadata": {}, "source": [ "#### Upload Interactions CSV to S3" ] }, { "cell_type": "code", "execution_count": 34, "id": 
"d416f815", "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_filename)" ] }, { "cell_type": "markdown", "id": "5beabb6c", "metadata": {}, "source": [ "### Create S3 bucket policy and IAM role\n", "\n", "Before we can submit a dataset import job to Personalize, we have to create a bucket policy and IAM role that will give Personalize access to our bucket." ] }, { "cell_type": "code", "execution_count": 35, "id": "7cf21723", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'ResponseMetadata': {'RequestId': 'SBN10P9R7H7ST5RK',\n", " 'HostId': 'DD8fYEx27yBq6/rB7o9lMvkdCLOHOewN05NSq73g30jeFBdouLj5D+fWSnIZHvDuAKdCKEo7w3k=',\n", " 'HTTPStatusCode': 204,\n", " 'HTTPHeaders': {'x-amz-id-2': 'DD8fYEx27yBq6/rB7o9lMvkdCLOHOewN05NSq73g30jeFBdouLj5D+fWSnIZHvDuAKdCKEo7w3k=',\n", " 'x-amz-request-id': 'SBN10P9R7H7ST5RK',\n", " 'date': 'Tue, 13 Jul 2021 22:10:59 GMT',\n", " 'server': 'AmazonS3'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "policy = {\n", " \"Version\": \"2012-10-17\",\n", " \"Id\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Statement\": [\n", " {\n", " \"Sid\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"personalize.amazonaws.com\"\n", " },\n", " \"Action\": [\n", " \"s3:*Object\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::{}\".format(bucket_name),\n", " \"arn:aws:s3:::{}/*\".format(bucket_name)\n", " ]\n", " }\n", " ]\n", "}\n", "\n", "s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))" ] }, { "cell_type": "code", "execution_count": 36, "id": "fb5b7804", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "arn:aws:iam::224124347618:role/PersonalizeRoleAmazonPantry\n" ] } ], "source": [ "iam = boto3.client(\"iam\")\n", "\n", 
"role_name = \"PersonalizeRoleAmazonPantry\"\n", "assume_role_policy_document = {\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"personalize.amazonaws.com\"\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " }\n", " ]\n", "}\n", "\n", "create_role_response = iam.create_role(\n", " RoleName = role_name,\n", " AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)\n", ")\n", "\n", "# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes \"personalize\" or \"Personalize\" \n", "# if you would like to use a bucket with a different name, please consider creating and attaching a new policy\n", "# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role\n", "policy_arn = \"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess\"\n", "iam.attach_role_policy(\n", " RoleName = role_name,\n", " PolicyArn = policy_arn\n", ")\n", "\n", "# Now add S3 support\n", "iam.attach_role_policy(\n", " PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',\n", " RoleName=role_name\n", ")\n", "time.sleep(20) # wait for a minute to allow IAM role policy attachment to propagate\n", "\n", "role_arn = create_role_response[\"Role\"][\"Arn\"]\n", "print(role_arn)" ] }, { "cell_type": "markdown", "id": "ec1513dc", "metadata": {}, "source": [ "### Import Interactions datasets for each dataset group\n", "\n", "Now we're ready to import the staged Interactions CSV in our S3 bucket to the Personalize datasets we created in each dataset group. We'll submit both import jobs and wait for them both to complete." 
] }, { "cell_type": "code", "execution_count": 37, "id": "f4fc686d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-ints-import\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"84c4d71d-fe71-4ee7-bccf-4f8ed8b1e549\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:12:08 GMT\",\n", " \"x-amzn-requestid\": \"84c4d71d-fe71-4ee7-bccf-4f8ed8b1e549\",\n", " \"content-length\": \"126\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"amazon-pantry-without-desc-ints-import\",\n", " datasetArn = interactions_dataset_without_desc_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, interactions_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_import_job_without_ints_arn = create_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_dataset_import_job_response, indent=2))" ] }, { "cell_type": "code", "execution_count": 38, "id": "4d23015e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-ints-import\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"eae39716-264b-48e8-bfb4-93f569c7a904\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:12:09 GMT\",\n", " \"x-amzn-requestid\": \"eae39716-264b-48e8-bfb4-93f569c7a904\",\n", " \"content-length\": \"123\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " 
\"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"amazon-pantry-with-desc-ints-import\",\n", " datasetArn = interactions_dataset_with_desc_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, interactions_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_import_job_with_ints_arn = create_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "id": "fe0de9d0", "metadata": {}, "source": [ "### Wait for Interactions dataset import jobs to complete\n", "\n", "The following cell will wait for both import jobs to complete." ] }, { "cell_type": "code", "execution_count": 39, "id": "68b770b9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-ints-import\n", "At least one dataset import job is still in progress\n", "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-ints-import\n", "CPU times: user 42.3 ms, sys: 6.1 ms, total: 48.4 ms\n", "Wall time: 5min\n" ] } ], "source": [ "%%time\n", "\n", "in_progress_import_arns = [ dataset_import_job_without_ints_arn, dataset_import_job_with_ints_arn ]\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " for import_arn in in_progress_import_arns:\n", " describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", " datasetImportJobArn = import_arn\n", " )\n", " status = 
describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n", " if status == \"ACTIVE\":\n", " print(\"Dataset import succeeded for {}\".format(import_arn))\n", " in_progress_import_arns.remove(import_arn)\n", " elif status == \"CREATE FAILED\":\n", " print(\"Create failed for {}\".format(import_arn))\n", " in_progress_import_arns.remove(import_arn)\n", "\n", " if len(in_progress_import_arns) <= 0:\n", " break\n", " else:\n", " print(\"At least one dataset import job is still in progress\")\n", " \n", " time.sleep(60)" ] }, { "cell_type": "markdown", "id": "d6334309", "metadata": {}, "source": [ "### Create Items dataset schema and datasets\n", "\n", "Next we will repeat the process for the items datasets. This time, though, we will need to create two schemas since one items dataset includes the description column and the other does not. We'll start with the schema that does not include the description." ] }, { "cell_type": "code", "execution_count": 40, "id": "bbe80a96", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"schemaArn\": \"arn:aws:personalize:us-east-1:224124347618:schema/amazon-pantry-item-without-desc-schema\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"11a18b05-189b-4485-ba8b-58083e784321\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:17:16 GMT\",\n", " \"x-amzn-requestid\": \"11a18b05-189b-4485-ba8b-58083e784321\",\n", " \"content-length\": \"104\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "item_without_desc_schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Items\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"BRAND\",\n", " \"type\": [ \"null\", \"string\" ],\n", " 
\"categorical\": True\n", " },{\n", " \"name\": \"PRICE\",\n", " \"type\": [ \"null\", \"float\" ],\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", "\n", "create_schema_response = personalize.create_schema(\n", " name = \"amazon-pantry-item-without-desc-schema\",\n", " schema = json.dumps(item_without_desc_schema)\n", ")\n", "\n", "item_without_desc_schema_arn = create_schema_response['schemaArn']\n", "print(json.dumps(create_schema_response, indent=2))" ] }, { "cell_type": "markdown", "id": "75bcd9b8", "metadata": {}, "source": [ "Next we'll create a schema that includes the description. Be sure to take note of the `\"textual\": True` attribute on the `DESCRIPTION` field. This is how you differentiate unstructured text fields from categorical and string fields. Without this attribute, Personalize will not apply natural language processing techniques to extract features from this text." ] }, { "cell_type": "code", "execution_count": 41, "id": "052d149e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"schemaArn\": \"arn:aws:personalize:us-east-1:224124347618:schema/amazon-pantry-item-with-desc-schema\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"05c60e80-1d7c-45f1-a881-d08af62a8432\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:17:16 GMT\",\n", " \"x-amzn-requestid\": \"05c60e80-1d7c-45f1-a881-d08af62a8432\",\n", " \"content-length\": \"101\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "item_with_desc_schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Items\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"BRAND\",\n", " \"type\": [ \"null\", \"string\" ],\n", " \"categorical\": True\n", " },{\n", 
" \"name\": \"PRICE\",\n", " \"type\": [ \"null\", \"float\" ],\n", " },{\n", " \"name\": \"DESCRIPTION\",\n", " \"type\": [ \"null\", \"string\" ],\n", " \"textual\": True\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", "\n", "create_schema_response = personalize.create_schema(\n", " name = \"amazon-pantry-item-with-desc-schema\",\n", " schema = json.dumps(item_with_desc_schema)\n", ")\n", "\n", "item_with_desc_schema_arn = create_schema_response['schemaArn']\n", "print(json.dumps(create_schema_response, indent=2))" ] }, { "cell_type": "markdown", "id": "43255f2e", "metadata": {}, "source": [ "Next we will create Personalize datasets in each dataset group, taking special care to specify the appropriate schema ARN for each dataset." ] }, { "cell_type": "code", "execution_count": 42, "id": "1b53563d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-without-desc/ITEMS\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"f9b87f6c-22c8-42a7-8a3a-8c203300e3eb\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:18:13 GMT\",\n", " \"x-amzn-requestid\": \"f9b87f6c-22c8-42a7-8a3a-8c203300e3eb\",\n", " \"content-length\": \"100\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "dataset_type = \"ITEMS\"\n", "create_dataset_response = personalize.create_dataset(\n", " name = \"amazon-pantry-without-desc-items\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_without_desc_arn,\n", " schemaArn = item_without_desc_schema_arn\n", ")\n", "\n", "items_dataset_without_desc_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "code", "execution_count": 43, "id": "1763e36a", "metadata": {}, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-with-desc/ITEMS\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"cbba4d2a-17c9-4a9b-a6a3-a2df934e2de1\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:18:16 GMT\",\n", " \"x-amzn-requestid\": \"cbba4d2a-17c9-4a9b-a6a3-a2df934e2de1\",\n", " \"content-length\": \"97\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_response = personalize.create_dataset(\n", " name = \"amazon-pantry-with-desc-items\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_with_desc_arn,\n", " schemaArn = item_with_desc_schema_arn\n", ")\n", "\n", "items_dataset_with_desc_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "id": "45684b47", "metadata": {}, "source": [ "#### Stage Items CSV in S3\n", "\n", "Next we'll copy our two items CSV files to the same S3 bucket created above." ] }, { "cell_type": "code", "execution_count": 44, "id": "9dfd8f9b", "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket_name).Object(items_without_desc_filename).upload_file(items_without_desc_filename)" ] }, { "cell_type": "code", "execution_count": 45, "id": "3b7c8c32", "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket_name).Object(items_with_desc_filename).upload_file(items_with_desc_filename)" ] }, { "cell_type": "markdown", "id": "05130046", "metadata": {}, "source": [ "### Import Items datasets for each dataset group\n", "\n", "Since the S3 bucket policy and IAM role are already set up, we can just submit two dataset import jobs to import the Items CSVs." 
] }, { "cell_type": "code", "execution_count": 46, "id": "e05d6a34", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-items-import\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"c6ea207b-b8eb-4565-8ce2-6eb44e0fa18f\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:18:37 GMT\",\n", " \"x-amzn-requestid\": \"c6ea207b-b8eb-4565-8ce2-6eb44e0fa18f\",\n", " \"content-length\": \"127\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"amazon-pantry-without-desc-items-import\",\n", " datasetArn = items_dataset_without_desc_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, items_without_desc_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_import_job_without_items_arn = create_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_dataset_import_job_response, indent=2))" ] }, { "cell_type": "code", "execution_count": 47, "id": "b34a8b81", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-items-import\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"cf95206f-1527-4247-a618-6c8c832fa05f\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:18:38 GMT\",\n", " \"x-amzn-requestid\": \"cf95206f-1527-4247-a618-6c8c832fa05f\",\n", " \"content-length\": \"124\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " 
\"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "create_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"amazon-pantry-with-desc-items-import\",\n", " datasetArn = items_dataset_with_desc_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, items_with_desc_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_import_job_with_items_arn = create_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "id": "1f558185", "metadata": {}, "source": [ "### Wait for Items import jobs to complete\n", "\n", "The following logic will wait until both items datasets are fully imported into each dataset group." ] }, { "cell_type": "code", "execution_count": 48, "id": "18a35716", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-items-import\n", "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "At least one dataset import job is still in progress\n", "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-items-import\n", "CPU times: user 57.6 ms, sys: 5.06 ms, total: 62.7 ms\n", "Wall time: 7min\n" ] } ], "source": [ "%%time\n", "\n", "in_progress_import_arns = [ dataset_import_job_without_items_arn, dataset_import_job_with_items_arn ]\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", "    # Iterate over a copy so that removing a finished job does not skip the next one\n", "    for import_arn in list(in_progress_import_arns):\n",
"        describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", "            datasetImportJobArn = import_arn\n", "        )\n", "        status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n", "        if status == \"ACTIVE\":\n", "            print(\"Dataset import succeeded for {}\".format(import_arn))\n", "            in_progress_import_arns.remove(import_arn)\n", "        elif status == \"CREATE FAILED\":\n", "            print(\"Create failed for {}\".format(import_arn))\n", "            in_progress_import_arns.remove(import_arn)\n", "\n", "    if len(in_progress_import_arns) <= 0:\n", "        break\n", "    else:\n", "        print(\"At least one dataset import job is still in progress\")\n", " \n", "    time.sleep(60)" ] }, { "cell_type": "markdown", "id": "66c3c35c", "metadata": {}, "source": [ "## Create solutions and solution versions\n", "\n", "With the interactions and items datasets imported into each dataset group, we will next create solutions and solution versions using the user-personalization recipe for the data in each dataset group.\n", "\n", "First, let's list the Personalize recipes available." 
] }, { "cell_type": "code", "execution_count": 49, "id": "fa25cbf7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'recipes': [{'name': 'aws-hrnn',\n", " 'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn',\n", " 'status': 'ACTIVE',\n", " 'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n", " 'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n", " {'name': 'aws-hrnn-coldstart',\n", " 'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-coldstart',\n", " 'status': 'ACTIVE',\n", " 'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n", " 'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n", " {'name': 'aws-hrnn-metadata',\n", " 'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-metadata',\n", " 'status': 'ACTIVE',\n", " 'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n", " 'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n", " {'name': 'aws-personalized-ranking',\n", " 'recipeArn': 'arn:aws:personalize:::recipe/aws-personalized-ranking',\n", " 'status': 'ACTIVE',\n", " 'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n", " 'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n", " {'name': 'aws-popularity-count',\n", " 'recipeArn': 'arn:aws:personalize:::recipe/aws-popularity-count',\n", " 'status': 'ACTIVE',\n", " 'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n", " 'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n", " {'name': 'aws-sims',\n", " 'recipeArn': 'arn:aws:personalize:::recipe/aws-sims',\n", " 'status': 'ACTIVE',\n", " 'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n", " 'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n", " 
{'name': 'aws-user-personalization',\n", " 'recipeArn': 'arn:aws:personalize:::recipe/aws-user-personalization',\n", " 'status': 'ACTIVE',\n", " 'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n", " 'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())}],\n", " 'ResponseMetadata': {'RequestId': '36ed9d13-64e9-4905-8528-136d414cb606',\n", " 'HTTPStatusCode': 200,\n", " 'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',\n", " 'date': 'Tue, 13 Jul 2021 22:26:03 GMT',\n", " 'x-amzn-requestid': '36ed9d13-64e9-4905-8528-136d414cb606',\n", " 'content-length': '1259',\n", " 'connection': 'keep-alive'},\n", " 'RetryAttempts': 0}}" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "personalize.list_recipes()" ] }, { "cell_type": "markdown", "id": "8cd368df", "metadata": {}, "source": [ "We will use the user-personalization recipe for this notebook since it is one of the recipes that uses item metadata. This recipe supports the canonical personalization use case where given a user, you want Personalize to recommend items that this user would be interested in. " ] }, { "cell_type": "code", "execution_count": 50, "id": "cbb8d479", "metadata": {}, "outputs": [], "source": [ "user_personalization_recipe_arn = \"arn:aws:personalize:::recipe/aws-user-personalization\"" ] }, { "cell_type": "markdown", "id": "ae991241", "metadata": {}, "source": [ "First, we will create a solution and solution version in the dataset group that does not include item descriptions." 
] }, { "cell_type": "code", "execution_count": 51, "id": "99e0b49d", "metadata": {}, "outputs": [], "source": [ "user_personalization_create_solution_response = personalize.create_solution(\n", " name = \"amazon-pantry-without-desc-userpersonalization\",\n", " datasetGroupArn = dataset_group_without_desc_arn,\n", " recipeArn = user_personalization_recipe_arn\n", ")\n", "\n", "user_personalization_without_desc_solution_arn = user_personalization_create_solution_response['solutionArn']" ] }, { "cell_type": "code", "execution_count": 52, "id": "04b6fbda", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization\n" ] } ], "source": [ "print(user_personalization_without_desc_solution_arn)" ] }, { "cell_type": "code", "execution_count": 53, "id": "6ac0b147", "metadata": {}, "outputs": [], "source": [ "user_personalization_solution_version_response = personalize.create_solution_version(\n", " solutionArn = user_personalization_without_desc_solution_arn\n", ")" ] }, { "cell_type": "code", "execution_count": 54, "id": "7819e32a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization/0b76212f\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"018b3fcb-10d5-4290-a17a-3970723abacd\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:26:14 GMT\",\n", " \"x-amzn-requestid\": \"018b3fcb-10d5-4290-a17a-3970723abacd\",\n", " \"content-length\": \"132\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "user_personalization_without_solution_version_arn = user_personalization_solution_version_response['solutionVersionArn']\n", 
"print(json.dumps(user_personalization_solution_version_response, indent=2))" ] }, { "cell_type": "markdown", "id": "ecadbfb1", "metadata": {}, "source": [ "Next we will create a solution and solution version in the dataset group that includes the item descriptions." ] }, { "cell_type": "code", "execution_count": 55, "id": "58571178", "metadata": {}, "outputs": [], "source": [ "user_personalization_create_solution_response = personalize.create_solution(\n", " name = \"amazon-pantry-with-desc-userpersonalization\",\n", " datasetGroupArn = dataset_group_with_desc_arn,\n", " recipeArn = user_personalization_recipe_arn\n", ")\n", "\n", "user_personalization_with_desc_solution_arn = user_personalization_create_solution_response['solutionArn']" ] }, { "cell_type": "code", "execution_count": 56, "id": "ab2b246d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization\n" ] } ], "source": [ "print(user_personalization_with_desc_solution_arn)" ] }, { "cell_type": "code", "execution_count": 57, "id": "1ce9ded8", "metadata": {}, "outputs": [], "source": [ "user_personalization_solution_version_response = personalize.create_solution_version(\n", " solutionArn = user_personalization_with_desc_solution_arn\n", ")" ] }, { "cell_type": "code", "execution_count": 58, "id": "cdc817b8", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization/f178990f\",\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"f630b862-6fa9-4eb7-a0d2-71d0b9637e80\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 22:26:29 GMT\",\n", " \"x-amzn-requestid\": \"f630b862-6fa9-4eb7-a0d2-71d0b9637e80\",\n", " 
\"content-length\": \"129\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "user_personalization_with_solution_version_arn = user_personalization_solution_version_response['solutionVersionArn']\n", "print(json.dumps(user_personalization_solution_version_response, indent=2))" ] }, { "cell_type": "markdown", "id": "b33fd2f0", "metadata": {}, "source": [ "### Wait for solution versions to become active\n", "\n", "Finally, we'll wait for the solution versions to finish being created. This step is where Personalize trains machine learning models based on the datasets and selected recipe. Personalize will also split the interactions datasets into training and evaluation portions so it can evaluate the quality of recommendations against the trained model using held out data.\n", "\n", "You will notice that the solution version in the dataset group that includes the description data will take longer to train than the one without the description." 
] }, { "cell_type": "code", "execution_count": 59, "id": "e87ccb3d", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "Build succeeded for 
arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization/0b76212f\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "At least one solution build is still in progress\n", "Build succeeded for arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization/f178990f\n", "CPU times: user 363 ms, sys: 65.4 ms, total: 428 ms\n", "Wall time: 48min 4s\n" ] } ], "source": [ "%%time\n", "\n", "in_progress_solution_versions = [\n", " user_personalization_without_solution_version_arn,\n", " user_personalization_with_solution_version_arn\n", "]\n", "\n", "max_time = time.time() + 10*60*60 # 10 hours\n", "while time.time() < max_time:\n", "    # Iterate over a copy so that removing a finished build does not skip the next one\n", "    for solution_version_arn in list(in_progress_solution_versions):\n", "        version_response = personalize.describe_solution_version(\n", "            solutionVersionArn = solution_version_arn\n", "        )\n", "        status = version_response[\"solutionVersion\"][\"status\"]\n", " \n", "        if status == \"ACTIVE\":\n", "            print(\"Build succeeded for {}\".format(solution_version_arn))\n", "            in_progress_solution_versions.remove(solution_version_arn)\n", "        elif status == \"CREATE FAILED\":\n", "            print(\"Build failed for {}\".format(solution_version_arn))\n", 
"            in_progress_solution_versions.remove(solution_version_arn)\n", " \n", "    if len(in_progress_solution_versions) <= 0:\n", "        break\n", "    else:\n", "        print(\"At least one solution build is still in progress\")\n", " \n", "    time.sleep(60)" ] }, { "cell_type": "markdown", "id": "ccd2e728", "metadata": {}, "source": [ "Generally speaking, the addition of text-based unstructured metadata will increase training time. In our case, you can see above that the solution version trained on the dataset with product descriptions took about 15 minutes longer than the solution version trained on the dataset without product descriptions. This difference will vary based on the composition and text values of your datasets.\n", "\n", "Let's inspect the training hours for each solution version and compare them as well." ] }, { "cell_type": "code", "execution_count": 60, "id": "7f34c142", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training hours without description: 4.199\n", "Training hours with description: 5.346\n", "Difference of 27.32%\n" ] } ], "source": [ "response = personalize.describe_solution_version(solutionVersionArn = user_personalization_without_solution_version_arn)\n", "training_hours_without_desc = response['solutionVersion']['trainingHours']\n", "\n", "response = personalize.describe_solution_version(solutionVersionArn = user_personalization_with_solution_version_arn)\n", "training_hours_with_desc = response['solutionVersion']['trainingHours']\n", "training_diff = (training_hours_with_desc - training_hours_without_desc) / training_hours_without_desc\n", "\n", "print(f\"Training hours without description: {training_hours_without_desc}\")\n", "print(f\"Training hours with description: {training_hours_with_desc}\")\n", "\n", "print(\"Difference of {:.2%}\".format(training_diff))" ] }, { "cell_type": "markdown", "id": "baba62c8", "metadata": {}, "source": [
"The training hours used for cost calculations were about 27% higher for training with the description column. \n", "\n", "The wall-clock time and training hours will vary depending on the size of your datasets, but this information can help you assess the trade-off when considering adding unstructured text to your datasets." ] }, { "cell_type": "markdown", "id": "fa0bb048", "metadata": {}, "source": [ "### Inspect offline metrics\n", "\n", "Now that the solution versions have finished building, let's inspect and compare the offline metrics for each solution version to see how the inclusion of unstructured text has impacted these metrics." ] }, { "cell_type": "code", "execution_count": 61, "id": "73e915f5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization/0b76212f\",\n", " \"metrics\": {\n", " \"coverage\": 0.0914,\n", " \"mean_reciprocal_rank_at_25\": 0.0268,\n", " \"normalized_discounted_cumulative_gain_at_10\": 0.0376,\n", " \"normalized_discounted_cumulative_gain_at_25\": 0.0464,\n", " \"normalized_discounted_cumulative_gain_at_5\": 0.0309,\n", " \"precision_at_10\": 0.0058,\n", " \"precision_at_25\": 0.0037,\n", " \"precision_at_5\": 0.0076\n", " },\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"8c61339a-f929-47e0-81f0-a9660ebd589f\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 23:16:07 GMT\",\n", " \"x-amzn-requestid\": \"8c61339a-f929-47e0-81f0-a9660ebd589f\",\n", " \"content-length\": \"430\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "metrics_response = personalize.get_solution_metrics(\n", " solutionVersionArn = user_personalization_without_solution_version_arn\n", ")\n", "\n", "print(json.dumps(metrics_response, indent=2))" ] }, { "cell_type": "markdown", "id": "671ea36b", 
"metadata": {}, "source": [ "Let's save these off in a dictionary so we can more easily compare the metrics between both solution versions." ] }, { "cell_type": "code", "execution_count": 62, "id": "9eed50ff", "metadata": {}, "outputs": [], "source": [ "metrics = {\n", " 'Coverage': [ metrics_response['metrics']['coverage'] ],\n", " 'MRR-25': [ metrics_response['metrics']['mean_reciprocal_rank_at_25'] ],\n", " 'NDCG-5': [ metrics_response['metrics']['normalized_discounted_cumulative_gain_at_5'] ],\n", " 'NDCG-10': [ metrics_response['metrics']['normalized_discounted_cumulative_gain_at_10'] ],\n", " 'NDCG-25': [ metrics_response['metrics']['normalized_discounted_cumulative_gain_at_25'] ], \n", " 'Precision-5': [ metrics_response['metrics']['precision_at_5'] ],\n", " 'Precision-10': [ metrics_response['metrics']['precision_at_10'] ],\n", " 'Precision-25': [ metrics_response['metrics']['precision_at_25'] ], \n", "}" ] }, { "cell_type": "markdown", "id": "a0a751c2", "metadata": {}, "source": [ "Next, get the offline metrics for the solution version that included the description column and save them off too." 
] }, { "cell_type": "code", "execution_count": 63, "id": "13d8b2f9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization/f178990f\",\n", " \"metrics\": {\n", " \"coverage\": 0.1323,\n", " \"mean_reciprocal_rank_at_25\": 0.0367,\n", " \"normalized_discounted_cumulative_gain_at_10\": 0.049,\n", " \"normalized_discounted_cumulative_gain_at_25\": 0.0591,\n", " \"normalized_discounted_cumulative_gain_at_5\": 0.0425,\n", " \"precision_at_10\": 0.0071,\n", " \"precision_at_25\": 0.0045,\n", " \"precision_at_5\": 0.0104\n", " },\n", " \"ResponseMetadata\": {\n", " \"RequestId\": \"b54df693-4378-4194-96c7-cec3a9d934cf\",\n", " \"HTTPStatusCode\": 200,\n", " \"HTTPHeaders\": {\n", " \"content-type\": \"application/x-amz-json-1.1\",\n", " \"date\": \"Tue, 13 Jul 2021 23:16:14 GMT\",\n", " \"x-amzn-requestid\": \"b54df693-4378-4194-96c7-cec3a9d934cf\",\n", " \"content-length\": \"426\",\n", " \"connection\": \"keep-alive\"\n", " },\n", " \"RetryAttempts\": 0\n", " }\n", "}\n" ] } ], "source": [ "metrics_response = personalize.get_solution_metrics(\n", " solutionVersionArn = user_personalization_with_solution_version_arn\n", ")\n", "\n", "print(json.dumps(metrics_response, indent=2))" ] }, { "cell_type": "code", "execution_count": 64, "id": "81c4adbb", "metadata": {}, "outputs": [], "source": [ "metrics['Coverage'].append(metrics_response['metrics']['coverage'])\n", "metrics['MRR-25'].append(metrics_response['metrics']['mean_reciprocal_rank_at_25'])\n", "metrics['NDCG-5'].append(metrics_response['metrics']['normalized_discounted_cumulative_gain_at_5'])\n", "metrics['NDCG-10'].append(metrics_response['metrics']['normalized_discounted_cumulative_gain_at_10'])\n", "metrics['NDCG-25'].append(metrics_response['metrics']['normalized_discounted_cumulative_gain_at_25'])\n", 
"metrics['Precision-5'].append(metrics_response['metrics']['precision_at_5'])\n", "metrics['Precision-10'].append(metrics_response['metrics']['precision_at_10'])\n", "metrics['Precision-25'].append(metrics_response['metrics']['precision_at_25'])" ] }, { "cell_type": "markdown", "id": "6c93993e", "metadata": {}, "source": [ "Calculate the percentage change in each metric between the solution versions trained without and with text, and display the results." ] }, { "cell_type": "code", "execution_count": 65, "id": "37d850f9", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table>
<thead><tr><th></th><th>Without Text</th><th>With Text</th><th>% Change</th></tr></thead>
<tbody>
<tr><th>Coverage</th><td>0.0914</td><td>0.1323</td><td>44.75%</td></tr>
<tr><th>MRR-25</th><td>0.0268</td><td>0.0367</td><td>36.94%</td></tr>
<tr><th>NDCG-5</th><td>0.0309</td><td>0.0425</td><td>37.54%</td></tr>
<tr><th>NDCG-10</th><td>0.0376</td><td>0.0490</td><td>30.32%</td></tr>
<tr><th>NDCG-25</th><td>0.0464</td><td>0.0591</td><td>27.37%</td></tr>
<tr><th>Precision-5</th><td>0.0076</td><td>0.0104</td><td>36.84%</td></tr>
<tr><th>Precision-10</th><td>0.0058</td><td>0.0071</td><td>22.41%</td></tr>
<tr><th>Precision-25</th><td>0.0037</td><td>0.0045</td><td>21.62%</td></tr>
</tbody>
</table>
\n", "
" ], "text/plain": [ " Without Text With Text % Change\n", "Coverage 0.0914 0.1323 44.75%\n", "MRR-25 0.0268 0.0367 36.94%\n", "NDCG-5 0.0309 0.0425 37.54%\n", "NDCG-10 0.0376 0.0490 30.32%\n", "NDCG-25 0.0464 0.0591 27.37%\n", "Precision-5 0.0076 0.0104 36.84%\n", "Precision-10 0.0058 0.0071 22.41%\n", "Precision-25 0.0037 0.0045 21.62%" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for key in metrics:\n", "    metrics[key].append(\"{:.2%}\".format((metrics[key][1] - metrics[key][0])/metrics[key][0]))\n", "\n", "metrics_df = pd.DataFrame.from_dict(metrics,orient='index',columns=['Without Text', 'With Text', '% Change'])\n", "metrics_df" ] }, { "cell_type": "markdown", "id": "502af56d", "metadata": {}, "source": [ "These offline metrics show that the solution version trained with item descriptions scored higher on every metric, with relative improvements ranging from about 22% to about 45%. Sparser interactions datasets, where users and items have fewer interactions, will benefit more from the addition of text than datasets that already have a higher number of interactions per item and/or user." ] }, { "cell_type": "markdown", "id": "75acfe18", "metadata": {}, "source": [ "## Cleanup\n", "\n", "Personalize resources created by this notebook can be deleted from the Personalize service page in the AWS console. \n", "\n", "Alternatively, the following script can be run locally to delete all resources in each dataset group.\n", "\n", "https://gist.github.com/james-jory/62ddddf2f9180b77dd2a42e645b9d3b0" ] }, { "cell_type": "markdown", "id": "0520ba10", "metadata": {}, "source": [ "In addition, the IAM Role and S3 bucket can be deleted from the IAM and S3 service pages, respectively." 
] }, { "cell_type": "code", "execution_count": null, "id": "93e64371", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }