{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8ea891f6",
   "metadata": {},
   "source": [
    "# Amazon Personalize: using text as unstructured item metadata\n",
    "\n",
    "The relevance of the recommendations you deliver with Amazon Personalize depends on the data available when the recommendations are generated. Amazon Personalize uses your users’ historical interactions, the attributes of your items, and your users’ metadata to learn what items are most relevant for each user. The primary data required by Amazon Personalize is user-item interactions. The interactions users have with items in your catalog, such as clicking on a product, reading an article, watching a video, or purchasing a product, are an important signal of what they have found relevant in the past. Including item and user attributes, also known as metadata, can enhance the relevance of recommendations; especially for new items that are similar to what your users have found relevant. However, structured metadata such as an item’s category, style, or genre may not always be readily available or doesn’t provide all the information that you have in your narrative descriptions. Now Amazon Personalize allows you to add unstructured metadata such as product descriptions, video transcripts, or article text with your other item attributes. Amazon Personalize hosts, manages, and automatically uses natural language processing (NLP) models to process your text and use it to improve the performance of your Amazon Personalize solutions.\n",
    "\n",
    "This notebook will demonstrate how text in the form of product descriptions can be included as unstructured item metadata to improve the relevance of recommendations.\n",
    "\n",
    "Amazon Reviews data from the Amazon Prime Pantry category are used for the interactions and items datasets.\n",
    "\n",
    "When considering including text in your items dataset, keep the following best practices in mind.\n",
    "- Text that is editorially validated to be concise, relevant, and informative to each item, where the most pertinent details are mentioned earlier in the text, is preferred over user generated content that may be less relevant or consistent\n",
    "- A sparsely populated text column will diminish the positive impact of including text in the items dataset\n",
    "- Cleanse all text of markup and superfluous whitespace formatting before adding to it to the items dataset\n",
    "- English is currently the only supported language for the text field\n",
    "- Text fields are currently only considered for the User-Personalization and Personalized-Ranking recipes\n",
    "\n",
    "Two dataset groups will be created that will include data with and without item descriptions so we can train separate models and compare their offline and online results."
   ]
  },
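  {
   "cell_type": "markdown",
   "id": "a1c3e5f7",
   "metadata": {},
   "source": [
    "One way to follow the markup and whitespace guidance above is sketched below using only the Python standard library. `clean_description` is a hypothetical helper introduced here for illustration; it is not part of Amazon Personalize, and the cleansing applied elsewhere in this notebook may differ.\n",
    "\n",
    "```python\n",
    "import re\n",
    "from html.parser import HTMLParser\n",
    "\n",
    "\n",
    "class _TextExtractor(HTMLParser):\n",
    "    # Collects only the text nodes of an HTML fragment, dropping all tags.\n",
    "    def __init__(self):\n",
    "        super().__init__(convert_charrefs=True)\n",
    "        self.parts = []\n",
    "\n",
    "    def handle_data(self, data):\n",
    "        self.parts.append(data)\n",
    "\n",
    "\n",
    "def clean_description(raw):\n",
    "    # Strip markup, decode entities such as &amp;, and collapse whitespace.\n",
    "    parser = _TextExtractor()\n",
    "    parser.feed(raw)\n",
    "    parser.close()\n",
    "    text = ' '.join(parser.parts)\n",
    "    return re.sub(r'\\s+', ' ', text).strip()\n",
    "```\n",
    "\n",
    "Applying a helper like this to every description before writing the items CSV keeps the text column free of tags and stray line breaks."
   ]
  },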
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "89197fc8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import json\n",
    "import numpy as np\n",
    "from datetime import datetime\n",
    "import boto3\n",
    "import time\n",
    "from time import sleep\n",
    "from lxml import html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4dda155e",
   "metadata": {},
   "source": [
    "## Load and inspect datasets\n",
    "\n",
    "We'll start by loading the Prime Pantry reviews dataset. You will need to fill out the form for access to the data files:\n",
    "\n",
    "http://deepyeti.ucsd.edu/jianmo/amazon/index.html\n",
    "\n",
    "Citation:\n",
 Justifying recommendations">
    "> Justifying recommendations using distantly-labeled reviews and fine-grained aspects  \n",
    "> Jianmo Ni, Jiacheng Li, Julian McAuley  \n",
    "> Empirical Methods in Natural Language Processing (EMNLP), 2019 [pdf](http://cseweb.ucsd.edu/~jmcauley/pdfs/emnlp19a.pdf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "2d4ca5e3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: cannot create directory ‘raw_data’: File exists\n",
      "--2021-07-13 22:06:52--  http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Prime_Pantry.json.gz\n",
      "Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50\n",
      "Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 45435146 (43M) [application/octet-stream]\n",
      "Saving to: ‘Prime_Pantry.json.gz’\n",
      "\n",
      "Prime_Pantry.json.g 100%[===================>]  43.33M  12.3MB/s    in 4.8s    \n",
      "\n",
      "2021-07-13 22:06:57 (9.01 MB/s) - ‘Prime_Pantry.json.gz’ saved [45435146/45435146]\n",
      "\n",
      "--2021-07-13 22:06:57--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Prime_Pantry.json.gz\n",
      "Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50\n",
      "Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 5281662 (5.0M) [application/octet-stream]\n",
      "Saving to: ‘meta_Prime_Pantry.json.gz’\n",
      "\n",
      "meta_Prime_Pantry.j 100%[===================>]   5.04M  6.49MB/s    in 0.8s    \n",
      "\n",
      "2021-07-13 22:06:58 (6.49 MB/s) - ‘meta_Prime_Pantry.json.gz’ saved [5281662/5281662]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "data_dir = 'raw_data'\n",
    "!mkdir $data_dir\n",
    "\n",
    "!cd $data_dir && \\\n",
    "    wget http://deepyeti.ucsd.edu/jianmo/amazon/categoryFiles/Prime_Pantry.json.gz && \\\n",
    "    wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles2/meta_Prime_Pantry.json.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e66d003e",
   "metadata": {},
   "source": [
    "### Load and inspect reviews data\n",
    "\n",
    "We'll start by loading the reviews dataset for the Prime Pantry products and running some commands to see what we have to work with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "8289e424",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>overall</th>\n",
       "      <th>verified</th>\n",
       "      <th>reviewTime</th>\n",
       "      <th>reviewerID</th>\n",
       "      <th>asin</th>\n",
       "      <th>reviewerName</th>\n",
       "      <th>reviewText</th>\n",
       "      <th>summary</th>\n",
       "      <th>unixReviewTime</th>\n",
       "      <th>vote</th>\n",
       "      <th>image</th>\n",
       "      <th>style</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5</td>\n",
       "      <td>True</td>\n",
       "      <td>12 14, 2014</td>\n",
       "      <td>A1NKJW0TNRVS7O</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Tamara M.</td>\n",
       "      <td>Good clinging</td>\n",
       "      <td>Clings  well</td>\n",
       "      <td>1418515200</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>True</td>\n",
       "      <td>11 20, 2014</td>\n",
       "      <td>A2L6X37E8TFTCC</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Amazon Customer</td>\n",
       "      <td>Fantastic buy and a good plastic wrap.  Even t...</td>\n",
       "      <td>Saran could use more Plus to Cling better.</td>\n",
       "      <td>1416441600</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>True</td>\n",
       "      <td>10 11, 2014</td>\n",
       "      <td>A2WPR4W6V48121</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>noname</td>\n",
       "      <td>ok</td>\n",
       "      <td>Four Stars</td>\n",
       "      <td>1412985600</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>False</td>\n",
       "      <td>09 1, 2014</td>\n",
       "      <td>A27EE7X7L29UMU</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>ZapNZs</td>\n",
       "      <td>Saran Cling Plus is kind of like most of the C...</td>\n",
       "      <td>The wrap is fantastic, but the dispensing, cut...</td>\n",
       "      <td>1409529600</td>\n",
       "      <td>4</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>True</td>\n",
       "      <td>08 10, 2014</td>\n",
       "      <td>A1OWT4YZGB5GV9</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Amy Rogers</td>\n",
       "      <td>This is my go to plastic wrap so there isn't m...</td>\n",
       "      <td>has been doing it's job for years</td>\n",
       "      <td>1407628800</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   overall  verified   reviewTime      reviewerID        asin  \\\n",
       "0        5      True  12 14, 2014  A1NKJW0TNRVS7O  B0000DIWNZ   \n",
       "1        4      True  11 20, 2014  A2L6X37E8TFTCC  B0000DIWNZ   \n",
       "2        4      True  10 11, 2014  A2WPR4W6V48121  B0000DIWNZ   \n",
       "3        3     False   09 1, 2014  A27EE7X7L29UMU  B0000DIWNZ   \n",
       "4        4      True  08 10, 2014  A1OWT4YZGB5GV9  B0000DIWNZ   \n",
       "\n",
       "      reviewerName                                         reviewText  \\\n",
       "0        Tamara M.                                      Good clinging   \n",
       "1  Amazon Customer  Fantastic buy and a good plastic wrap.  Even t...   \n",
       "2           noname                                                 ok   \n",
       "3           ZapNZs  Saran Cling Plus is kind of like most of the C...   \n",
       "4       Amy Rogers  This is my go to plastic wrap so there isn't m...   \n",
       "\n",
       "                                             summary  unixReviewTime vote  \\\n",
       "0                                       Clings  well      1418515200  NaN   \n",
       "1         Saran could use more Plus to Cling better.      1416441600  NaN   \n",
       "2                                         Four Stars      1412985600  NaN   \n",
       "3  The wrap is fantastic, but the dispensing, cut...      1409529600    4   \n",
       "4                  has been doing it's job for years      1407628800  NaN   \n",
       "\n",
       "  image style  \n",
       "0   NaN   NaN  \n",
       "1   NaN   NaN  \n",
       "2   NaN   NaN  \n",
       "3   NaN   NaN  \n",
       "4   NaN   NaN  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pantry_df = pd.read_json(data_dir + '/Prime_Pantry.json.gz', lines=True, compression='infer')\n",
    "pantry_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "365cbe16",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(471614, 12)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pantry_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c65ae637",
   "metadata": {},
   "source": [
    "What can we learn from this output? There are over 471K reviews and 12 columns of data. The `asin` column is our unique item identifier, `reviewerID` is our unique user identifier, `unixReviewTime` is our timestamp for the review, and `overall` indicates the positivity of the review on a scale of 1-5. We will use this file as the basis for our interactions dataset for Personalize. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "891c6405",
   "metadata": {},
   "source": [
    "### Build and save interactions dataset\n",
    "\n",
    "Let's start building our interactions dataset by narrowing down the rows we want to include. The first step is to isolate only the positive reviews. For this we will assume any reviews with an overall rating of 4 or higher is a positive review. Anything rating of 3 or below are either mediocre or negative reviews."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ecfd527b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(387692, 12)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "positive_reviews_df = pantry_df[pantry_df['overall'] > 3]\n",
    "positive_reviews_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13abaf9d",
   "metadata": {},
   "source": [
    "We're down to 387K positive reviews. Still plenty for training a model in Personalize.\n",
    "\n",
    "Next let's narrow down the dataset to just the columns we need and add an `EVENT_TYPE` column to indicate the type of events we're capturing. Adding an `EVENT_TYPE` column now will make it easier to explore testing real-time events later if you choose to do so (since `eventType` is a required field for the [PutEvents](https://docs.aws.amazon.com/personalize/latest/dg/API_UBS_PutEvents.html) API)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "3ab7d273",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewerID</th>\n",
       "      <th>asin</th>\n",
       "      <th>unixReviewTime</th>\n",
       "      <th>overall</th>\n",
       "      <th>EVENT_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1NKJW0TNRVS7O</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1418515200</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2L6X37E8TFTCC</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1416441600</td>\n",
       "      <td>4</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2WPR4W6V48121</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1412985600</td>\n",
       "      <td>4</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A1OWT4YZGB5GV9</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1407628800</td>\n",
       "      <td>4</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>A1GN2ADKF1IE7K</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1405296000</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       reviewerID        asin  unixReviewTime  overall EVENT_TYPE\n",
       "0  A1NKJW0TNRVS7O  B0000DIWNZ      1418515200        5   reviewed\n",
       "1  A2L6X37E8TFTCC  B0000DIWNZ      1416441600        4   reviewed\n",
       "2  A2WPR4W6V48121  B0000DIWNZ      1412985600        4   reviewed\n",
       "4  A1OWT4YZGB5GV9  B0000DIWNZ      1407628800        4   reviewed\n",
       "5  A1GN2ADKF1IE7K  B0000DIWNZ      1405296000        5   reviewed"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "positive_reviews_df = positive_reviews_df[['reviewerID', 'asin', 'unixReviewTime', 'overall']]\n",
    "positive_reviews_df['EVENT_TYPE']='reviewed'\n",
    "\n",
    "positive_reviews_df.head()"
   ]
  },
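  {
   "cell_type": "markdown",
   "id": "b2d4f6a8",
   "metadata": {},
   "source": [
    "To show how the `reviewed` event type could carry over to real-time events, here is a minimal sketch of the arguments a PutEvents call might take. `build_review_event`, the session ID, and the `YOUR_TRACKING_ID` placeholder are assumptions for illustration; sending events also requires an event tracker, which is not created at this point in the notebook.\n",
    "\n",
    "```python\n",
    "from datetime import datetime, timezone\n",
    "\n",
    "\n",
    "def build_review_event(user_id, item_id, rating, unix_ts, session_id):\n",
    "    # Keyword arguments for put_events, minus the trackingId.\n",
    "    return {\n",
    "        'userId': user_id,\n",
    "        'sessionId': session_id,\n",
    "        'eventList': [{\n",
    "            'eventType': 'reviewed',      # same value as the EVENT_TYPE column\n",
    "            'itemId': item_id,\n",
    "            'eventValue': float(rating),  # the review's overall rating\n",
    "            'sentAt': datetime.fromtimestamp(unix_ts, tz=timezone.utc),\n",
    "        }],\n",
    "    }\n",
    "\n",
    "\n",
    "event = build_review_event('A1NKJW0TNRVS7O', 'B0000DIWNZ', 5, 1418515200, 'session-1')\n",
    "\n",
    "# Sending the event (not executed here; the tracking ID is a placeholder):\n",
    "# import boto3\n",
    "# boto3.client('personalize-events').put_events(trackingId='YOUR_TRACKING_ID', **event)\n",
    "```\n",
    "\n",
    "Keeping the event type identical between the historical CSV and any real-time events means the model sees one consistent interaction type."
   ]
  },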
  {
   "cell_type": "markdown",
   "id": "f05c33a8",
   "metadata": {},
   "source": [
    "One last check we should do is sanity check a `unixReviewTime` column value. Since Personalize builds sequence models based on the date and time of each interaction, it's important that the timestamp of each interaction is represented in the expected format so that it is interpreted correctly.\n",
    "\n",
    "Let's pick a value for the `unixReviewTime` column and parse it into a human-readable date so we can verify that it's reasonable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0b51f569",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1321488000\n",
      "2011-11-17 00:00:00\n"
     ]
    }
   ],
   "source": [
    "time_stamp = positive_reviews_df.iloc[50]['unixReviewTime']\n",
    "print(time_stamp)\n",
    "print(datetime.utcfromtimestamp(time_stamp).strftime('%Y-%m-%d %H:%M:%S'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29fe21fd",
   "metadata": {},
   "source": [
    "The timestamp value looks good. Let's get some final summary information for our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "31739017",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>reviewerID</th>\n",
       "      <th>asin</th>\n",
       "      <th>unixReviewTime</th>\n",
       "      <th>overall</th>\n",
       "      <th>EVENT_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>387692</td>\n",
       "      <td>387692</td>\n",
       "      <td>3.876920e+05</td>\n",
       "      <td>387692.000000</td>\n",
       "      <td>387692</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>202254</td>\n",
       "      <td>10584</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>A35Q0RBM3YNQNF</td>\n",
       "      <td>B00XA9DADC</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>176</td>\n",
       "      <td>5288</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>387692</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.468847e+09</td>\n",
       "      <td>4.847227</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.314975e+07</td>\n",
       "      <td>0.359769</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.073693e+09</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.447200e+09</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.474718e+09</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.498435e+09</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1.538611e+09</td>\n",
       "      <td>5.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            reviewerID        asin  unixReviewTime        overall EVENT_TYPE\n",
       "count           387692      387692    3.876920e+05  387692.000000     387692\n",
       "unique          202254       10584             NaN            NaN          1\n",
       "top     A35Q0RBM3YNQNF  B00XA9DADC             NaN            NaN   reviewed\n",
       "freq               176        5288             NaN            NaN     387692\n",
       "mean               NaN         NaN    1.468847e+09       4.847227        NaN\n",
       "std                NaN         NaN    4.314975e+07       0.359769        NaN\n",
       "min                NaN         NaN    1.073693e+09       4.000000        NaN\n",
       "25%                NaN         NaN    1.447200e+09       5.000000        NaN\n",
       "50%                NaN         NaN    1.474718e+09       5.000000        NaN\n",
       "75%                NaN         NaN    1.498435e+09       5.000000        NaN\n",
       "max                NaN         NaN    1.538611e+09       5.000000        NaN"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "positive_reviews_df.describe(include='all')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e54247f",
   "metadata": {},
   "source": [
    "We have 387K reviews for 202K distinct reviewers/users across 10K unique products. This is basis of our interactions dataset.\n",
    "\n",
    "Before we can use this as our interactions dataset, though, we need to rename the columns to match those expected by Personalize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "174de8be",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>USER_ID</th>\n",
       "      <th>ITEM_ID</th>\n",
       "      <th>TIMESTAMP</th>\n",
       "      <th>EVENT_VALUE</th>\n",
       "      <th>EVENT_TYPE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A1NKJW0TNRVS7O</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1418515200</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A2L6X37E8TFTCC</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1416441600</td>\n",
       "      <td>4</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A2WPR4W6V48121</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1412985600</td>\n",
       "      <td>4</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A1OWT4YZGB5GV9</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1407628800</td>\n",
       "      <td>4</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>A1GN2ADKF1IE7K</td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>1405296000</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>471609</th>\n",
       "      <td>A19GSVHXVT5NNF</td>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>1494892800</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>471610</th>\n",
       "      <td>ABSCTKLX9F9IU</td>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>1493769600</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>471611</th>\n",
       "      <td>A2R33RCWKDHZ3L</td>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>1492646400</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>471612</th>\n",
       "      <td>A2INGHYEXZDHMC</td>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>1492560000</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>471613</th>\n",
       "      <td>AJ8M6GVFGIRLS</td>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>1487808000</td>\n",
       "      <td>5</td>\n",
       "      <td>reviewed</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>387692 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "               USER_ID     ITEM_ID   TIMESTAMP  EVENT_VALUE EVENT_TYPE\n",
       "0       A1NKJW0TNRVS7O  B0000DIWNZ  1418515200            5   reviewed\n",
       "1       A2L6X37E8TFTCC  B0000DIWNZ  1416441600            4   reviewed\n",
       "2       A2WPR4W6V48121  B0000DIWNZ  1412985600            4   reviewed\n",
       "4       A1OWT4YZGB5GV9  B0000DIWNZ  1407628800            4   reviewed\n",
       "5       A1GN2ADKF1IE7K  B0000DIWNZ  1405296000            5   reviewed\n",
       "...                ...         ...         ...          ...        ...\n",
       "471609  A19GSVHXVT5NNF  B01HI8JVI8  1494892800            5   reviewed\n",
       "471610   ABSCTKLX9F9IU  B01HI8JVI8  1493769600            5   reviewed\n",
       "471611  A2R33RCWKDHZ3L  B01HI8JVI8  1492646400            5   reviewed\n",
       "471612  A2INGHYEXZDHMC  B01HI8JVI8  1492560000            5   reviewed\n",
       "471613   AJ8M6GVFGIRLS  B01HI8JVI8  1487808000            5   reviewed\n",
       "\n",
       "[387692 rows x 5 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "positive_reviews_df.rename(columns = {'reviewerID':'USER_ID', 'asin':'ITEM_ID', \n",
    "                              'unixReviewTime':'TIMESTAMP', 'overall': 'EVENT_VALUE'}, inplace = True)\n",
    "positive_reviews_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6fa5b5c",
   "metadata": {},
   "source": [
    "Finally, let's save our positive reviews dataframe as a CSV. We'll upload this CSV to Personalize later in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "299f803c",
   "metadata": {},
   "outputs": [],
   "source": [
    "interactions_filename = \"interactions.csv\"\n",
    "positive_reviews_df.to_csv(interactions_filename, index=False, float_format='%.0f')"
   ]
  },
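  {
   "cell_type": "markdown",
   "id": "c3e5a7b9",
   "metadata": {},
   "source": [
    "Personalize needs a schema describing these columns before the CSV can be imported. A minimal sketch of an Avro-style schema matching the five columns saved above might look like the following; the variable names are illustrative, and the schema this notebook actually registers may differ.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "# Field names mirror the column order written to interactions.csv.\n",
    "interactions_schema = {\n",
    "    'type': 'record',\n",
    "    'name': 'Interactions',\n",
    "    'namespace': 'com.amazonaws.personalize.schema',\n",
    "    'fields': [\n",
    "        {'name': 'USER_ID', 'type': 'string'},\n",
    "        {'name': 'ITEM_ID', 'type': 'string'},\n",
    "        {'name': 'TIMESTAMP', 'type': 'long'},\n",
    "        {'name': 'EVENT_VALUE', 'type': 'float'},\n",
    "        {'name': 'EVENT_TYPE', 'type': 'string'},\n",
    "    ],\n",
    "    'version': '1.0',\n",
    "}\n",
    "\n",
    "# The schema is passed to Personalize as a JSON string.\n",
    "schema_json = json.dumps(interactions_schema)\n",
    "```\n",
    "\n",
    "Comparing the field names against the CSV header before uploading is a cheap way to catch a renamed or missing column."
   ]
  },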
  {
   "cell_type": "markdown",
   "id": "f816d13d",
   "metadata": {},
   "source": [
    "### Load and inspect item metadata\n",
    "\n",
    "Now that we have the interactions dataset established, let's turn to the items dataset. This is where we will find the unstructured text value that we will include in the model.\n",
    "\n",
    "Like the reviews dataset, the Prime Pantry item metadata file is also represented in JSON. Due to the nested nature of this file, this will present some challenges in getting our data formatted the way we need it.\n",
    "\n",
    "Let's start by loading the metadata file into a dataframe and taking a look the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "6821a075",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>tech1</th>\n",
       "      <th>description</th>\n",
       "      <th>fit</th>\n",
       "      <th>title</th>\n",
       "      <th>also_buy</th>\n",
       "      <th>tech2</th>\n",
       "      <th>brand</th>\n",
       "      <th>feature</th>\n",
       "      <th>rank</th>\n",
       "      <th>also_view</th>\n",
       "      <th>details</th>\n",
       "      <th>main_cat</th>\n",
       "      <th>similar_item</th>\n",
       "      <th>date</th>\n",
       "      <th>price</th>\n",
       "      <th>asin</th>\n",
       "      <th>imageURL</th>\n",
       "      <th>imageURLHighRes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[Sink your sweet tooth into MILK DUDS Candya d...</td>\n",
       "      <td></td>\n",
       "      <td>HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...</td>\n",
       "      <td>[B019KE37WO, B007NQSWEU]</td>\n",
       "      <td></td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>{'ASIN: ': 'B00005BPJO', 'Item model number:':...</td>\n",
       "      <td>&lt;img src=\"https://m.media-amazon.com/images/G/...</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td>$5.00</td>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[Sink your sweet tooth into MILK DUDS Candya d...</td>\n",
       "      <td></td>\n",
       "      <td>HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...</td>\n",
       "      <td>[B019KE37WO, B007NQSWEU]</td>\n",
       "      <td></td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>{'ASIN: ': 'B00005BPJO', 'Item model number:':...</td>\n",
       "      <td>&lt;img src=\"https://m.media-amazon.com/images/G/...</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td>$5.00</td>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[A perfect Lentil soup starts with Goya Lentil...</td>\n",
       "      <td></td>\n",
       "      <td>Goya Dry Lentils, 16 oz</td>\n",
       "      <td>[B003SI144W, B000VDRKEK]</td>\n",
       "      <td></td>\n",
       "      <td>Goya</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[B074MFVZG7, B079PTH69L, B000VDRKEK, B074M9T81...</td>\n",
       "      <td>{'ASIN: ': 'B0000DIF38'}</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td></td>\n",
       "      <td>B0000DIF38</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[Saran Premium Wrap is an extra tough yet easy...</td>\n",
       "      <td></td>\n",
       "      <td>Saran Premium Plastic Wrap, 100 Sq Ft</td>\n",
       "      <td>[B01MY5FHT6, B000PYF8VM, B000SRMDFA, B07CX6LN8...</td>\n",
       "      <td></td>\n",
       "      <td>Saran</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[B077QLSLRQ, B00JPKW1RQ, B000FE2IK6, B00XUJHJ9...</td>\n",
       "      <td>{'Domestic Shipping: ': 'This item can only be...</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td></td>\n",
       "      <td>B0000DIWNI</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...</td>\n",
       "      <td></td>\n",
       "      <td>Saran Cling Plus Plastic Wrap, 200 Sq Ft</td>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>Saran</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[B0014CZ0TE]</td>\n",
       "      <td>{'Domestic Shipping: ': 'This item can only be...</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td></td>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10808</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[These bars are where our journey started and ...</td>\n",
       "      <td></td>\n",
       "      <td>KIND Bars, Caramel Almond &amp;amp; Sea Salt, Glut...</td>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>KIND</td>\n",
       "      <td>[]</td>\n",
       "      <td>26,259 in Grocery &amp; Gourmet Food (</td>\n",
       "      <td>[B00JQQAN60, B00JQQAWSY, B0111K7V54, B0111K8L9...</td>\n",
       "      <td>{'ASIN: ': 'B01HI76312', 'Item model number:':...</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td>$3.98</td>\n",
       "      <td>B01HI76312</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10809</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[These bars are where our journey started and ...</td>\n",
       "      <td></td>\n",
       "      <td>KIND Bars, Maple Glazed Pecan &amp;amp; Sea Salt, ...</td>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>KIND</td>\n",
       "      <td>[]</td>\n",
       "      <td>16,822 in Grocery &amp; Gourmet Food (</td>\n",
       "      <td>[B0111K97JC, B00JQQAN60, B0111K8L9Y, B01HI7631...</td>\n",
       "      <td>{'ASIN: ': 'B01HI76790'}</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td>$5.81</td>\n",
       "      <td>B01HI76790</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10810</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[These bars are where our journey started and ...</td>\n",
       "      <td></td>\n",
       "      <td>KIND Bars, Dark Chocolate Almond &amp;amp; Coconut...</td>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>KIND</td>\n",
       "      <td>[]</td>\n",
       "      <td>107,057 in Grocery &amp; Gourmet Food (</td>\n",
       "      <td>[B0111K7V54, B01HI76312, B00JQQAL0S, B0111K97J...</td>\n",
       "      <td>{'ASIN: ': 'B01HI76SA8'}</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td>$4.98</td>\n",
       "      <td>B01HI76SA8</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10811</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[These bars are where our journey started and ...</td>\n",
       "      <td></td>\n",
       "      <td>KIND Bars, Honey Roasted Nuts &amp;amp; Sea Salt, ...</td>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>KIND</td>\n",
       "      <td>[]</td>\n",
       "      <td>24,648 in Grocery &amp; Gourmet Food (</td>\n",
       "      <td>[B00JQQAN60, B0111K7V54, B01HI76312, B0111K97J...</td>\n",
       "      <td>{'ASIN: ': 'B01HI76XS0'}</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td>$5.81</td>\n",
       "      <td>B01HI76XS0</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "      <td>[https://images-na.ssl-images-amazon.com/image...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10812</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[Delicious, crowd pleasing and convenient it c...</td>\n",
       "      <td></td>\n",
       "      <td>Tostitos Salsa, Mild, 69 Ounce</td>\n",
       "      <td>[B00ID0EBP4, B00ID0EB1S, B00N0WZSOS, B00IAE65L...</td>\n",
       "      <td></td>\n",
       "      <td>Tostitos</td>\n",
       "      <td>[]</td>\n",
       "      <td>259,965 in Grocery &amp; Gourmet Food (</td>\n",
       "      <td>[]</td>\n",
       "      <td>{'ASIN: ': 'B01HI8JVI8'}</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaT</td>\n",
       "      <td>$6.82</td>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10813 rows × 19 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      category tech1                                        description fit  \\\n",
       "0           []        [Sink your sweet tooth into MILK DUDS Candya d...       \n",
       "1           []        [Sink your sweet tooth into MILK DUDS Candya d...       \n",
       "2           []        [A perfect Lentil soup starts with Goya Lentil...       \n",
       "3           []        [Saran Premium Wrap is an extra tough yet easy...       \n",
       "4           []        [200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...       \n",
       "...        ...   ...                                                ...  ..   \n",
       "10808       []        [These bars are where our journey started and ...       \n",
       "10809       []        [These bars are where our journey started and ...       \n",
       "10810       []        [These bars are where our journey started and ...       \n",
       "10811       []        [These bars are where our journey started and ...       \n",
       "10812       []        [Delicious, crowd pleasing and convenient it c...       \n",
       "\n",
       "                                                   title  \\\n",
       "0      HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...   \n",
       "1      HERSHEY'S Milk Duds Candy, 5 Ounce(Halloween C...   \n",
       "2                                Goya Dry Lentils, 16 oz   \n",
       "3                  Saran Premium Plastic Wrap, 100 Sq Ft   \n",
       "4               Saran Cling Plus Plastic Wrap, 200 Sq Ft   \n",
       "...                                                  ...   \n",
       "10808  KIND Bars, Caramel Almond &amp; Sea Salt, Glut...   \n",
       "10809  KIND Bars, Maple Glazed Pecan &amp; Sea Salt, ...   \n",
       "10810  KIND Bars, Dark Chocolate Almond &amp; Coconut...   \n",
       "10811  KIND Bars, Honey Roasted Nuts &amp; Sea Salt, ...   \n",
       "10812                     Tostitos Salsa, Mild, 69 Ounce   \n",
       "\n",
       "                                                also_buy tech2      brand  \\\n",
       "0                               [B019KE37WO, B007NQSWEU]        Milk Duds   \n",
       "1                               [B019KE37WO, B007NQSWEU]        Milk Duds   \n",
       "2                               [B003SI144W, B000VDRKEK]             Goya   \n",
       "3      [B01MY5FHT6, B000PYF8VM, B000SRMDFA, B07CX6LN8...            Saran   \n",
       "4                                                     []            Saran   \n",
       "...                                                  ...   ...        ...   \n",
       "10808                                                 []             KIND   \n",
       "10809                                                 []             KIND   \n",
       "10810                                                 []             KIND   \n",
       "10811                                                 []             KIND   \n",
       "10812  [B00ID0EBP4, B00ID0EB1S, B00N0WZSOS, B00IAE65L...         Tostitos   \n",
       "\n",
       "      feature                                 rank  \\\n",
       "0          []                                   []   \n",
       "1          []                                   []   \n",
       "2          []                                   []   \n",
       "3          []                                   []   \n",
       "4          []                                   []   \n",
       "...       ...                                  ...   \n",
       "10808      []   26,259 in Grocery & Gourmet Food (   \n",
       "10809      []   16,822 in Grocery & Gourmet Food (   \n",
       "10810      []  107,057 in Grocery & Gourmet Food (   \n",
       "10811      []   24,648 in Grocery & Gourmet Food (   \n",
       "10812      []  259,965 in Grocery & Gourmet Food (   \n",
       "\n",
       "                                               also_view  \\\n",
       "0                                                     []   \n",
       "1                                                     []   \n",
       "2      [B074MFVZG7, B079PTH69L, B000VDRKEK, B074M9T81...   \n",
       "3      [B077QLSLRQ, B00JPKW1RQ, B000FE2IK6, B00XUJHJ9...   \n",
       "4                                           [B0014CZ0TE]   \n",
       "...                                                  ...   \n",
       "10808  [B00JQQAN60, B00JQQAWSY, B0111K7V54, B0111K8L9...   \n",
       "10809  [B0111K97JC, B00JQQAN60, B0111K8L9Y, B01HI7631...   \n",
       "10810  [B0111K7V54, B01HI76312, B00JQQAL0S, B0111K97J...   \n",
       "10811  [B00JQQAN60, B0111K7V54, B01HI76312, B0111K97J...   \n",
       "10812                                                 []   \n",
       "\n",
       "                                                 details  \\\n",
       "0      {'ASIN: ': 'B00005BPJO', 'Item model number:':...   \n",
       "1      {'ASIN: ': 'B00005BPJO', 'Item model number:':...   \n",
       "2                               {'ASIN: ': 'B0000DIF38'}   \n",
       "3      {'Domestic Shipping: ': 'This item can only be...   \n",
       "4      {'Domestic Shipping: ': 'This item can only be...   \n",
       "...                                                  ...   \n",
       "10808  {'ASIN: ': 'B01HI76312', 'Item model number:':...   \n",
       "10809                           {'ASIN: ': 'B01HI76790'}   \n",
       "10810                           {'ASIN: ': 'B01HI76SA8'}   \n",
       "10811                           {'ASIN: ': 'B01HI76XS0'}   \n",
       "10812                           {'ASIN: ': 'B01HI8JVI8'}   \n",
       "\n",
       "                                                main_cat similar_item date  \\\n",
       "0      <img src=\"https://m.media-amazon.com/images/G/...               NaT   \n",
       "1      <img src=\"https://m.media-amazon.com/images/G/...               NaT   \n",
       "2      <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "3      <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "4      <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "...                                                  ...          ...  ...   \n",
       "10808  <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "10809  <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "10810  <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "10811  <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "10812  <img src=\"https://images-na.ssl-images-amazon....               NaT   \n",
       "\n",
       "       price        asin                                           imageURL  \\\n",
       "0      $5.00  B00005BPJO  [https://images-na.ssl-images-amazon.com/image...   \n",
       "1      $5.00  B00005BPJO  [https://images-na.ssl-images-amazon.com/image...   \n",
       "2             B0000DIF38  [https://images-na.ssl-images-amazon.com/image...   \n",
       "3             B0000DIWNI  [https://images-na.ssl-images-amazon.com/image...   \n",
       "4             B0000DIWNZ  [https://images-na.ssl-images-amazon.com/image...   \n",
       "...      ...         ...                                                ...   \n",
       "10808  $3.98  B01HI76312  [https://images-na.ssl-images-amazon.com/image...   \n",
       "10809  $5.81  B01HI76790  [https://images-na.ssl-images-amazon.com/image...   \n",
       "10810  $4.98  B01HI76SA8                                                 []   \n",
       "10811  $5.81  B01HI76XS0  [https://images-na.ssl-images-amazon.com/image...   \n",
       "10812  $6.82  B01HI8JVI8                                                 []   \n",
       "\n",
       "                                         imageURLHighRes  \n",
       "0      [https://images-na.ssl-images-amazon.com/image...  \n",
       "1      [https://images-na.ssl-images-amazon.com/image...  \n",
       "2      [https://images-na.ssl-images-amazon.com/image...  \n",
       "3      [https://images-na.ssl-images-amazon.com/image...  \n",
       "4      [https://images-na.ssl-images-amazon.com/image...  \n",
       "...                                                  ...  \n",
       "10808  [https://images-na.ssl-images-amazon.com/image...  \n",
       "10809  [https://images-na.ssl-images-amazon.com/image...  \n",
       "10810                                                 []  \n",
       "10811  [https://images-na.ssl-images-amazon.com/image...  \n",
       "10812                                                 []  \n",
       "\n",
       "[10813 rows x 19 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pantry_meta_df = pd.read_json('raw_data/meta_Prime_Pantry.json.gz', lines=True, compression='infer')\n",
    "pantry_meta_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "7a8a975e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>tech1</th>\n",
       "      <th>description</th>\n",
       "      <th>fit</th>\n",
       "      <th>title</th>\n",
       "      <th>also_buy</th>\n",
       "      <th>tech2</th>\n",
       "      <th>brand</th>\n",
       "      <th>feature</th>\n",
       "      <th>rank</th>\n",
       "      <th>also_view</th>\n",
       "      <th>details</th>\n",
       "      <th>main_cat</th>\n",
       "      <th>similar_item</th>\n",
       "      <th>date</th>\n",
       "      <th>price</th>\n",
       "      <th>asin</th>\n",
       "      <th>imageURL</th>\n",
       "      <th>imageURLHighRes</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>0</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>unique</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>9409</td>\n",
       "      <td>1</td>\n",
       "      <td>10782</td>\n",
       "      <td>3957</td>\n",
       "      <td>1</td>\n",
       "      <td>1960</td>\n",
       "      <td>763</td>\n",
       "      <td>4828</td>\n",
       "      <td>5940</td>\n",
       "      <td>10786</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1482</td>\n",
       "      <td>10812</td>\n",
       "      <td>8940</td>\n",
       "      <td>8940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>top</th>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>Infants' Motrin Concentrated Drops, Fever Redu...</td>\n",
       "      <td>[]</td>\n",
       "      <td></td>\n",
       "      <td>L'Oreal Paris</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "      <td>{}</td>\n",
       "      <td>&lt;img src=\"https://images-na.ssl-images-amazon....</td>\n",
       "      <td></td>\n",
       "      <td>NaN</td>\n",
       "      <td></td>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>[]</td>\n",
       "      <td>[]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>freq</th>\n",
       "      <td>10813</td>\n",
       "      <td>10813</td>\n",
       "      <td>98</td>\n",
       "      <td>10813</td>\n",
       "      <td>2</td>\n",
       "      <td>6754</td>\n",
       "      <td>10813</td>\n",
       "      <td>171</td>\n",
       "      <td>9777</td>\n",
       "      <td>5937</td>\n",
       "      <td>4835</td>\n",
       "      <td>24</td>\n",
       "      <td>10621</td>\n",
       "      <td>10813</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4063</td>\n",
       "      <td>2</td>\n",
       "      <td>1781</td>\n",
       "      <td>1781</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       category  tech1 description    fit  \\\n",
       "count     10813  10813       10813  10813   \n",
       "unique        1      1        9409      1   \n",
       "top          []                 []          \n",
       "freq      10813  10813          98  10813   \n",
       "\n",
       "                                                    title also_buy  tech2  \\\n",
       "count                                               10813    10813  10813   \n",
       "unique                                              10782     3957      1   \n",
       "top     Infants' Motrin Concentrated Drops, Fever Redu...       []          \n",
       "freq                                                    2     6754  10813   \n",
       "\n",
       "                brand feature   rank also_view details  \\\n",
       "count           10813   10813  10813     10813   10813   \n",
       "unique           1960     763   4828      5940   10786   \n",
       "top     L'Oreal Paris      []     []        []      {}   \n",
       "freq              171    9777   5937      4835      24   \n",
       "\n",
       "                                                 main_cat similar_item date  \\\n",
       "count                                               10813        10813    0   \n",
       "unique                                                  4            1    0   \n",
       "top     <img src=\"https://images-na.ssl-images-amazon....               NaN   \n",
       "freq                                                10621        10813  NaN   \n",
       "\n",
       "        price        asin imageURL imageURLHighRes  \n",
       "count   10813       10813    10813           10813  \n",
       "unique   1482       10812     8940            8940  \n",
       "top            B00005BPJO       []              []  \n",
       "freq     4063           2     1781            1781  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pantry_meta_df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a80c116",
   "metadata": {},
   "source": [
    "So what can we learn from this information? First, there are over 10K products represented in the metadata file. Most of the columns will be of little value to us for Personalize since they aren't relevant as features (image URLs, `details`, `also_viewed`, `also_buy`, etc) or are mostly blank/sparse (`category`, `fit`, `tech1`, etc). The `asin` column is our unique identifier for each item (although there looks to be one duplicate) and `brand` and `price` look like they may be useful. The `description` column is what we will use for unstructured text.\n",
    "\n",
    "However, we have to do some cleanup and reformatting of the fields we want to use in our items dataset. For example, the `price` field is a formatted currency value (string) and not numeric and the `description` field was loaded as an array of strings due to how they values were represented and parsed from the original JSON file. Lastly, the `description` values also contain HTML markup that needs to stripped.\n",
    "\n",
    "Let's start by creating a dataframe with just the columns we need for the items dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c24bdc6e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>asin</th>\n",
       "      <th>brand</th>\n",
       "      <th>price</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>$5.00</td>\n",
       "      <td>[Sink your sweet tooth into MILK DUDS Candya d...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>$5.00</td>\n",
       "      <td>[Sink your sweet tooth into MILK DUDS Candya d...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B0000DIF38</td>\n",
       "      <td>Goya</td>\n",
       "      <td></td>\n",
       "      <td>[A perfect Lentil soup starts with Goya Lentil...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B0000DIWNI</td>\n",
       "      <td>Saran</td>\n",
       "      <td></td>\n",
       "      <td>[Saran Premium Wrap is an extra tough yet easy...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Saran</td>\n",
       "      <td></td>\n",
       "      <td>[200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>B0000GH6UG</td>\n",
       "      <td>Ibarra</td>\n",
       "      <td></td>\n",
       "      <td>[Ibarra Chocolate, 19 Oz, , ]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>B0000KC2BK</td>\n",
       "      <td>Knorr</td>\n",
       "      <td>$3.09</td>\n",
       "      <td>[Knorr Granulated Chicken Flavor Bouillon is a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>B0001E1IN8</td>\n",
       "      <td>Castillo</td>\n",
       "      <td></td>\n",
       "      <td>[Red chili habanero sauces. They are present t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>B00032E8XK</td>\n",
       "      <td>Chicken of the Sea</td>\n",
       "      <td>$1.48</td>\n",
       "      <td>[Chicken of the Sea Solid White Albacore Tuna ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>B0005XMTHE</td>\n",
       "      <td>Smucker's</td>\n",
       "      <td>$2.29</td>\n",
       "      <td>[Helps build muscles with bcaa's amino acids, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         asin               brand  price  \\\n",
       "0  B00005BPJO           Milk Duds  $5.00   \n",
       "1  B00005BPJO           Milk Duds  $5.00   \n",
       "2  B0000DIF38                Goya          \n",
       "3  B0000DIWNI               Saran          \n",
       "4  B0000DIWNZ               Saran          \n",
       "5  B0000GH6UG              Ibarra          \n",
       "6  B0000KC2BK               Knorr  $3.09   \n",
       "7  B0001E1IN8            Castillo          \n",
       "8  B00032E8XK  Chicken of the Sea  $1.48   \n",
       "9  B0005XMTHE           Smucker's  $2.29   \n",
       "\n",
       "                                         description  \n",
       "0  [Sink your sweet tooth into MILK DUDS Candya d...  \n",
       "1  [Sink your sweet tooth into MILK DUDS Candya d...  \n",
       "2  [A perfect Lentil soup starts with Goya Lentil...  \n",
       "3  [Saran Premium Wrap is an extra tough yet easy...  \n",
       "4  [200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Eas...  \n",
       "5                      [Ibarra Chocolate, 19 Oz, , ]  \n",
       "6  [Knorr Granulated Chicken Flavor Bouillon is a...  \n",
       "7  [Red chili habanero sauces. They are present t...  \n",
       "8  [Chicken of the Sea Solid White Albacore Tuna ...  \n",
       "9  [Helps build muscles with bcaa's amino acids, ...  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "items_df = pantry_meta_df.copy()\n",
    "items_df = items_df[['asin', 'brand', 'price', 'description']]\n",
    "items_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0bf2443",
   "metadata": {},
   "source": [
    "Next let's drop duplicate rows based on the `asin` column value. There should only be one duplicate based on the `describe()` output above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c53122a5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(10812, 4)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "items_df = items_df.drop_duplicates(subset=['asin'], keep='last')\n",
    "items_df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "090316bb",
   "metadata": {},
   "source": [
    "Next let's focus on reformatting and cleaning up the `description` column values. As you can see above, the `description` is currently represented as an array of strings (because that's how it is represented in the JSON file). We need to flatten this array into a single string and strip all HTML markup from each fragment.\n",
    "\n",
    "We'll start by creating two utility functions that will be used to clean the `description` (and later the `title` column in the original dataset when we want to display titles for recommended products)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "32c14d03",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Strips and cleans a value of HTML markup and whitespace.\n",
    "def clean_markup(value):\n",
    "    s = str(value).strip()\n",
    "    if s != '':\n",
    "        s = str(html.fromstring(s).text_content())\n",
    "        s = ' '.join(s.split())\n",
    "                \n",
    "    return s.strip()\n",
    "\n",
    "# Cleans and reformats the description column value for a dataframe row.\n",
    "def clean_and_reformat_description(row):\n",
    "    s = ''\n",
    "    for el in row['description']:\n",
    "        el = clean_markup(el)\n",
    "        if el != '':\n",
    "            s += ' ' + el\n",
    "                \n",
    "    return s.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "aa51af53",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>asin</th>\n",
       "      <th>brand</th>\n",
       "      <th>price</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>$5.00</td>\n",
       "      <td>Sink your sweet tooth into MILK DUDS Candya de...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B0000DIF38</td>\n",
       "      <td>Goya</td>\n",
       "      <td></td>\n",
       "      <td>A perfect Lentil soup starts with Goya Lentils...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B0000DIWNI</td>\n",
       "      <td>Saran</td>\n",
       "      <td></td>\n",
       "      <td>Saran Premium Wrap is an extra tough yet easy ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Saran</td>\n",
       "      <td></td>\n",
       "      <td>200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>B0000GH6UG</td>\n",
       "      <td>Ibarra</td>\n",
       "      <td></td>\n",
       "      <td>Ibarra Chocolate, 19 Oz</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10808</th>\n",
       "      <td>B01HI76312</td>\n",
       "      <td>KIND</td>\n",
       "      <td>$3.98</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10809</th>\n",
       "      <td>B01HI76790</td>\n",
       "      <td>KIND</td>\n",
       "      <td>$5.81</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10810</th>\n",
       "      <td>B01HI76SA8</td>\n",
       "      <td>KIND</td>\n",
       "      <td>$4.98</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10811</th>\n",
       "      <td>B01HI76XS0</td>\n",
       "      <td>KIND</td>\n",
       "      <td>$5.81</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10812</th>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>Tostitos</td>\n",
       "      <td>$6.82</td>\n",
       "      <td>Delicious, crowd pleasing and convenient it ca...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10812 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             asin      brand  price  \\\n",
       "1      B00005BPJO  Milk Duds  $5.00   \n",
       "2      B0000DIF38       Goya          \n",
       "3      B0000DIWNI      Saran          \n",
       "4      B0000DIWNZ      Saran          \n",
       "5      B0000GH6UG     Ibarra          \n",
       "...           ...        ...    ...   \n",
       "10808  B01HI76312       KIND  $3.98   \n",
       "10809  B01HI76790       KIND  $5.81   \n",
       "10810  B01HI76SA8       KIND  $4.98   \n",
       "10811  B01HI76XS0       KIND  $5.81   \n",
       "10812  B01HI8JVI8   Tostitos  $6.82   \n",
       "\n",
       "                                             description  \n",
       "1      Sink your sweet tooth into MILK DUDS Candya de...  \n",
       "2      A perfect Lentil soup starts with Goya Lentils...  \n",
       "3      Saran Premium Wrap is an extra tough yet easy ...  \n",
       "4      200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...  \n",
       "5                                Ibarra Chocolate, 19 Oz  \n",
       "...                                                  ...  \n",
       "10808  These bars are where our journey started and i...  \n",
       "10809  These bars are where our journey started and i...  \n",
       "10810  These bars are where our journey started and i...  \n",
       "10811  These bars are where our journey started and i...  \n",
       "10812  Delicious, crowd pleasing and convenient it ca...  \n",
       "\n",
       "[10812 rows x 4 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "items_df['description'] = items_df.apply(clean_and_reformat_description, axis=1)\n",
    "items_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9089ade",
   "metadata": {},
   "source": [
    "Next let's take a look at the `price` column and change its type from a string to a float."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "06d8448c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "          4063\n",
       "$2.99      114\n",
       "$3.99      113\n",
       "$4.99      103\n",
       "$5.99       87\n",
       "          ... \n",
       "$20.42       1\n",
       "$32.32       1\n",
       "$1.52        1\n",
       "$27.89       1\n",
       "$39.10       1\n",
       "Name: price, Length: 1482, dtype: int64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "items_df['price'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f1ae9a2",
   "metadata": {},
   "source": [
    "The following cell with convert empty/non-numeric prices to `np.nan` and all others will have the `$` currency symbol removed. This will allow us to coerce the type to a float."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "c233e0bb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>asin</th>\n",
       "      <th>brand</th>\n",
       "      <th>price</th>\n",
       "      <th>description</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>5.00</td>\n",
       "      <td>Sink your sweet tooth into MILK DUDS Candya de...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B0000DIF38</td>\n",
       "      <td>Goya</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A perfect Lentil soup starts with Goya Lentils...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B0000DIWNI</td>\n",
       "      <td>Saran</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Saran Premium Wrap is an extra tough yet easy ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Saran</td>\n",
       "      <td>NaN</td>\n",
       "      <td>200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>B0000GH6UG</td>\n",
       "      <td>Ibarra</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Ibarra Chocolate, 19 Oz</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10808</th>\n",
       "      <td>B01HI76312</td>\n",
       "      <td>KIND</td>\n",
       "      <td>3.98</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10809</th>\n",
       "      <td>B01HI76790</td>\n",
       "      <td>KIND</td>\n",
       "      <td>5.81</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10810</th>\n",
       "      <td>B01HI76SA8</td>\n",
       "      <td>KIND</td>\n",
       "      <td>4.98</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10811</th>\n",
       "      <td>B01HI76XS0</td>\n",
       "      <td>KIND</td>\n",
       "      <td>5.81</td>\n",
       "      <td>These bars are where our journey started and i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10812</th>\n",
       "      <td>B01HI8JVI8</td>\n",
       "      <td>Tostitos</td>\n",
       "      <td>6.82</td>\n",
       "      <td>Delicious, crowd pleasing and convenient it ca...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10812 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             asin      brand price  \\\n",
       "1      B00005BPJO  Milk Duds  5.00   \n",
       "2      B0000DIF38       Goya   NaN   \n",
       "3      B0000DIWNI      Saran   NaN   \n",
       "4      B0000DIWNZ      Saran   NaN   \n",
       "5      B0000GH6UG     Ibarra   NaN   \n",
       "...           ...        ...   ...   \n",
       "10808  B01HI76312       KIND  3.98   \n",
       "10809  B01HI76790       KIND  5.81   \n",
       "10810  B01HI76SA8       KIND  4.98   \n",
       "10811  B01HI76XS0       KIND  5.81   \n",
       "10812  B01HI8JVI8   Tostitos  6.82   \n",
       "\n",
       "                                             description  \n",
       "1      Sink your sweet tooth into MILK DUDS Candya de...  \n",
       "2      A perfect Lentil soup starts with Goya Lentils...  \n",
       "3      Saran Premium Wrap is an extra tough yet easy ...  \n",
       "4      200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...  \n",
       "5                                Ibarra Chocolate, 19 Oz  \n",
       "...                                                  ...  \n",
       "10808  These bars are where our journey started and i...  \n",
       "10809  These bars are where our journey started and i...  \n",
       "10810  These bars are where our journey started and i...  \n",
       "10811  These bars are where our journey started and i...  \n",
       "10812  Delicious, crowd pleasing and convenient it ca...  \n",
       "\n",
       "[10812 rows x 4 columns]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def convert_price(row):\n",
    "    v = str(row['price']).strip().replace('$', '')\n",
    "    if v == '' or not v.lstrip('-').replace('.', '').isdigit():\n",
    "        return np.nan\n",
    "    return v\n",
    "\n",
    "items_df['price'] = items_df.apply(convert_price, axis=1)\n",
    "items_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "cc688b06",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.99     114\n",
       "3.99     113\n",
       "4.99     103\n",
       "5.99      87\n",
       "2.98      76\n",
       "        ... \n",
       "39.10      1\n",
       "1.84       1\n",
       "22.95      1\n",
       "12.17      1\n",
       "11.09      1\n",
       "Name: price, Length: 1480, dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "items_df['price'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "00b61ccc",
   "metadata": {},
   "outputs": [],
   "source": [
    "items_df['price'] = items_df['price'].astype(float)"
   ]
  },
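  {
   "cell_type": "markdown",
   "id": "5c2e7a91",
   "metadata": {},
   "source": [
    "As an aside, the same clean-up can probably be done without a row-wise `apply`. The sketch below is an alternative, not the approach used in this notebook. It assumes pandas is available and uses a small made-up `sample_prices` series to show `pd.to_numeric` with `errors='coerce'` turning empty or non-numeric prices into `NaN` in one step.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample values mirroring the kinds of strings seen in the price column.\n",
    "sample_prices = pd.Series(['$5.00', '', '$3.09', 'n/a'])\n",
    "\n",
    "# Drop surrounding whitespace and the currency symbol, then coerce to float;\n",
    "# anything that does not parse as a number becomes NaN.\n",
    "cleaned = pd.to_numeric(\n",
    "    sample_prices.str.strip().str.replace('$', '', regex=False),\n",
    "    errors='coerce',\n",
    ")\n",
    "\n",
    "print(cleaned.tolist())\n",
    "```\n",
    "\n",
    "Values with thousands separators, such as `$1,234.00`, would also become `NaN` here, as they do with `convert_price` above."
   ]
  },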
  {
   "cell_type": "markdown",
   "id": "4e124320",
   "metadata": {},
   "source": [
    "Next we'll rename the columns to match the names and uppercase name format expected by Personalize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "191f558a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ITEM_ID</th>\n",
       "      <th>BRAND</th>\n",
       "      <th>PRICE</th>\n",
       "      <th>DESCRIPTION</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>5.00</td>\n",
       "      <td>Sink your sweet tooth into MILK DUDS Candya de...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B0000DIF38</td>\n",
       "      <td>Goya</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A perfect Lentil soup starts with Goya Lentils...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B0000DIWNI</td>\n",
       "      <td>Saran</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Saran Premium Wrap is an extra tough yet easy ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Saran</td>\n",
       "      <td>NaN</td>\n",
       "      <td>200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>B0000GH6UG</td>\n",
       "      <td>Ibarra</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Ibarra Chocolate, 19 Oz</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>B0000KC2BK</td>\n",
       "      <td>Knorr</td>\n",
       "      <td>3.09</td>\n",
       "      <td>Knorr Granulated Chicken Flavor Bouillon is a ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>B0001E1IN8</td>\n",
       "      <td>Castillo</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Red chili habanero sauces. They are present to...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>B00032E8XK</td>\n",
       "      <td>Chicken of the Sea</td>\n",
       "      <td>1.48</td>\n",
       "      <td>Chicken of the Sea Solid White Albacore Tuna i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>B0005XMTHE</td>\n",
       "      <td>Smucker's</td>\n",
       "      <td>2.29</td>\n",
       "      <td>Helps build muscles with bcaa's amino acids, i...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>B0005XNE6E</td>\n",
       "      <td>Snapple</td>\n",
       "      <td>1.99</td>\n",
       "      <td>At Snapple, we believe lifes a peach. Weve bee...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       ITEM_ID               BRAND  PRICE  \\\n",
       "1   B00005BPJO           Milk Duds   5.00   \n",
       "2   B0000DIF38                Goya    NaN   \n",
       "3   B0000DIWNI               Saran    NaN   \n",
       "4   B0000DIWNZ               Saran    NaN   \n",
       "5   B0000GH6UG              Ibarra    NaN   \n",
       "6   B0000KC2BK               Knorr   3.09   \n",
       "7   B0001E1IN8            Castillo    NaN   \n",
       "8   B00032E8XK  Chicken of the Sea   1.48   \n",
       "9   B0005XMTHE           Smucker's   2.29   \n",
       "10  B0005XNE6E             Snapple   1.99   \n",
       "\n",
       "                                          DESCRIPTION  \n",
       "1   Sink your sweet tooth into MILK DUDS Candya de...  \n",
       "2   A perfect Lentil soup starts with Goya Lentils...  \n",
       "3   Saran Premium Wrap is an extra tough yet easy ...  \n",
       "4   200 sq ft (285 ft x 11-3/4 in x 18.6 m2). Easy...  \n",
       "5                             Ibarra Chocolate, 19 Oz  \n",
       "6   Knorr Granulated Chicken Flavor Bouillon is a ...  \n",
       "7   Red chili habanero sauces. They are present to...  \n",
       "8   Chicken of the Sea Solid White Albacore Tuna i...  \n",
       "9   Helps build muscles with bcaa's amino acids, i...  \n",
       "10  At Snapple, we believe lifes a peach. Weve bee...  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "items_df.rename(columns = {'asin':'ITEM_ID', 'brand':'BRAND', \n",
    "                              'price':'PRICE', 'description': 'DESCRIPTION'}, inplace = True)\n",
    "items_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "517f9ad9",
   "metadata": {},
   "source": [
    "We'll be creating two items CSVs. One will have the description column and the other will not. We'll use each of these to train separate models with the same recipe so we can compare the offline metrics and do some online inspection of recommendations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "d4e21ac2",
   "metadata": {},
   "outputs": [],
   "source": [
    "items_with_desc_filename = \"items-with-desc.csv\"\n",
    "items_df.to_csv(items_with_desc_filename, index=False, float_format='%.2f')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15dee22c",
   "metadata": {},
   "source": [
    "Another items CSV with the description column removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "2a01eb5f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ITEM_ID</th>\n",
       "      <th>BRAND</th>\n",
       "      <th>PRICE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>B00005BPJO</td>\n",
       "      <td>Milk Duds</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>B0000DIF38</td>\n",
       "      <td>Goya</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>B0000DIWNI</td>\n",
       "      <td>Saran</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>B0000DIWNZ</td>\n",
       "      <td>Saran</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>B0000GH6UG</td>\n",
       "      <td>Ibarra</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      ITEM_ID      BRAND  PRICE\n",
       "1  B00005BPJO  Milk Duds    5.0\n",
       "2  B0000DIF38       Goya    NaN\n",
       "3  B0000DIWNI      Saran    NaN\n",
       "4  B0000DIWNZ      Saran    NaN\n",
       "5  B0000GH6UG     Ibarra    NaN"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "items_without_desc_df = items_df[['ITEM_ID', 'BRAND', 'PRICE']]\n",
    "items_without_desc_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "a73cf6ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "items_without_desc_filename = \"items-without-desc.csv\"\n",
    "items_without_desc_df.to_csv(items_without_desc_filename, index=False, float_format='%.2f')"
   ]
  },
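  {
   "cell_type": "markdown",
   "id": "9d41b6f0",
   "metadata": {},
   "source": [
    "Rows with a missing price are still written to both CSVs. As a small illustration (using a made-up two-row `sample_df`, not the real data), `to_csv` writes `NaN` as an empty field by default, while `float_format='%.2f'` keeps two decimal places on the prices that are present.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical two-row frame: one item with a price and one without.\n",
    "sample_df = pd.DataFrame({'ITEM_ID': ['A1', 'B2'], 'PRICE': [5.0, float('nan')]})\n",
    "\n",
    "# With no path argument, to_csv returns the CSV text instead of writing a file.\n",
    "csv_text = sample_df.to_csv(index=False, float_format='%.2f')\n",
    "csv_lines = csv_text.splitlines()\n",
    "\n",
    "print(csv_lines)\n",
    "```"
   ]
  },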
  {
   "cell_type": "markdown",
   "id": "1d67baf5",
   "metadata": {},
   "source": [
    "## Create dataset groups and upload datasets\n",
    "\n",
    "With the datasets that we need built, now it's time to upload them to Personalize using dataset import jobs. Before we can upload the CSVs, we need to create dataset groups to hold our two dataset approaches (without and with descriptions), create schemas for our datasets, and create datasets.\n",
    "\n",
    "We'll start by creating SDK client that we'll need to interact with Personalize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "11a54883",
   "metadata": {},
   "outputs": [],
   "source": [
    "personalize = boto3.client('personalize')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbc9f544",
   "metadata": {},
   "source": [
    "### Create dataset groups\n",
    "\n",
    "Let's create our two dataset groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "7d433f5a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetGroupArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-without-desc\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"20bd153c-ebd0-432d-9ef3-522a0d2fd8d4\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:08:18 GMT\",\n",
      "      \"x-amzn-requestid\": \"20bd153c-ebd0-432d-9ef3-522a0d2fd8d4\",\n",
      "      \"content-length\": \"105\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_group_response = personalize.create_dataset_group(\n",
    "    name = \"amazon-pantry-without-desc\"\n",
    ")\n",
    "\n",
    "dataset_group_without_desc_arn = create_dataset_group_response['datasetGroupArn']\n",
    "print(json.dumps(create_dataset_group_response, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "1a071f53",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetGroupArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-with-desc\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"9cec53c8-28a3-40e5-bf0e-6f87a03113f7\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:08:18 GMT\",\n",
      "      \"x-amzn-requestid\": \"9cec53c8-28a3-40e5-bf0e-6f87a03113f7\",\n",
      "      \"content-length\": \"102\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_group_response = personalize.create_dataset_group(\n",
    "    name = \"amazon-pantry-with-desc\"\n",
    ")\n",
    "\n",
    "dataset_group_with_desc_arn = create_dataset_group_response['datasetGroupArn']\n",
    "print(json.dumps(create_dataset_group_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20481190",
   "metadata": {},
   "source": [
    "Since dataset groups can take a few seconds to be fully created, let's wait until they both have a status of ACTIVE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "8dfac385",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At least one dataset group create is still in progress\n",
      "Dataset group create succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-without-desc\n",
      "At least one dataset group create is still in progress\n",
      "Dataset group create succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-group/amazon-pantry-with-desc\n"
     ]
    }
   ],
   "source": [
    "in_progress_dataset_group_arns = [ dataset_group_without_desc_arn, dataset_group_with_desc_arn ]\n",
    "\n",
    "max_time = time.time() + 3*60*60 # 3 hours\n",
    "while time.time() < max_time:\n",
    "    for dataset_group_arn in in_progress_dataset_group_arns:\n",
    "        describe_dataset_group_response = personalize.describe_dataset_group(\n",
    "            datasetGroupArn = dataset_group_arn\n",
    "        )\n",
    "        status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n",
    "        if status == \"ACTIVE\":\n",
    "            print(\"Dataset group create succeeded for {}\".format(dataset_group_arn))\n",
    "            in_progress_dataset_group_arns.remove(dataset_group_arn)\n",
    "        elif status == \"CREATE FAILED\":\n",
    "            print(\"Create failed for {}\".format(dataset_group_arn))\n",
    "            in_progress_dataset_group_arns.remove(dataset_group_arn)\n",
    "\n",
    "    if len(in_progress_dataset_group_arns) <= 0:\n",
    "        break\n",
    "    else:\n",
    "        print(\"At least one dataset group create is still in progress\")\n",
    "                \n",
    "    time.sleep(10)"
   ]
  },
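  {
   "cell_type": "markdown",
   "id": "e37a08c4",
   "metadata": {},
   "source": [
    "A polling loop like the one above could be factored into a small helper. The sketch below is illustrative only: the helper name `wait_until_done`, its parameters, and the fake status function used to exercise it are assumptions and are not part of the Personalize SDK.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "# Hypothetical helper: poll get_status(arn) until every ARN reaches a terminal status.\n",
    "# Returns a dict mapping each finished ARN to its final status; ARNs that are still\n",
    "# pending when the timeout expires are left out of the result.\n",
    "def wait_until_done(arns, get_status, interval_seconds=10, timeout_seconds=3 * 60 * 60):\n",
    "    pending = list(arns)\n",
    "    finished = {}\n",
    "    deadline = time.time() + timeout_seconds\n",
    "    while pending and time.time() < deadline:\n",
    "        # Iterate over a copy so that removing items from the pending list is safe.\n",
    "        for arn in list(pending):\n",
    "            status = get_status(arn)\n",
    "            if status in ('ACTIVE', 'CREATE FAILED'):\n",
    "                finished[arn] = status\n",
    "                pending.remove(arn)\n",
    "        if pending:\n",
    "            time.sleep(interval_seconds)\n",
    "    return finished\n",
    "\n",
    "# Exercise the helper with a fake status function (no AWS calls are made).\n",
    "fake_statuses = {'arn-a': iter(['CREATE PENDING', 'ACTIVE']), 'arn-b': iter(['ACTIVE'])}\n",
    "result = wait_until_done(['arn-a', 'arn-b'], lambda arn: next(fake_statuses[arn]), interval_seconds=0)\n",
    "print(result)\n",
    "```\n",
    "\n",
    "With the real client, `get_status` could be `lambda arn: personalize.describe_dataset_group(datasetGroupArn=arn)['datasetGroup']['status']`."
   ]
  },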
  {
   "cell_type": "markdown",
   "id": "b16fdbdb",
   "metadata": {},
   "source": [
    "### Create Interactions dataset schema and datasets\n",
    "\n",
    "Since the interactions dataset will be the same for both dataset groups, we will be creating a single schema for the interactions dataset type and sharing it across both dataset groups. This is possible since schemas are global to your AWS account and not specific to a dataset group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "38851dc2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"schemaArn\": \"arn:aws:personalize:us-east-1:224124347618:schema/amazon-pantry-interactions\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"6be5e019-0c8e-487b-a158-6f089f9e79ca\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:08:40 GMT\",\n",
      "      \"x-amzn-requestid\": \"6be5e019-0c8e-487b-a158-6f089f9e79ca\",\n",
      "      \"content-length\": \"92\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "interactions_schema = schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Interactions\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"USER_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"TIMESTAMP\",\n",
    "            \"type\": \"long\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"EVENT_VALUE\",\n",
    "            \"type\": \"float\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"EVENT_TYPE\",\n",
    "            \"type\": \"string\"\n",
    "        }\n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "            \n",
    "create_schema_response = personalize.create_schema(\n",
    "    name = \"amazon-pantry-interactions\",\n",
    "    schema = json.dumps(interactions_schema)\n",
    ")\n",
    "\n",
    "interaction_schema_arn = create_schema_response['schemaArn']\n",
    "print(json.dumps(create_schema_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e32abcd",
   "metadata": {},
   "source": [
    "Next we'll create an Interactions dataset in both dataset groups specifying the schema we just created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "b8641620",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-without-desc/INTERACTIONS\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"ea1020bb-a7fa-4a64-a29e-f07e8e0f9ae9\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:08:41 GMT\",\n",
      "      \"x-amzn-requestid\": \"ea1020bb-a7fa-4a64-a29e-f07e8e0f9ae9\",\n",
      "      \"content-length\": \"107\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "dataset_type = \"INTERACTIONS\"\n",
    "create_dataset_response = personalize.create_dataset(\n",
    "    name = \"amazon-pantry-without-desc-ints\",\n",
    "    datasetType = dataset_type,\n",
    "    datasetGroupArn = dataset_group_without_desc_arn,\n",
    "    schemaArn = interaction_schema_arn\n",
    ")\n",
    "\n",
    "interactions_dataset_without_desc_arn = create_dataset_response['datasetArn']\n",
    "print(json.dumps(create_dataset_response, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "fe84262a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-with-desc/INTERACTIONS\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"614257cc-6344-408e-8e15-89d4f81ae927\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:08:41 GMT\",\n",
      "      \"x-amzn-requestid\": \"614257cc-6344-408e-8e15-89d4f81ae927\",\n",
      "      \"content-length\": \"104\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_response = personalize.create_dataset(\n",
    "    name = \"amazon-pantry-with-desc-ints\",\n",
    "    datasetType = dataset_type,\n",
    "    datasetGroupArn = dataset_group_with_desc_arn,\n",
    "    schemaArn = interaction_schema_arn\n",
    ")\n",
    "\n",
    "interactions_dataset_with_desc_arn = create_dataset_response['datasetArn']\n",
    "print(json.dumps(create_dataset_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "208ac794",
   "metadata": {},
   "source": [
    "### Stage Interactions CSV in S3\n",
    "\n",
    "Before we can upload the interactions CSV we created earlier into the Personalize datasets that we just created, we need to stage the CSV in an S3 bucket.\n",
    "\n",
    "Let's create an S3 bucket and copy the interactions CSV file to the bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "af016b5a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "us-east-1\n"
     ]
    }
   ],
   "source": [
    "# Determine the current S3 region where this notebook is being hosted in SageMaker.\n",
    "with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n",
    "    data = json.load(notebook_info)\n",
    "    resource_arn = data['ResourceArn']\n",
    "    region = resource_arn.split(':')[3]\n",
    "print(region)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "dcd7662a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "224124347618-us-east-1-amazon-pantry-personalize-text\n"
     ]
    }
   ],
   "source": [
    "s3 = boto3.client('s3')\n",
    "account_id = boto3.client('sts').get_caller_identity().get('Account')\n",
    "bucket_name = account_id + \"-\" + region + \"-\" + \"amazon-pantry-personalize-text\"\n",
    "print(bucket_name)\n",
    "if region == \"us-east-1\":\n",
    "    s3.create_bucket(Bucket=bucket_name)\n",
    "else:\n",
    "    s3.create_bucket(\n",
    "        Bucket=bucket_name,\n",
    "        CreateBucketConfiguration={'LocationConstraint': region}\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1129d901",
   "metadata": {},
   "source": [
    "#### Upload Interactions CSV to S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "d416f815",
   "metadata": {},
   "outputs": [],
   "source": [
    "boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5beabb6c",
   "metadata": {},
   "source": [
    "### Create S3 bucket policy and IAM role\n",
    "\n",
    "Before we can submit a dataset import job to Personalize, we have to create a bucket policy and IAM role that will give Personalize access to our bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "7cf21723",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ResponseMetadata': {'RequestId': 'SBN10P9R7H7ST5RK',\n",
       "  'HostId': 'DD8fYEx27yBq6/rB7o9lMvkdCLOHOewN05NSq73g30jeFBdouLj5D+fWSnIZHvDuAKdCKEo7w3k=',\n",
       "  'HTTPStatusCode': 204,\n",
       "  'HTTPHeaders': {'x-amz-id-2': 'DD8fYEx27yBq6/rB7o9lMvkdCLOHOewN05NSq73g30jeFBdouLj5D+fWSnIZHvDuAKdCKEo7w3k=',\n",
       "   'x-amz-request-id': 'SBN10P9R7H7ST5RK',\n",
       "   'date': 'Tue, 13 Jul 2021 22:10:59 GMT',\n",
       "   'server': 'AmazonS3'},\n",
       "  'RetryAttempts': 0}}"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "policy = {\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Id\": \"PersonalizeS3BucketAccessPolicy\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "            \"Sid\": \"PersonalizeS3BucketAccessPolicy\",\n",
    "            \"Effect\": \"Allow\",\n",
    "            \"Principal\": {\n",
    "                \"Service\": \"personalize.amazonaws.com\"\n",
    "            },\n",
    "            \"Action\": [\n",
    "                \"s3:*Object\",\n",
    "                \"s3:ListBucket\"\n",
    "            ],\n",
    "            \"Resource\": [\n",
    "                \"arn:aws:s3:::{}\".format(bucket_name),\n",
    "                \"arn:aws:s3:::{}/*\".format(bucket_name)\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "fb5b7804",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "arn:aws:iam::224124347618:role/PersonalizeRoleAmazonPantry\n"
     ]
    }
   ],
   "source": [
    "iam = boto3.client(\"iam\")\n",
    "\n",
    "role_name = \"PersonalizeRoleAmazonPantry\"\n",
    "assume_role_policy_document = {\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "          \"Effect\": \"Allow\",\n",
    "          \"Principal\": {\n",
    "            \"Service\": \"personalize.amazonaws.com\"\n",
    "          },\n",
    "          \"Action\": \"sts:AssumeRole\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "create_role_response = iam.create_role(\n",
    "    RoleName = role_name,\n",
    "    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)\n",
    ")\n",
    "\n",
    "# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes \"personalize\" or \"Personalize\" \n",
    "# if you would like to use a bucket with a different name, please consider creating and attaching a new policy\n",
    "# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role\n",
    "policy_arn = \"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess\"\n",
    "iam.attach_role_policy(\n",
    "    RoleName = role_name,\n",
    "    PolicyArn = policy_arn\n",
    ")\n",
    "\n",
    "# Now add S3 support\n",
    "iam.attach_role_policy(\n",
    "    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',\n",
    "    RoleName=role_name\n",
    ")\n",
    "time.sleep(20) # wait for a minute to allow IAM role policy attachment to propagate\n",
    "\n",
    "role_arn = create_role_response[\"Role\"][\"Arn\"]\n",
    "print(role_arn)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec1513dc",
   "metadata": {},
   "source": [
    "### Import Interactions datasets for each dataset group\n",
    "\n",
    "Now we're ready to import the staged Interactions CSV in our S3 bucket to the Personalize datasets we created in each dataset group. We'll submit both import jobs and wait for them both to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "f4fc686d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-ints-import\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"84c4d71d-fe71-4ee7-bccf-4f8ed8b1e549\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:12:08 GMT\",\n",
      "      \"x-amzn-requestid\": \"84c4d71d-fe71-4ee7-bccf-4f8ed8b1e549\",\n",
      "      \"content-length\": \"126\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"amazon-pantry-without-desc-ints-import\",\n",
    "    datasetArn = interactions_dataset_without_desc_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket_name, interactions_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "dataset_import_job_without_ints_arn = create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "4d23015e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-ints-import\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"eae39716-264b-48e8-bfb4-93f569c7a904\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:12:09 GMT\",\n",
      "      \"x-amzn-requestid\": \"eae39716-264b-48e8-bfb4-93f569c7a904\",\n",
      "      \"content-length\": \"123\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"amazon-pantry-with-desc-ints-import\",\n",
    "    datasetArn = interactions_dataset_with_desc_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket_name, interactions_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "dataset_import_job_with_ints_arn = create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe0de9d0",
   "metadata": {},
   "source": [
    "### Wait for Interactions dataset import jobs to complete\n",
    "\n",
    "The following cell will wait for both import jobs to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "68b770b9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-ints-import\n",
      "At least one dataset import job is still in progress\n",
      "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-ints-import\n",
      "CPU times: user 42.3 ms, sys: 6.1 ms, total: 48.4 ms\n",
      "Wall time: 5min\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "in_progress_import_arns = [ dataset_import_job_without_ints_arn, dataset_import_job_with_ints_arn ]\n",
    "\n",
    "max_time = time.time() + 3*60*60 # 3 hours\n",
    "while time.time() < max_time:\n",
    "    for import_arn in in_progress_import_arns:\n",
    "        describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n",
    "            datasetImportJobArn = import_arn\n",
    "        )\n",
    "        status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n",
    "        if status == \"ACTIVE\":\n",
    "            print(\"Dataset import succeeded for {}\".format(import_arn))\n",
    "            in_progress_import_arns.remove(import_arn)\n",
    "        elif status == \"CREATE FAILED\":\n",
    "            print(\"Create failed for {}\".format(import_arn))\n",
    "            in_progress_import_arns.remove(import_arn)\n",
    "\n",
    "    if len(in_progress_import_arns) <= 0:\n",
    "        break\n",
    "    else:\n",
    "        print(\"At least one dataset import job is still in progress\")\n",
    "                \n",
    "    time.sleep(60)"
   ]
  },
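  {
   "cell_type": "markdown",
   "id": "c4e1a7f2",
   "metadata": {},
   "source": [
    "The same polling logic is needed again later for the Items import jobs. As an optional sketch (not part of the original workflow), the loop could be wrapped in a small helper. The function name `wait_for_import_jobs` and its `poll_seconds` and `timeout_seconds` parameters are illustrative assumptions, not part of the Personalize API; the only service call it relies on is `describe_dataset_import_job`, which is already used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b9d3e80",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper: poll until every import job is ACTIVE or CREATE FAILED, or the timeout passes.\n",
    "# Returns a dict mapping each finished job ARN to its final status.\n",
    "def wait_for_import_jobs(import_arns, poll_seconds=60, timeout_seconds=3*60*60):\n",
    "    pending = set(import_arns)\n",
    "    final_statuses = {}\n",
    "    deadline = time.time() + timeout_seconds\n",
    "    while pending and time.time() < deadline:\n",
    "        # sorted() returns a new list, so removing from the set inside the loop is safe\n",
    "        for import_arn in sorted(pending):\n",
    "            response = personalize.describe_dataset_import_job(\n",
    "                datasetImportJobArn = import_arn\n",
    "            )\n",
    "            status = response[\"datasetImportJob\"][\"status\"]\n",
    "            if status in (\"ACTIVE\", \"CREATE FAILED\"):\n",
    "                print(\"{}: {}\".format(import_arn, status))\n",
    "                final_statuses[import_arn] = status\n",
    "                pending.discard(import_arn)\n",
    "        if pending:\n",
    "            print(\"At least one dataset import job is still in progress\")\n",
    "            time.sleep(poll_seconds)\n",
    "    return final_statuses\n",
    "\n",
    "# Example usage (commented out so that running this cell has no side effects):\n",
    "# statuses = wait_for_import_jobs([dataset_import_job_without_ints_arn, dataset_import_job_with_ints_arn])"
   ]
  },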
  {
   "cell_type": "markdown",
   "id": "d6334309",
   "metadata": {},
   "source": [
    "### Create Items dataset schema and datasets\n",
    "\n",
    "Next we will repeat the process for the items datasets. This time, though, we will need to create two schemas since one items dataset includes the description column and the other does not. We'll start with the schema that does not include the description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "bbe80a96",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"schemaArn\": \"arn:aws:personalize:us-east-1:224124347618:schema/amazon-pantry-item-without-desc-schema\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"11a18b05-189b-4485-ba8b-58083e784321\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:17:16 GMT\",\n",
      "      \"x-amzn-requestid\": \"11a18b05-189b-4485-ba8b-58083e784321\",\n",
      "      \"content-length\": \"104\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "item_without_desc_schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Items\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"BRAND\",\n",
    "            \"type\": [ \"null\", \"string\" ],\n",
    "            \"categorical\": True\n",
    "        },{\n",
    "            \"name\": \"PRICE\",\n",
    "            \"type\": [ \"null\", \"float\" ],\n",
    "        }\n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "create_schema_response = personalize.create_schema(\n",
    "    name = \"amazon-pantry-item-without-desc-schema\",\n",
    "    schema = json.dumps(item_without_desc_schema)\n",
    ")\n",
    "\n",
    "item_without_desc_schema_arn = create_schema_response['schemaArn']\n",
    "print(json.dumps(create_schema_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75bcd9b8",
   "metadata": {},
   "source": [
    "Next we'll create a schema that includes the description. Be sure to take note of the `\"textual\": True` attribute on the `DESCRIPTION` field. This is how you differentiate unstructured text fields from categorical and string fields. Without this attribute, Personalize will not apply natural language processing techniques to extract features from this text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "052d149e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"schemaArn\": \"arn:aws:personalize:us-east-1:224124347618:schema/amazon-pantry-item-with-desc-schema\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"05c60e80-1d7c-45f1-a881-d08af62a8432\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:17:16 GMT\",\n",
      "      \"x-amzn-requestid\": \"05c60e80-1d7c-45f1-a881-d08af62a8432\",\n",
      "      \"content-length\": \"101\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "item_with_desc_schema = {\n",
    "    \"type\": \"record\",\n",
    "    \"name\": \"Items\",\n",
    "    \"namespace\": \"com.amazonaws.personalize.schema\",\n",
    "    \"fields\": [\n",
    "        {\n",
    "            \"name\": \"ITEM_ID\",\n",
    "            \"type\": \"string\"\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"BRAND\",\n",
    "            \"type\": [ \"null\", \"string\" ],\n",
    "            \"categorical\": True\n",
    "        },{\n",
    "            \"name\": \"PRICE\",\n",
    "            \"type\": [ \"null\", \"float\" ],\n",
    "        },{\n",
    "            \"name\": \"DESCRIPTION\",\n",
    "            \"type\": [ \"null\", \"string\" ],\n",
    "            \"textual\": True\n",
    "        }\n",
    "    ],\n",
    "    \"version\": \"1.0\"\n",
    "}\n",
    "\n",
    "create_schema_response = personalize.create_schema(\n",
    "    name = \"amazon-pantry-item-with-desc-schema\",\n",
    "    schema = json.dumps(item_with_desc_schema)\n",
    ")\n",
    "\n",
    "item_with_desc_schema_arn = create_schema_response['schemaArn']\n",
    "print(json.dumps(create_schema_response, indent=2))"
   ]
  },
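  {
   "cell_type": "markdown",
   "id": "7d2f9a16",
   "metadata": {},
   "source": [
    "To confirm that the `textual` attribute was stored with the schema, one option is to read the schema back and list the fields flagged as textual. This is an optional sketch: it assumes the `describe_schema` response nests the Avro schema as a JSON string under `schema` and then `schema`, and the variable names `stored_schema` and `textual_fields` are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8b04c63",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read back the schema created above and parse its Avro JSON definition.\n",
    "describe_schema_response = personalize.describe_schema(\n",
    "    schemaArn = item_with_desc_schema_arn\n",
    ")\n",
    "stored_schema = json.loads(describe_schema_response['schema']['schema'])\n",
    "\n",
    "# Collect the names of all fields marked with \"textual\": true.\n",
    "textual_fields = [field['name'] for field in stored_schema['fields'] if field.get('textual')]\n",
    "print(textual_fields)"
   ]
  },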
  {
   "cell_type": "markdown",
   "id": "43255f2e",
   "metadata": {},
   "source": [
    "Next we will create Personalize datasets in each dataset group, taking special care to specify the approprate schema ARN for each dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "1b53563d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-without-desc/ITEMS\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"f9b87f6c-22c8-42a7-8a3a-8c203300e3eb\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:18:13 GMT\",\n",
      "      \"x-amzn-requestid\": \"f9b87f6c-22c8-42a7-8a3a-8c203300e3eb\",\n",
      "      \"content-length\": \"100\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "dataset_type = \"ITEMS\"\n",
    "create_dataset_response = personalize.create_dataset(\n",
    "    name = \"amazon-pantry-without-desc-items\",\n",
    "    datasetType = dataset_type,\n",
    "    datasetGroupArn = dataset_group_without_desc_arn,\n",
    "    schemaArn = item_without_desc_schema_arn\n",
    ")\n",
    "\n",
    "items_dataset_without_desc_arn = create_dataset_response['datasetArn']\n",
    "print(json.dumps(create_dataset_response, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "1763e36a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset/amazon-pantry-with-desc/ITEMS\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"cbba4d2a-17c9-4a9b-a6a3-a2df934e2de1\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:18:16 GMT\",\n",
      "      \"x-amzn-requestid\": \"cbba4d2a-17c9-4a9b-a6a3-a2df934e2de1\",\n",
      "      \"content-length\": \"97\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_response = personalize.create_dataset(\n",
    "    name = \"amazon-pantry-with-desc-items\",\n",
    "    datasetType = dataset_type,\n",
    "    datasetGroupArn = dataset_group_with_desc_arn,\n",
    "    schemaArn = item_with_desc_schema_arn\n",
    ")\n",
    "\n",
    "items_dataset_with_desc_arn = create_dataset_response['datasetArn']\n",
    "print(json.dumps(create_dataset_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45684b47",
   "metadata": {},
   "source": [
    "#### Stage Items CSV in S3\n",
    "\n",
    "Next we'll copy our two items CSV files to the same S3 bucket create above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "9dfd8f9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "boto3.Session().resource('s3').Bucket(bucket_name).Object(items_without_desc_filename).upload_file(items_without_desc_filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "3b7c8c32",
   "metadata": {},
   "outputs": [],
   "source": [
    "boto3.Session().resource('s3').Bucket(bucket_name).Object(items_with_desc_filename).upload_file(items_with_desc_filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05130046",
   "metadata": {},
   "source": [
    "### Import Items datasets for each dataset group\n",
    "\n",
    "Since the S3 bucket policy and IAM role are already setup, we can just submit two dataset import jobs to import the Items CSVs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "e05d6a34",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-items-import\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"c6ea207b-b8eb-4565-8ce2-6eb44e0fa18f\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:18:37 GMT\",\n",
      "      \"x-amzn-requestid\": \"c6ea207b-b8eb-4565-8ce2-6eb44e0fa18f\",\n",
      "      \"content-length\": \"127\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"amazon-pantry-without-desc-items-import\",\n",
    "    datasetArn = items_dataset_without_desc_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket_name, items_without_desc_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "dataset_import_job_without_items_arn = create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "b34a8b81",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"datasetImportJobArn\": \"arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-items-import\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"cf95206f-1527-4247-a618-6c8c832fa05f\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:18:38 GMT\",\n",
      "      \"x-amzn-requestid\": \"cf95206f-1527-4247-a618-6c8c832fa05f\",\n",
      "      \"content-length\": \"124\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "create_dataset_import_job_response = personalize.create_dataset_import_job(\n",
    "    jobName = \"amazon-pantry-with-desc-items-import\",\n",
    "    datasetArn = items_dataset_with_desc_arn,\n",
    "    dataSource = {\n",
    "        \"dataLocation\": \"s3://{}/{}\".format(bucket_name, items_with_desc_filename)\n",
    "    },\n",
    "    roleArn = role_arn\n",
    ")\n",
    "\n",
    "dataset_import_job_with_items_arn = create_dataset_import_job_response['datasetImportJobArn']\n",
    "print(json.dumps(create_dataset_import_job_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f558185",
   "metadata": {},
   "source": [
    "### Wait for Items import jobs to complete\n",
    "\n",
    "The following logic will wait until both items datasets are fully imported into each dataset group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "18a35716",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-with-desc-items-import\n",
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "At least one dataset import job is still in progress\n",
      "Dataset import succeeded for arn:aws:personalize:us-east-1:224124347618:dataset-import-job/amazon-pantry-without-desc-items-import\n",
      "CPU times: user 57.6 ms, sys: 5.06 ms, total: 62.7 ms\n",
      "Wall time: 7min\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "in_progress_import_arns = [ dataset_import_job_without_items_arn, dataset_import_job_with_items_arn ]\n",
    "\n",
    "max_time = time.time() + 3*60*60 # 3 hours\n",
    "while time.time() < max_time:\n",
    "    for import_arn in in_progress_import_arns:\n",
    "        describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n",
    "            datasetImportJobArn = import_arn\n",
    "        )\n",
    "        status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n",
    "        if status == \"ACTIVE\":\n",
    "            print(\"Dataset import succeeded for {}\".format(import_arn))\n",
    "            in_progress_import_arns.remove(import_arn)\n",
    "        elif status == \"CREATE FAILED\":\n",
    "            print(\"Create failed for {}\".format(import_arn))\n",
    "            in_progress_import_arns.remove(import_arn)\n",
    "\n",
    "    if len(in_progress_import_arns) <= 0:\n",
    "        break\n",
    "    else:\n",
    "        print(\"At least one dataset import job is still in progress\")\n",
    "                \n",
    "    time.sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66c3c35c",
   "metadata": {},
   "source": [
    "## Create solutions and solution versions\n",
    "\n",
    "With the interactions and items datasets imported into each dataset group, we will next create solutions and solution versions using the user-personalization recipe for the data in each dataset group.\n",
    "\n",
    "First, let's list the Personalize recipes available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "fa25cbf7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'recipes': [{'name': 'aws-hrnn',\n",
       "   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn',\n",
       "   'status': 'ACTIVE',\n",
       "   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n",
       "   'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n",
       "  {'name': 'aws-hrnn-coldstart',\n",
       "   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-coldstart',\n",
       "   'status': 'ACTIVE',\n",
       "   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n",
       "   'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n",
       "  {'name': 'aws-hrnn-metadata',\n",
       "   'recipeArn': 'arn:aws:personalize:::recipe/aws-hrnn-metadata',\n",
       "   'status': 'ACTIVE',\n",
       "   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n",
       "   'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n",
       "  {'name': 'aws-personalized-ranking',\n",
       "   'recipeArn': 'arn:aws:personalize:::recipe/aws-personalized-ranking',\n",
       "   'status': 'ACTIVE',\n",
       "   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n",
       "   'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n",
       "  {'name': 'aws-popularity-count',\n",
       "   'recipeArn': 'arn:aws:personalize:::recipe/aws-popularity-count',\n",
       "   'status': 'ACTIVE',\n",
       "   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n",
       "   'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n",
       "  {'name': 'aws-sims',\n",
       "   'recipeArn': 'arn:aws:personalize:::recipe/aws-sims',\n",
       "   'status': 'ACTIVE',\n",
       "   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n",
       "   'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())},\n",
       "  {'name': 'aws-user-personalization',\n",
       "   'recipeArn': 'arn:aws:personalize:::recipe/aws-user-personalization',\n",
       "   'status': 'ACTIVE',\n",
       "   'creationDateTime': datetime.datetime(2019, 6, 10, 0, 0, tzinfo=tzlocal()),\n",
       "   'lastUpdatedDateTime': datetime.datetime(2021, 2, 6, 19, 6, 40, 447000, tzinfo=tzlocal())}],\n",
       " 'ResponseMetadata': {'RequestId': '36ed9d13-64e9-4905-8528-136d414cb606',\n",
       "  'HTTPStatusCode': 200,\n",
       "  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',\n",
       "   'date': 'Tue, 13 Jul 2021 22:26:03 GMT',\n",
       "   'x-amzn-requestid': '36ed9d13-64e9-4905-8528-136d414cb606',\n",
       "   'content-length': '1259',\n",
       "   'connection': 'keep-alive'},\n",
       "  'RetryAttempts': 0}}"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "personalize.list_recipes()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cd368df",
   "metadata": {},
   "source": [
    "We will use the user-personalization recipe for this notebook since it is one of the recipes that use item metadata. This recipe supports the canonical personalization use case where, given a user, you want Personalize to recommend items that the user is likely to be interested in."
   ]
  },
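  {
   "cell_type": "markdown",
   "id": "c4e1a7b2",
   "metadata": {},
   "source": [
    "As a minimal sketch, the recipe ARN could also be looked up by name from the `list_recipes()` response rather than typed by hand. The helper name `find_recipe_arn` and the trimmed `sample_response` below are illustrative assumptions modeled on the output shown above; a real response may be paginated, which this sketch does not handle.\n",
    "\n",
    "```python\n",
    "def find_recipe_arn(recipes_response, recipe_name):\n",
    "    # Return the ARN of the ACTIVE recipe with this name, or None if absent.\n",
    "    for recipe in recipes_response.get('recipes', []):\n",
    "        if recipe['name'] == recipe_name and recipe['status'] == 'ACTIVE':\n",
    "            return recipe['recipeArn']\n",
    "    return None\n",
    "\n",
    "\n",
    "# Hypothetical, trimmed response with the same shape as the output above.\n",
    "sample_response = {\n",
    "    'recipes': [\n",
    "        {'name': 'aws-sims',\n",
    "         'recipeArn': 'arn:aws:personalize:::recipe/aws-sims',\n",
    "         'status': 'ACTIVE'},\n",
    "        {'name': 'aws-user-personalization',\n",
    "         'recipeArn': 'arn:aws:personalize:::recipe/aws-user-personalization',\n",
    "         'status': 'ACTIVE'},\n",
    "    ]\n",
    "}\n",
    "\n",
    "recipe_arn = find_recipe_arn(sample_response, 'aws-user-personalization')\n",
    "print(recipe_arn)\n",
    "```\n",
    "\n",
    "In the notebook, passing `personalize.list_recipes()` as the first argument should give the same ARN that is assigned by hand in the next cell."
   ]
  },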
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "cbb8d479",
   "metadata": {},
   "outputs": [],
   "source": [
    "user_personalization_recipe_arn = \"arn:aws:personalize:::recipe/aws-user-personalization\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae991241",
   "metadata": {},
   "source": [
    "First, we will create a solution and solution version in the dataset group that does not include item descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "99e0b49d",
   "metadata": {},
   "outputs": [],
   "source": [
    "user_personalization_create_solution_response = personalize.create_solution(\n",
    "    name = \"amazon-pantry-without-desc-userpersonalization\",\n",
    "    datasetGroupArn = dataset_group_without_desc_arn,\n",
    "    recipeArn = user_personalization_recipe_arn\n",
    ")\n",
    "\n",
    "user_personalization_without_desc_solution_arn = user_personalization_create_solution_response['solutionArn']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "04b6fbda",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization\n"
     ]
    }
   ],
   "source": [
    "print(user_personalization_without_desc_solution_arn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "6ac0b147",
   "metadata": {},
   "outputs": [],
   "source": [
    "user_personalization_solution_version_response = personalize.create_solution_version(\n",
    "    solutionArn = user_personalization_without_desc_solution_arn\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "7819e32a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization/0b76212f\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"018b3fcb-10d5-4290-a17a-3970723abacd\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:26:14 GMT\",\n",
      "      \"x-amzn-requestid\": \"018b3fcb-10d5-4290-a17a-3970723abacd\",\n",
      "      \"content-length\": \"132\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "user_personalization_without_solution_version_arn = user_personalization_solution_version_response['solutionVersionArn']\n",
    "print(json.dumps(user_personalization_solution_version_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecadbfb1",
   "metadata": {},
   "source": [
    "Next, we will create a solution and solution version in the dataset group that includes the item descriptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "58571178",
   "metadata": {},
   "outputs": [],
   "source": [
    "user_personalization_create_solution_response = personalize.create_solution(\n",
    "    name = \"amazon-pantry-with-desc-userpersonalization\",\n",
    "    datasetGroupArn = dataset_group_with_desc_arn,\n",
    "    recipeArn = user_personalization_recipe_arn\n",
    ")\n",
    "\n",
    "user_personalization_with_desc_solution_arn = user_personalization_create_solution_response['solutionArn']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "ab2b246d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization\n"
     ]
    }
   ],
   "source": [
    "print(user_personalization_with_desc_solution_arn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "1ce9ded8",
   "metadata": {},
   "outputs": [],
   "source": [
    "user_personalization_solution_version_response = personalize.create_solution_version(\n",
    "    solutionArn = user_personalization_with_desc_solution_arn\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "cdc817b8",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization/f178990f\",\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"f630b862-6fa9-4eb7-a0d2-71d0b9637e80\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 22:26:29 GMT\",\n",
      "      \"x-amzn-requestid\": \"f630b862-6fa9-4eb7-a0d2-71d0b9637e80\",\n",
      "      \"content-length\": \"129\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "user_personalization_with_solution_version_arn = user_personalization_solution_version_response['solutionVersionArn']\n",
    "print(json.dumps(user_personalization_solution_version_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b33fd2f0",
   "metadata": {},
   "source": [
    "### Wait for solution versions to become active\n",
    "\n",
    "Finally, we'll wait for the solution versions to finish being created. This step is where Personalize trains machine learning models based on the datasets and selected recipe. Personalize also splits each interactions dataset into training and evaluation portions so it can evaluate the quality of the trained model's recommendations using held-out data.\n",
    "\n",
    "You will notice that the solution version in the dataset group that includes the description data will take longer to train than the one without the description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "e87ccb3d",
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "Build succeeded for arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization/0b76212f\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "At least one solution build is still in progress\n",
      "Build succeeded for arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization/f178990f\n",
      "CPU times: user 363 ms, sys: 65.4 ms, total: 428 ms\n",
      "Wall time: 48min 4s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "in_progress_solution_versions = [\n",
    "    user_personalization_without_solution_version_arn,\n",
    "    user_personalization_with_solution_version_arn\n",
    "]\n",
    "\n",
    "max_time = time.time() + 10*60*60 # 10 hours\n",
    "while time.time() < max_time:\n",
    "    # Iterate over a copy because finished ARNs are removed from the list inside the loop\n",
    "    for solution_version_arn in list(in_progress_solution_versions):\n",
    "        version_response = personalize.describe_solution_version(\n",
    "            solutionVersionArn = solution_version_arn\n",
    "        )\n",
    "        status = version_response[\"solutionVersion\"][\"status\"]\n",
    "        \n",
    "        if status == \"ACTIVE\":\n",
    "            print(\"Build succeeded for {}\".format(solution_version_arn))\n",
    "            in_progress_solution_versions.remove(solution_version_arn)\n",
    "        elif status == \"CREATE FAILED\":\n",
    "            print(\"Build failed for {}\".format(solution_version_arn))\n",
    "            in_progress_solution_versions.remove(solution_version_arn)\n",
    "    \n",
    "    if len(in_progress_solution_versions) <= 0:\n",
    "        break\n",
    "    else:\n",
    "        print(\"At least one solution build is still in progress\")\n",
    "        \n",
    "    time.sleep(60)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccd2e728",
   "metadata": {},
   "source": [
    "Generally speaking, adding text-based unstructured metadata will increase training time. In our case, the output above shows that the solution version trained on the dataset with product descriptions took about 15 minutes longer than the one trained on the dataset without product descriptions. This difference will vary based on the composition of your datasets and the text values they contain.\n",
    "\n",
    "Let's inspect the training hours for each solution version and compare them as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "7f34c142",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training hours without description: 4.199\n",
      "Training hours with description: 5.346\n",
      "Difference of 27.32%\n"
     ]
    }
   ],
   "source": [
    "response = personalize.describe_solution_version(solutionVersionArn = user_personalization_without_solution_version_arn)\n",
    "training_hours_without_desc = response['solutionVersion']['trainingHours']\n",
    "\n",
    "response = personalize.describe_solution_version(solutionVersionArn = user_personalization_with_solution_version_arn)\n",
    "training_hours_with_desc = response['solutionVersion']['trainingHours']\n",
    "training_diff = (training_hours_with_desc - training_hours_without_desc) / training_hours_without_desc\n",
    "\n",
    "print(f\"Training hours without description: {training_hours_without_desc}\")\n",
    "print(f\"Training hours with description: {training_hours_with_desc}\")\n",
    "\n",
    "print(\"Difference of {:.2%}\".format(training_diff))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "baba62c8",
   "metadata": {},
   "source": [
    "The training hours, which are used for cost calculations, were about 27% higher when training with the description column.\n",
    "\n",
    "The wall-clock time and training hours will vary depending on the size of your datasets, but this information can help you assess the trade-off when considering adding unstructured text to your datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa0bb048",
   "metadata": {},
   "source": [
    "### Inspect offline metrics\n",
    "\n",
    "Now that the solution versions have finished building, let's inspect and compare the offline metrics for each solution version to see how the inclusion of unstructured text has impacted these metrics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "73e915f5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-without-desc-userpersonalization/0b76212f\",\n",
      "  \"metrics\": {\n",
      "    \"coverage\": 0.0914,\n",
      "    \"mean_reciprocal_rank_at_25\": 0.0268,\n",
      "    \"normalized_discounted_cumulative_gain_at_10\": 0.0376,\n",
      "    \"normalized_discounted_cumulative_gain_at_25\": 0.0464,\n",
      "    \"normalized_discounted_cumulative_gain_at_5\": 0.0309,\n",
      "    \"precision_at_10\": 0.0058,\n",
      "    \"precision_at_25\": 0.0037,\n",
      "    \"precision_at_5\": 0.0076\n",
      "  },\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"8c61339a-f929-47e0-81f0-a9660ebd589f\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 23:16:07 GMT\",\n",
      "      \"x-amzn-requestid\": \"8c61339a-f929-47e0-81f0-a9660ebd589f\",\n",
      "      \"content-length\": \"430\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "metrics_response = personalize.get_solution_metrics(\n",
    "    solutionVersionArn = user_personalization_without_solution_version_arn\n",
    ")\n",
    "\n",
    "print(json.dumps(metrics_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "671ea36b",
   "metadata": {},
   "source": [
    "Let's save these off in a dictionary so we can more easily compare the metrics between both solution versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "9eed50ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics = {\n",
    "    'Coverage': [ metrics_response['metrics']['coverage'] ],\n",
    "    'MRR-25': [ metrics_response['metrics']['mean_reciprocal_rank_at_25'] ],\n",
    "    'NDCG-5': [ metrics_response['metrics']['normalized_discounted_cumulative_gain_at_5'] ],\n",
    "    'NDCG-10': [ metrics_response['metrics']['normalized_discounted_cumulative_gain_at_10'] ],\n",
    "    'NDCG-25': [ metrics_response['metrics']['normalized_discounted_cumulative_gain_at_25'] ],    \n",
    "    'Precision-5': [ metrics_response['metrics']['precision_at_5'] ],\n",
    "    'Precision-10': [ metrics_response['metrics']['precision_at_10'] ],\n",
    "    'Precision-25': [ metrics_response['metrics']['precision_at_25'] ],    \n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0a751c2",
   "metadata": {},
   "source": [
    "Next, get the offline metrics for the solution version that included the description column and save them off too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "13d8b2f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"solutionVersionArn\": \"arn:aws:personalize:us-east-1:224124347618:solution/amazon-pantry-with-desc-userpersonalization/f178990f\",\n",
      "  \"metrics\": {\n",
      "    \"coverage\": 0.1323,\n",
      "    \"mean_reciprocal_rank_at_25\": 0.0367,\n",
      "    \"normalized_discounted_cumulative_gain_at_10\": 0.049,\n",
      "    \"normalized_discounted_cumulative_gain_at_25\": 0.0591,\n",
      "    \"normalized_discounted_cumulative_gain_at_5\": 0.0425,\n",
      "    \"precision_at_10\": 0.0071,\n",
      "    \"precision_at_25\": 0.0045,\n",
      "    \"precision_at_5\": 0.0104\n",
      "  },\n",
      "  \"ResponseMetadata\": {\n",
      "    \"RequestId\": \"b54df693-4378-4194-96c7-cec3a9d934cf\",\n",
      "    \"HTTPStatusCode\": 200,\n",
      "    \"HTTPHeaders\": {\n",
      "      \"content-type\": \"application/x-amz-json-1.1\",\n",
      "      \"date\": \"Tue, 13 Jul 2021 23:16:14 GMT\",\n",
      "      \"x-amzn-requestid\": \"b54df693-4378-4194-96c7-cec3a9d934cf\",\n",
      "      \"content-length\": \"426\",\n",
      "      \"connection\": \"keep-alive\"\n",
      "    },\n",
      "    \"RetryAttempts\": 0\n",
      "  }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "metrics_response = personalize.get_solution_metrics(\n",
    "    solutionVersionArn = user_personalization_with_solution_version_arn\n",
    ")\n",
    "\n",
    "print(json.dumps(metrics_response, indent=2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "81c4adbb",
   "metadata": {},
   "outputs": [],
   "source": [
    "metrics['Coverage'].append(metrics_response['metrics']['coverage'])\n",
    "metrics['MRR-25'].append(metrics_response['metrics']['mean_reciprocal_rank_at_25'])\n",
    "metrics['NDCG-5'].append(metrics_response['metrics']['normalized_discounted_cumulative_gain_at_5'])\n",
    "metrics['NDCG-10'].append(metrics_response['metrics']['normalized_discounted_cumulative_gain_at_10'])\n",
    "metrics['NDCG-25'].append(metrics_response['metrics']['normalized_discounted_cumulative_gain_at_25'])\n",
    "metrics['Precision-5'].append(metrics_response['metrics']['precision_at_5'])\n",
    "metrics['Precision-10'].append(metrics_response['metrics']['precision_at_10'])\n",
    "metrics['Precision-25'].append(metrics_response['metrics']['precision_at_25'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c93993e",
   "metadata": {},
   "source": [
    "Calculate the percentage change in each metric between the solution version trained without text and the one trained with text, then display the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "id": "37d850f9",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Without Text</th>\n",
       "      <th>With Text</th>\n",
       "      <th>% Change</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Coverage</th>\n",
       "      <td>0.0914</td>\n",
       "      <td>0.1323</td>\n",
       "      <td>44.75%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MRR-25</th>\n",
       "      <td>0.0268</td>\n",
       "      <td>0.0367</td>\n",
       "      <td>36.94%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NDCG-5</th>\n",
       "      <td>0.0309</td>\n",
       "      <td>0.0425</td>\n",
       "      <td>37.54%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NDCG-10</th>\n",
       "      <td>0.0376</td>\n",
       "      <td>0.0490</td>\n",
       "      <td>30.32%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NDCG-25</th>\n",
       "      <td>0.0464</td>\n",
       "      <td>0.0591</td>\n",
       "      <td>27.37%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Precision-5</th>\n",
       "      <td>0.0076</td>\n",
       "      <td>0.0104</td>\n",
       "      <td>36.84%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Precision-10</th>\n",
       "      <td>0.0058</td>\n",
       "      <td>0.0071</td>\n",
       "      <td>22.41%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Precision-25</th>\n",
       "      <td>0.0037</td>\n",
       "      <td>0.0045</td>\n",
       "      <td>21.62%</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Without Text  With Text % Change\n",
       "Coverage            0.0914     0.1323   44.75%\n",
       "MRR-25              0.0268     0.0367   36.94%\n",
       "NDCG-5              0.0309     0.0425   37.54%\n",
       "NDCG-10             0.0376     0.0490   30.32%\n",
       "NDCG-25             0.0464     0.0591   27.37%\n",
       "Precision-5         0.0076     0.0104   36.84%\n",
       "Precision-10        0.0058     0.0071   22.41%\n",
       "Precision-25        0.0037     0.0045   21.62%"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for key in metrics:\n",
    "    metrics[key].append(\"{:.2%}\".format((metrics[key][1] - metrics[key][0])/metrics[key][0]))\n",
    "\n",
    "metrics_df = pd.DataFrame.from_dict(metrics,orient='index',columns=['Without Text', 'With Text', '% Change'])\n",
    "metrics_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "502af56d",
   "metadata": {},
   "source": [
    "These metrics show that the solution version trained with item descriptions scored higher on every offline metric, with relative gains ranging from about 22% (Precision-25) to about 45% (Coverage). Sparser interactions datasets, where users and items have fewer interactions, will benefit more from the addition of text than datasets that already have a higher number of interactions per item and/or user."
   ]
  },
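  {
   "cell_type": "markdown",
   "id": "d9f3b6a1",
   "metadata": {},
   "source": [
    "To gauge how sparse an interactions dataset is, a rough sketch like the following may help. The function name `interaction_density`, the toy `(user, item)` pairs, and the choice of statistics are assumptions made for illustration; they are not values that Personalize reports.\n",
    "\n",
    "```python\n",
    "def interaction_density(pairs):\n",
    "    # pairs: iterable of (user_id, item_id) tuples, one per interaction.\n",
    "    # Assumes at least one interaction; an empty input would divide by zero.\n",
    "    pairs = list(pairs)\n",
    "    users = {user for user, _ in pairs}\n",
    "    items = {item for _, item in pairs}\n",
    "    return {\n",
    "        'interactions_per_user': len(pairs) / len(users),\n",
    "        'interactions_per_item': len(pairs) / len(items),\n",
    "        # Share of distinct user-item combinations that were observed.\n",
    "        'density': len(set(pairs)) / (len(users) * len(items)),\n",
    "    }\n",
    "\n",
    "\n",
    "# Hypothetical toy data: 4 interactions, 3 users, 3 items.\n",
    "toy_pairs = [('u1', 'i1'), ('u1', 'i2'), ('u2', 'i1'), ('u3', 'i3')]\n",
    "stats = interaction_density(toy_pairs)\n",
    "for name, value in stats.items():\n",
    "    print(f'{name}: {value:.3f}')\n",
    "```\n",
    "\n",
    "With a pandas DataFrame of interactions, `zip(df['USER_ID'], df['ITEM_ID'])` would produce the pairs, assuming those column names. Lower values suggest a sparser dataset."
   ]
  },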
  {
   "cell_type": "markdown",
   "id": "75acfe18",
   "metadata": {},
   "source": [
    "## Cleanup\n",
    "\n",
    "Personalize resources created by this notebook can be deleted from the Personalize service page in the AWS console. \n",
    "\n",
    "Alternatively, the following script can be run locally to delete all resources in each dataset group.\n",
    "\n",
    "https://gist.github.com/james-jory/62ddddf2f9180b77dd2a42e645b9d3b0"
   ]
  },
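  {
   "cell_type": "markdown",
   "id": "e2a85c7f",
   "metadata": {},
   "source": [
    "For the solutions created above, a partial cleanup could also be sketched with boto3 as follows. This is an outline under stated assumptions, not the linked script: the helper name `delete_solutions` is made up here, it covers only solutions (not datasets, schemas, or dataset groups), and it assumes that no campaigns reference the solutions and that none are still training. Deletion is asynchronous, so resources may remain in a deleting state for a while.\n",
    "\n",
    "```python\n",
    "def delete_solutions(personalize_client, dataset_group_arn):\n",
    "    # Request deletion of every solution in the dataset group and\n",
    "    # return the ARNs that were submitted for deletion.\n",
    "    deleted = []\n",
    "    kwargs = {'datasetGroupArn': dataset_group_arn}\n",
    "    while True:\n",
    "        page = personalize_client.list_solutions(**kwargs)\n",
    "        for solution in page.get('solutions', []):\n",
    "            personalize_client.delete_solution(solutionArn=solution['solutionArn'])\n",
    "            deleted.append(solution['solutionArn'])\n",
    "        next_token = page.get('nextToken')\n",
    "        if not next_token:\n",
    "            return deleted\n",
    "        # Fetch the next page of solutions on the following iteration.\n",
    "        kwargs['nextToken'] = next_token\n",
    "```\n",
    "\n",
    "A call such as `delete_solutions(personalize, dataset_group_with_desc_arn)` would be made once per dataset group, using the client and ARN variables defined earlier in the notebook."
   ]
  },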
  {
   "cell_type": "markdown",
   "id": "0520ba10",
   "metadata": {},
   "source": [
    "In addition, the IAM Role and S3 bucket can be deleted from the IAM and S3 service pages, respectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93e64371",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}