{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Retail Demo Store - Personalization Workshop - Lab 1\n", "\n", "Welcome to the Retail Demo Store Personalization Workshop. In this module we're going to be implementing multiple personalization use cases, powered by [Amazon Personalize](https://aws.amazon.com/personalize/), to enhance the base Retail Demo Store storefront. We will break up the process into multiple labs that are designed to progressively build on top of each other. Therefore, you should complete the labs in the order specified below. Some labs are optional, as indicated below.\n", "\n", "## Workshop overview\n", "\n", "### Core/required labs\n", "\n", "The following labs guide you step-by-step through building the core personalization capabilities used in the Retail Demo Store storefront as well as foundational components leveraged in other workshops such as experimentation, messaging, and conversational. \n", "\n", "- **Lab 1**: Introduction and dataset preparation (this lab) (_20 minutes_)\n", "- **Lab 2**: Prepare Amazon Personalize and import datasets (_25 minutes_)\n", "- **Lab 3**: Create e-commerce domain recommenders and custom solutions in Amazon Personalize (_70-90 minutes_)\n", "- **Lab 4**: Evaluate recommendations from Amazon Personalize recommenders and campaigns, activate real-time events using an Event Tracker, and apply business rules using filters (_30-45 minutes_)\n", "\n", "\n", "### Optional labs\n", "\n", "The following labs allow you to further explore how to pair Amazon Personalize with popular customer data platforms (CDPs); using Personalize for user segementation; and alternative use cases such as personalized offers.\n", "\n", "- Real-time events with CDPs\n", " - **Lab 5**: Segment real-time data (optional) (_45 minutes_)\n", " - **Lab 6**: mParticle (optional) (_45 minutes_)\n", "- **Lab 7**: Intelligent user segmentation using Amazon Personalize (_20-30 minutes_)\n", "- **Lab 8**: Personalized offers (optional) (_120 minutes_)\n", "\n", "### Cleanup\n", "\n", "The cleanup lab will tear down all of the Personalize resources created by the labs in this workshop.\n", "\n", "- **Lab 9**: Cleanup resources (_15 minutes_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview of use cases\n", "\n", "Through the labs in the workshop we will be adding personalization to several views in the Retail Demo Store.\n", "\n", "- **Navigation**:\n", " - Personalized search: the order of search results displayed in the auto-complete search dropdown are personalized to each user's interests.\n", "- **Home page**:\n", " - Recommended For You: the \"Inspired by your shopping trends\" grid is populated with personalized recommendations for the current user (if emulating a shopper or after a few interactions have been collected for a cold user).\n", " - Featured products personalized ranking: order featured products based on the current user's interests.\n", "- **Product detail**:\n", " - Related product recommendations: the \"Compare similar items\" carousel is populated with related item recommendations based on co-interactions of other items and thematic similarities between items.\n", "- **Category detail**:\n", " - Personalized ranking: the order of items displayed on the category detail pages are personalized to the current user's interests.\n", "- **Interactive video stream**:\n", " - Related product recommendations: the \"Compare similar items\" carousel is populated with related item recommendations based on 
co-interactions of other items and thematic similarities between items.\n", "\n", "In addition, the recommenders we build in this workshop will be foundational components of other Retail Demo Store workshops including the customer messaging workshop, conversational AI workshop, and location services workshop." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to Amazon Personalize\n", "\n", "Amazon Personalize makes it easy for customers to develop applications with a wide array of personalization use cases, including real-time product recommendations and customized direct marketing. Amazon Personalize brings the same machine learning technology used by Amazon.com to everyone for use in their applications – with no machine learning experience required. Amazon Personalize customers pay for what they use, with no minimum fees or upfront commitment. You can start using Amazon Personalize with a simple three-step process, which only takes a few clicks in the AWS console, or a set of simple API calls. First, point Amazon Personalize to your user data, catalog data, and an activity stream of views, clicks, purchases, etc. in Amazon S3, or upload the data using a simple API call. Second, with a single click in the console or an API call, train a custom private recommendation model for your data. Third, retrieve personalized recommendations for any user by creating a recommender or campaign and using the GetRecommendations API.\n", "\n", "In this lab we are going to focus on inspecting and preparing the data we will need to provide to Personalize." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "The workshop will be using the Python programming language and the AWS SDK for Python. Even if you are not fluent in Python, the code cells should be reasonably intuitive. In practice, you can use any programming language supported by the AWS SDK to complete the same steps from this workshop in your application environment.\n", "\n", "### Update dependencies\n", "\n", "To get started, we need to perform a bit of setup. First, we need to ensure that a current version of botocore is locally installed. The botocore library is used by boto3, the AWS SDK library for Python. We need a current version to be able to access some of the newer Amazon Personalize features.\n", "\n", "The following cell will update pip and install the latest botocore library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "!{sys.executable} -m pip install --upgrade pip\n", "!{sys.executable} -m pip install --upgrade --no-deps --force-reinstall botocore" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import dependencies\n", "\n", "Next we need to import some dependencies/libraries needed to complete this lab."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import json\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import time\n", "import requests\n", "import csv \n", "import os\n", "import notebook_util\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create cients\n", "\n", "Next we need to create the AWS service clients needed for this workshop.\n", "\n", "- **personalize**: this client is used to create resources in Amazon Personalize\n", "- **servicediscovery**: this client is used to lookup the local IP addresses of the Retail Demo Store microservices that we'll need in the workshop\n", "- **ssm**: this client is used to access application configuration details stored in the Systems Manager parameter store\n", "\n", "Finally we'll lookup an identifier, stored as a resource tag in the SageMaker instance at deployment time. We need this tag to lookup resources we need, such as the IAM role needed by Personalize" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Setup clients\n", "personalize = boto3.client('personalize')\n", "servicediscovery = boto3.client('servicediscovery')\n", "ssm = boto3.client('ssm')\n", "\n", "Uid = notebook_util.lookup_uid()\n", "assert Uid is not None, 'Uid could not be determined from notebook instance tags'\n", "print('Uid:', Uid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lookup S3 bucket and initialize file names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be configuring some variables that will store the location of our source data. When the Retail Demo Store stack was deployed in this account, an Amazon S3 bucket was created for you and the name of this bucket was stored in Systems Manager Parameter Store. Using the `ssm` boto3 client we created above, we can retrieve the name of the bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucketresponse = ssm.get_parameter(\n", " Name='retaildemostore-stack-bucket'\n", ")\n", "\n", "# We will use this bucket to store our training data:\n", "bucket = bucketresponse['Parameter']['Value']\n", "\n", "# We will build and upload our training data in these files:\n", "items_filename = \"items.csv\"\n", "users_filename = \"users.csv\"\n", "interactions_filename = \"interactions.csv\"\n", "\n", "print('Bucket: {}'.format(bucket))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetch, Prepare, and Upload User, Product, and Interaction Data\n", "\n", "Amazon Personalize provides predefined recipes, based on common use cases, for training models. A recipe is a machine learning algorithm that you use with settings, or hyperparameters, and the data you provide to train an Amazon Personalize model. The data you provide to train a model are organized into separate datasets by the type of data being provided. A collection of datasets are organized into a dataset group. The three dataset types supported by Personalize are items, users, and interactions. Depending on the recipe type you choose, a different combination of dataset types are required. For all recipe types, an interactions dataset is required. Interactions represent how users interact with items. For example, viewing a product, watching a video, listening to a recording, or reading an article. 
For this workshop, we will be using recipes that support all three dataset types.\n", "\n", "When we deployed the Retail Demo Store, it was seeded with fictitious User and Product data. We will use this data to train multiple models (recommenders and solutions) in the Amazon Personalize service, which will be used to serve product recommendations and related items, and to rerank product lists for our users. The User and Product data can be accessed from the Retail Demo Store's [Users](https://github.com/aws-samples/retail-demo-store/tree/master/src/users) and [Products](https://github.com/aws-samples/retail-demo-store/tree/master/src/products) microservices, respectively. We will access our data through microservice data APIs, process the data, and upload them as CSVs to S3. Once our datasets are in S3, we can import them into the Amazon Personalize service.\n", "\n", "Let's get started." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Products Service Instance\n", "\n", "We will be pulling our Product data from the [Products Service](https://github.com/aws-samples/retail-demo-store/tree/master/src/products) that was deployed in Amazon Elastic Container Service as part of the Retail Demo Store. To connect to this service we will use [AWS Cloud Map](https://aws.amazon.com/cloud-map/)'s Service Discovery to discover an instance of the Products Service running in ECS, and then connect directly to that service instance to access our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = servicediscovery.discover_instances(\n", " NamespaceName='retaildemostore.local',\n", " ServiceName='products',\n", " MaxResults=1,\n", " HealthStatus='HEALTHY'\n", ")\n", "\n", "assert len(response['Instances']) > 0, 'Products service instance not found; check ECS to ensure it launched cleanly'\n", "\n", "products_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']\n", "print('Products Service Instance IP: {}'.format(products_service_instance))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download and Explore the Products Dataset\n", "\n", "Next, let's call the Products service to fetch all products in the catalog and then load that product information into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Once in a dataframe, we can display, inspect, and manipulate data as needed. A snapshot of the dataframe is displayed to give you an idea of the fields available in the catalog for each product." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get('http://{}/products/all'.format(products_service_instance))\n", "products = response.json()\n", "products_df = pd.DataFrame(products)\n", "pd.set_option('display.max_rows', 5)\n", "\n", "products_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare items dataset\n", "\n", "When training models in Amazon Personalize, we can provide structured and unstructured metadata about our items. This data helps improve the relevance of recommendations and is particularly useful when recommending new/cold items added to your catalog. \n", "\n", "For this workshop we will be using retail domain recommenders for some of our personalization use cases as well as creating some custom solutions for other use cases. The retail domain recommenders were released at re:Invent 2021. 
You can read more about them in the [Personalize blog](https://aws.amazon.com/blogs/machine-learning/amazon-personalize-announces-recommenders-optimized-for-retail-and-media-entertainment/).\n", "\n", "The retail domain recommenders stipulate some [reserved fields/columns](https://docs.aws.amazon.com/personalize/latest/dg/ECOMMERCE-datasets-and-schemas.html) that we must conform to. For this workshop we will use the `price`, `category`, `style`, `description`, and `gender_affinity` fields from the Retail Demo Store catalog and map them to the `PRICE`, `CATEGORY_L1`, `CATEGORY_L2`, `PRODUCT_DESCRIPTION`, and `GENDER` columns expected by Personalize. Personalize will automatically apply a natural language processing (NLP) machine learning model to the product description column to extract features from the text. The product's unique identifier is also required. \n", "\n", "The following cell creates a new dataframe populated with just the columns we intend to use for the items dataset and then renames the columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_dataset_df = products_df[['id','price','category','style','description','gender_affinity','promoted']]\n", "products_dataset_df = products_dataset_df.rename(columns = {'id':'ITEM_ID',\n", " 'price':'PRICE',\n", " 'category':'CATEGORY_L1',\n", " 'style':'CATEGORY_L2',\n", " 'description':'PRODUCT_DESCRIPTION',\n", " 'gender_affinity':'GENDER',\n", " 'promoted': 'PROMOTED'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect items dataset\n", "\n", "Let's first get a sense for the dimensions of the item catalog dataframe. The following cell displays the number of rows and columns in the dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_dataset_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's make sure that we don't have any missing/null values in any column. Otherwise, Personalize will return an error for columns/fields that it expects to be fully populated. Note that Personalize supports fields that are not fully populated, but the Retail domain recommender fields in the items dataset must be fully populated for all rows. The following cell will display the count of non-null values for each field." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_dataset_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the `GENDER` column has a null value for many rows. This is because the source `gender_affinity` field in the product catalog is used to indicate men's and women's products in the clothing, accessories, and beauty categories and is null for products from other categories such as furniture and electronics. So we must replace all null values for this column with a suitable value indicating that the product is appropriate for any gender. In the following cell we will replace all null values in this column with `Any`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_dataset_df['GENDER'].fillna('Any', inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also notice that the `PROMOTED` column has a null value for many rows. This column is used to indicate which products should be promoted within each category. 
For this demo, we're simply using `true` to indicate that a product should be promoted when making recommendations. We'll see how this is used with Personalize promotional filters in the lab on inference. For now, let's do some data normalization to make sure the data is consistent and predictable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert all \"true\" values to \"Y\"\n", "products_dataset_df.loc[products_dataset_df['PROMOTED'] == 'true', 'PROMOTED'] = 'Y'\n", "# Fill in any null values with \"N\"\n", "products_dataset_df['PROMOTED'].fillna('N', inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's check details on the dataframe again to confirm that all columns are fully populated. Note that Personalize does not require that metadata columns be fully populated, but sparsely populated columns that are included in training can negatively impact accuracy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_dataset_df.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_dataset_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save to CSV and upload to S3 bucket\n", "\n", "Our items dataset is now ready. The following cell will save the dataframe to a local CSV file and then upload the CSV to the staging S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "products_dataset_df.to_csv(items_filename, index=False)\n", "boto3.Session().resource('s3').Bucket(bucket).Object(items_filename).upload_file(items_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Users Service Instance\n", "\n", "Next, let's prepare the users dataset with information on the users pre-loaded into the Retail Demo Store.\n", "\n", "We will be pulling our User data from the [Users Service](https://github.com/aws-samples/retail-demo-store/tree/master/src/users) that is deployed as part of the Retail Demo Store. To connect to this service we will use Service Discovery to discover an instance of the Users Service, and then connect directly to that service instance to access our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = servicediscovery.discover_instances(\n", " NamespaceName='retaildemostore.local',\n", " ServiceName='users',\n", " MaxResults=1,\n", " HealthStatus='HEALTHY'\n", ")\n", "\n", "assert len(response['Instances']) > 0, 'Users service instance not found; check ECS to ensure it launched cleanly'\n", "\n", "users_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']\n", "print('Users Service Instance IP: {}'.format(users_service_instance))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download and Explore the Users Dataset\n", "\n", "Similar to the Products microservice above, we'll query the Users microservice to retrieve information on all pre-loaded users. Along the way we'll remove all users without a gender or persona (i.e. users added to the storefront after deployment)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get('http://{}/users/all?count=10000'.format(users_service_instance))\n", "users = response.json()\n", "users_df = pd.DataFrame(users)\n", "# Remove any users without a persona or gender (i.e. 
possibly created in web UI)\n", "users_df = users_df[(users_df['persona'].str.strip().astype(bool)) | (users_df['gender'].str.strip().astype(bool))]\n", "pd.set_option('display.max_rows', 5)\n", "\n", "users_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare and Upload Data\n", "\n", "Similar to the items dataset we created above, we can provide metadata on our users when training models in Personalize. For this workshop we will include each user's age and gender. Unlike the columns in the items dataset, there are no [column naming requirements for the users dataset](https://docs.aws.amazon.com/personalize/latest/dg/ECOMMERCE-users-dataset.html). As before, we will use uppercase column names to match the Personalize convention, save the data as a CSV, and upload it to our S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users_dataset_df = users_df[['id','age','gender']]\n", "users_dataset_df = users_dataset_df.rename(columns = {'id':'USER_ID','age':'AGE','gender':'GENDER'}) \n", "\n", "users_dataset_df.to_csv(users_filename, index=False)\n", "boto3.Session().resource('s3').Bucket(bucket).Object(users_filename).upload_file(users_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create User-Items Interactions Dataset\n", "\n", "To mimic user behavior, we will be using a pre-generated dataset that represents user interactions with items.\n", "To make the interactions more realistic, a pre-defined shopper persona is used for each user to generate event\n", "types for products matching that persona. The interactions dataset contains events for viewing products, adding products to a cart, checking out, and completing orders. The effect of discounts on interactions depends on a separate \"discount persona\" stored against the user.\n", "\n", "For more information about how the interactions dataset was generated, take a look at the script file at [generate_interactions_personalize.py](../../generators/generate_interactions_personalize.py).\n", "\n", "Don't worry about understanding the nuances of the generator script. In a production workload you'd be using actual user interactions from your clickstream analytics and order management systems rather than generating them as we've done for this demo.\n", "\n", "First, let's download the pre-generated interactions CSV from the S3 bucket used to deploy the Retail Demo Store project in your account." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.resource('s3')\n", "\n", "resource_bucket, resource_path = notebook_util.lookup_resource_bucket_and_path()\n", "assert resource_bucket is not None, 'Resource bucket could not be determined from notebook instance tags'\n", "print(f'Resource bucket: {resource_bucket}')\n", "print(f'Resource path: {resource_path}')\n", "\n", "print('Downloading pre-generated interactions.csv')\n", "s3.Bucket(resource_bucket).download_file(resource_path + 'csvs/interactions.csv', interactions_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Open and Explore the Simulated Interactions Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a closer look at the generated interactions file. 
First let us see a few lines of the raw CSV data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!head -n 5 $interactions_filename" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us load it as a Pandas dataframe. Note:\n", "\n", "- The EVENT_TYPE column can be used to train different Personalize campaigns and also to filter recommendations.\n", "- The custom DISCOUNT column is a contextual metadata field that Personalize reranking campaigns and user personalization recommenders can take into account to predict the best next product based on the user's propensity to interact with discounted products." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interactions_df = pd.read_csv(interactions_filename)\n", "interactions_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chart the counts of each `EVENT_TYPE` generated for the interactions dataset. We're simulating a site where visitors heavily view/browse products and to a lesser degree add products to their cart and check out." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "categorical_attributes = interactions_df.select_dtypes(include = ['object'])\n", "\n", "plt.figure(figsize=(16,3))\n", "chart = sns.countplot(data = categorical_attributes, x = 'EVENT_TYPE')\n", "plt.xticks(rotation=90, horizontalalignment='right')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us plot the user persona vs product category distribution so that you can see which categories get assigned to which types of persona. Note that because of attempts to reach balance between categories and products in the generation script, the proportions do not exactly match those configured." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "merged_df = interactions_df[['USER_ID', 'ITEM_ID']].astype({'USER_ID':str, 'ITEM_ID':str})\n", "merged_df = merged_df.merge(users_df[['id', 'persona']], left_on='USER_ID', right_on='id').drop(columns=['id', 'USER_ID'])\n", "merged_df = merged_df.merge(products_df[['id', 'category']], left_on='ITEM_ID', right_on='id').drop(columns=['id', 'ITEM_ID'])\n", "merged_df\n", "plot_df = merged_df.groupby(['persona', 'category'])['category'].count().unstack()\n", "sns.heatmap(plot_df, annot=True, fmt=\"g\", cmap='viridis')\n", "plt.title('Heatmap of user persona vs product category')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Discount persona vs event type distribution\n", "\n", "Let us see how the event distribution came out. We should see a different uptake of discounts between users with different discount personas."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "merged_df = interactions_df.loc[interactions_df.EVENT_TYPE == 'AddToCart'][['USER_ID', 'DISCOUNT']]\n", "merged_df = merged_df[['USER_ID', 'DISCOUNT']].astype({'USER_ID':str}).merge(users_df, left_on='USER_ID', right_on='id')\n", "\n", "plot_df = merged_df.groupby(['discount_persona', 'DISCOUNT'])[['id']].count().unstack()\n", "plot_df = plot_df.droplevel(axis='columns', level=0)\n", "plot_df.plot.bar()\n", "plt.title('Event types according to discount persona')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Balance over products I\n", "\n", "Let us have a careful look at product and category distributions.\n", "The interactions generation script ensures that there are small groups of products users tend to interact with, to maintain\n", "strong training signals. If you look at the script you will see that although we choose\n", "products randomly within a category, they are interacted with in small random groups in the category.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "merged_df = interactions_df[['ITEM_ID', 'USER_ID']].astype({'ITEM_ID': str}).merge(products_df, left_on='ITEM_ID', right_on='id')\n", "plot_df = merged_df.groupby(['USER_ID', 'category']).id.apply(set)\n", "plot_df.apply(len).value_counts().sort_index().plot.bar()\n", "plt.xlabel('Number of different products examined by user.')\n", "plt.title('We should have reduced users to a small number of products')\n", "# The peak at 1 is the male jewelry - there is only one product\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare and Upload Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's upload the generated interactions CSV to S3 so it can be picked up by Amazon Personalize later in the workshop." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket).Object(interactions_filename).upload_file(interactions_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab 1 Summary - What have we accomplished?\n", "\n", "In this lab we prepared and inspected the three dataset types (items, users, and interactions) that will be used to train models and create recommenders in Amazon Personalize.\n", "\n", "We created a CSV for each dataset type and uploaded them to the Amazon S3 staging bucket created by the Retail Demo Store deployment.\n", "\n", "In the next lab we will start creating resources in Amazon Personalize to receive our dataset files." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Store variables needed in the next lab\n", "\n", "We will pass some variables initialized in this lab by storing them in the notebook environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store Uid\n", "%store bucket\n", "%store items_filename\n", "%store users_filename\n", "%store interactions_filename" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Continue to Lab 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Open [Lab 2](./Lab-2-Prepare-Personalize-and-import-data.ipynb) to continue the workshop." 
] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }