{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Retail Demo Store - Personalization Workshop - Lab 8\n", "\n", "In this module we're going to be adding another sample use case of [Amazon Personalize](https://aws.amazon.com/personalize/): selecting offers or coupons to be providing to users. You might imagine these offers to be of the following kind:\n", "\n", " - Log in to our online store to get 10% off any product!\n", " - Enter PERFECTHAIR to get 3 for 2 on hair products online only! This coupon expires tomorrow!\n", " - and so forth.\n", " \n", "In this notebook we'll ask Amazon Personalize to select a coupon for us that is likely to result in a conversion for the given user. We are going to ask Amazon Personalize to choose between the following coupons:\n", "\n", " - Enter AWS25 to earn a 25% discount on any product in store today!\n", " - Enter AMAZON20 to earn a 20% discount on any product in store today!\n", " - Enter PERSONALIZE15 to earn a 15% discount on any product in store today!\n", " - Enter SAGEMAKER10 to earn a 10% discount on any product in store today!\n", " - Enter SALE5 to earn a 5% discount on any product in store today!\n", " - No offer given.\n", " \n", "One issue that this particular scenario brings to the fore is that personalization as generally practiced currently is designed to maximise conversions but choosing different offers for a user results in different costs to us.\n", "A relatively new feature of Amazon Personalize is [the use of **additional objective functions**](https://docs.aws.amazon.com/personalize/latest/dg/optimizing-solution-for-objective.html).\n", "\n", "With this feature, we can train Amazon Personalize to produce recommendations to maximise any objective we want - such as return on investment, lifetime value, and so forth.\n", "\n", "Alternatively, we can incorporate some decision science into our systems ourselves. In the current context this might mean:\n", "\n", " - Limiting the number of certain types of offers to assign to users (for example, we might allow Personalize to select users for offers rather than selecting offers for users). In the current scenario, however, we want to ensure that every user has one offer.\n", " - Using the \"contextual metadata\" facility of Amazon Personalize to estimate the likely change in behaviour of users under different offer contexts. See the [Personalize discounts workshop](./1.3-Personalized-Discounts.ipynb) notebook for an idea about how such an approach may work. \n", " - Using the scores provided by Personalize as probabilities as part of a decision system where the decision system chooses offers based on expected marginal return or other decision variables such as available coupons and take-up rate. For more about recommendation scores see this [blog article](https://aws.amazon.com/blogs/machine-learning/introducing-recommendation-scores-in-amazon-personalize/). We explore this option more in the present notebook in an empirical way, so that you can see for yourself what kind of scores Amazon Personalize gives back, and how they may be of use.\n", "\n", "\n", "### Simulation \n", "\n", "We are going to simulate a simple dataset on which to train Amazon Personalize so that you can see how such a campaign may be trained with Amazon Personalize. 
Optionally, you can add skew to the data, as might happen when certain offers convert much better than others, so that we can explore how to deal with it via the balancing approaches discussed above.\n", "\n", "Recommended Time: 2 Hours" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "To get started, we need to perform a bit of setup. Walk through each of the following steps to configure your environment to interact with the Amazon Personalize service." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Dependencies and Setup Boto3 Python Clients\n", "\n", "Throughout this workshop we will need access to some common libraries and clients for connecting to AWS services. We also retrieve the Uid from a SageMaker notebook instance tag." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import Dependencies\n", "\n", "import boto3\n", "import json\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import time\n", "import requests\n", "import csv\n", "import sys\n", "import botocore\n", "import uuid\n", "import random\n", "from collections import defaultdict\n", "\n", "from packaging import version\n", "from random import randint\n", "from botocore.exceptions import ClientError\n", "\n", "%matplotlib inline\n", "\n", "# Setup Clients\n", "\n", "personalize = boto3.client('personalize')\n", "personalize_runtime = boto3.client('personalize-runtime')\n", "personalize_events = boto3.client('personalize-events')\n", "\n", "servicediscovery = boto3.client('servicediscovery')\n", "ssm = boto3.client('ssm')\n", "\n", "# Retrieve the Uid tag from this SageMaker notebook instance\n", "with open('/opt/ml/metadata/resource-metadata.json') as f:\n", "    data = json.load(f)\n", "sagemaker = boto3.client('sagemaker')\n", "sagemaker_response = sagemaker.list_tags(ResourceArn=data[\"ResourceArn\"])\n", "for tag in sagemaker_response[\"Tags\"]:\n", "    if tag['Key'] == 'Uid':\n", "        Uid = tag['Value']\n", "        break" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure Bucket and Data Output Location" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be configuring some variables that store the location of our source data. When the Retail Demo Store stack was deployed in this account, an S3 bucket was created for you and the name of this bucket was stored in Systems Manager Parameter Store. Using the Boto3 client we can get the name of this bucket for use within our notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = ssm.get_parameter(\n", "    Name='retaildemostore-stack-bucket'\n", ")\n", "\n", "bucket = response['Parameter']['Value']\n", "interactions_filename = \"offer_interactions.csv\" # This is where we will store our simulated dataset\n", "\n", "print('Bucket: {}'.format(bucket))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get, Prepare, and Upload User, Offer, and Interaction Data\n", "\n", "Amazon Personalize provides predefined recipes, based on common use cases, for training models. A recipe is a machine learning algorithm that you use with settings, or hyperparameters, and the data you provide to train an Amazon Personalize model. The data you provide to train a model is organized into separate datasets by the type of data being provided. A collection of datasets is organized into a dataset group. The three dataset types supported by Personalize are items, users, and interactions. 
Depending on the recipe type you choose, a different combination of dataset types is required. For all recipe types, an interactions dataset is required. Interactions represent how users interact with items: for example, viewing a product, watching a video, listening to a recording, or reading an article. For this workshop, we will be using a recipe that can work with all three dataset types, although we will only provide the interactions dataset.\n", "\n", "When we deployed the Retail Demo Store, it was deployed with an initial seed of fictitious User data as well as Offers. We will use this data to train a model, or solution, in the Amazon Personalize service which will be used to serve offer recommendations for our users. The User and Offer data can be accessed from the Retail Demo Store's Users and Offers microservices, respectively. We will access our data through microservice data APIs, process the data, and use it as the basis of a simulation of user-offer interactions.\n", "\n", "Although Amazon Personalize can do a better job if we also provide it with information (\"metadata\") about the users and items in the interactions data, we do not focus on that capability here, and generate and upload only the interactions dataset. We upload it as a CSV to S3. Once our dataset is in S3, we can import it into the Amazon Personalize service.\n", "\n", "Let's get started." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Offers Service Instance\n", "\n", "We will be pulling our offers data from the Offers Service that was deployed in Amazon Elastic Container Service (ECS) as part of the Retail Demo Store. To connect to this service we will use [AWS Cloud Map](https://aws.amazon.com/cloud-map/)'s Service Discovery to discover an instance of the Offers Service running in ECS, and then connect directly to that service instance to access our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = servicediscovery.discover_instances(\n", "    NamespaceName='retaildemostore.local',\n", "    ServiceName='offers',\n", "    MaxResults=1,\n", "    HealthStatus='HEALTHY'\n", ")\n", "\n", "offers_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']\n", "print('Offers Service Instance IP: {}'.format(offers_service_instance))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download and Explore the Offers Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get('http://{}/offers'.format(offers_service_instance))\n", "offers = response.json()\n", "print(json.dumps(offers, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us convert it to a Pandas dataframe for ease of exploration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "offers_df = pd.DataFrame(offers['tasks'])\n", "pd.set_option('display.max_rows', 5)\n", "offers_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we do not use Amazon Personalize's ability to utilise item metadata, so we have no need to upload this offer data to Amazon Personalize - instead we will use it as the basis for generating user-offer interactions below."
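] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check before we simulate interactions, we can confirm how many offers the service returned and which offer IDs we will be sampling from. This is a minimal, optional sketch; only the `id` column is relied upon later in this notebook, so the other columns shown are purely informational." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Quick sanity check on the offers we will sample from below.\n", "# Only the 'id' column is used when simulating interactions.\n", "print('Number of offers: {}'.format(offers_df.shape[0]))\n", "print('Offer IDs: {}'.format(list(offers_df['id'])))\n", "print('Available columns: {}'.format(list(offers_df.columns)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The offer IDs printed above are the item IDs that Amazon Personalize will see; everything else about the offers is ignored in this notebook."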
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Users Service Instance\n", "\n", "We will be pulling some User data from the [Users Service](https://github.com/aws-samples/retail-demo-store/tree/master/src/users) that is deployed as part of the Retail Demo Store - we mostly just need available user IDs as we are not making use in this workshop of user or item metadata. To connect to this service we will use Service Discovery to discover an instance of the User Service, and then connect directly to that service instance to access our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = servicediscovery.discover_instances(\n", " NamespaceName='retaildemostore.local',\n", " ServiceName='users',\n", " MaxResults=1,\n", " HealthStatus='HEALTHY'\n", ")\n", "\n", "users_service_instance = response['Instances'][0]['Attributes']['AWS_INSTANCE_IPV4']\n", "print('Users Service Instance IP: {}'.format(users_service_instance))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download and Explore the Users Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get('http://{}/users/all?count=10000'.format(users_service_instance))\n", "users = response.json()\n", "print(json.dumps(users[:2], indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us convert it to a Pandas dataframe for ease of exploration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users_df = pd.DataFrame(users)\n", "# Remove any users without a persona or gender (i.e. created in web UI)\n", "users_df = users_df[(users_df['persona'].str.strip().astype(bool)) | (users_df['gender'].str.strip().astype(bool))]\n", "pd.set_option('display.max_rows', 5)\n", "users_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In this example we do not use Amazon Personalize's ability to utilise user or item metadata. So we have no need to upload this data for Amazon Personalize - instead we will use it as the basis of generating user-offer item interactions below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create User-Offers Interactions Dataset\n", "\n", "To mimic user behavior, we will be generating a new dataset that represents user interactions with items - in this use case, items consist of coupon offers to the users that users have made use of.\n", "\n", "We do not consider potential relations between coupons and product context though this is very easy to do in Amazon Personalize using \"contextual metadata\". It is also possible to incorporate different \"event types\" to account for e.g. the difference between clicking on a coupon offer, adding it to a shopping cart, or redeeming it.\n", "\n", "\n", "In the below parameters we determine the size of the dataset (we simulate exactly 3 offer interactions per user) as well as its date range. We keep this particular simulation deterministic so that we can recreate exactly the same dataset again. Also, we provide the option of making the dataset inbalanced - set the `INBALANCED` option to simulate different take up of coupons - e.g. 
if some items are vastly better at driving conversion than others (for example, when comparing a 5% to a 25% discount). We want to show later that it is possible to wrap Personalize with additional business logic - for example, recognising that some items are more profitable to the business than others when they do convert." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "IMBALANCED = False\n", "NUM_INTERACTIONS_PER_USER = 3\n", "\n", "FIRST_TIMESTAMP = 1591803782 # 2020-06-10, 18:43:02\n", "LAST_TIMESTAMP = 1599579782 # 2020-09-08, 18:43:02\n", "RANDOM_SEED = 1\n", "\n", "random.seed(RANDOM_SEED)\n", "np.random.seed(RANDOM_SEED)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "def generate_data(interactions_filename, users_df, offers_df):\n", "    \"\"\"Write simulated user-offer interactions to a file.\"\"\"\n", "\n", "    random.seed(RANDOM_SEED)\n", "    np.random.seed(RANDOM_SEED)\n", "\n", "    num_users = users_df.shape[0]\n", "    num_interactions = NUM_INTERACTIONS_PER_USER * num_users\n", "\n", "    if IMBALANCED:\n", "        # We may wish to assume probability is proportional to ID to show off how we can add\n", "        # business logic around Personalize\n", "        offer_probs = offers_df.id.values.astype(float)\n", "    else:\n", "        # Or we can work around imbalance at the data munging stage\n", "        offer_probs = np.ones(len(offers_df.id.values), dtype=float)\n", "\n", "    # Normalise so that we have probabilities\n", "    offer_probs = offer_probs / offer_probs.sum()\n", "\n", "    # generate timestamps\n", "    time_between_events = (LAST_TIMESTAMP - FIRST_TIMESTAMP) / num_interactions\n", "    timestamps = np.arange(FIRST_TIMESTAMP, LAST_TIMESTAMP, time_between_events).astype(int)\n", "    # pre-shuffle them as we will be using them as a randomising key when we sort by timestamp\n", "    np.random.shuffle(timestamps)\n", "\n", "    # generate all user IDs\n", "    sample_user_ids = np.tile(users_df['id'].values.astype(int), NUM_INTERACTIONS_PER_USER)\n", "    # only one event type\n", "    event_type = ['OfferConverted'] * num_interactions\n", "\n", "    # we sort the offers to ensure there is a correlation between user ID and offer ID\n", "    sampled_offers = sorted(np.random.choice(offers_df.id.values, num_interactions, p=offer_probs))\n", "\n", "    interactions_df = pd.DataFrame({\n", "        'ITEM_ID': sampled_offers,\n", "        'USER_ID': sample_user_ids,\n", "        'EVENT_TYPE': event_type,\n", "        'TIMESTAMP': timestamps\n", "    })\n", "\n", "    # by sorting by timestamp, the other columns get shuffled\n", "    interactions_df = interactions_df.sort_values('TIMESTAMP')\n", "\n", "    with open(interactions_filename, 'w') as outfile:\n", "        interactions_df.to_csv(outfile, index=False)\n", "\n", "    globals().update(locals())  # Expose locals for inspection in the console after the script has run.\n", "    print('Generation script finished - created offers dataset')\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# In this cell, we run the script.\n", "generate_data(interactions_filename, users_df, offers_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Open and Explore the Interactions Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "!head \"$interactions_filename\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"interactions_df = pd.read_csv(interactions_filename)\n", "interactions_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interactions_df.ITEM_ID.value_counts().sort_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interactions_df.ITEM_ID.value_counts().sort_index().plot.bar()\n", "plt.title('Distribution of interactions by offer ID')\n", "plt.xlabel('Offer ID')\n", "plt.ylabel('Interaction count')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interactions_df[['ITEM_ID', 'USER_ID']].plot.scatter(x='USER_ID', y='ITEM_ID', alpha=0.01)\n", "plt.title('Item vs User ID correlation scatter plot')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare and Upload Data\n", "\n", "We push the data to an S3 bucket for Personalize to grab it from for training." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket).Object(interactions_filename).upload_file(interactions_filename)\n", "bucket, interactions_filename" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure Amazon Personalize\n", "\n", "Now that we've prepared our dataset and uploaded it to S3 we'll need to configure the Amazon Personalize service to understand our data so that it can be used to train models for generating recommendations.\n", "\n", "If Personalize auto-deployment was enabled when you deployed the demo, the below logic was run already by polling AWS Lambda function whose code is in the file `src/aws-lambda/personalize-pre-create-resources/personalize-pre-create-resources.py`. If you did not enable auto-deployment, you can run the code below to train your Personalize campaigns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Schemas for Dataset\n", "\n", "Amazon Personalize requires a schema for its dataset so it can map the columns in our CSVs to fields for model training. Each schema is declared in JSON using the [Apache Avro](https://avro.apache.org/) format.\n", "\n", "Let's define and create a schema in Personalize for our interactions dataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Interactions Dataset Schema" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interactions_schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Interactions\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"USER_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"EVENT_TYPE\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"TIMESTAMP\",\n", " \"type\": \"long\"\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", "\n", "try:\n", " create_schema_response = personalize.create_schema(\n", " name = \"retaildemooffers-schema-interactions\",\n", " schema = json.dumps(interactions_schema)\n", " )\n", " interactions_schema_arn = create_schema_response['schemaArn']\n", " print(json.dumps(create_schema_response, indent=2))\n", "except personalize.exceptions.ResourceAlreadyExistsException:\n", " print('You aready created this schema, seemingly')\n", " schemas = personalize.list_schemas()['schemas']\n", " for schema_response in schemas:\n", " if schema_response['name'] == \"retaildemooffers-schema-interactions\":\n", " interactions_schema_arn = schema_response['schemaArn']\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Wait for Dataset Group\n", "\n", "Next we need to create the dataset group that will contain our three datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset Group" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_group_response = personalize.create_dataset_group(\n", " name = 'retaildemooffers-notebook'\n", ")\n", "dataset_group_arn = create_dataset_group_response['datasetGroupArn']\n", "print(json.dumps(create_dataset_group_response, indent=2))\n", "\n", "print(f'DatasetGroupArn = {dataset_group_arn}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Dataset Group to Have ACTIVE Status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status = None\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_group_response = personalize.describe_dataset_group(\n", " datasetGroupArn = dataset_group_arn\n", " )\n", " status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n", " print(\"DatasetGroup: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Interactions Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_type = \"INTERACTIONS\"\n", "create_dataset_response = personalize.create_dataset(\n", " name = \"retaildemooffers-notebook-INTERACTIONS\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_arn,\n", " schemaArn = interactions_schema_arn\n", ")\n", "\n", "interactions_dataset_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Datasets to Personalize\n", "\n", "Up to this point we have generated CSVs containing data for our users, items, and interactions and staged them in an S3 bucket. 
We also created a schema in Personalize that defines the columns in our CSV. Then we created a dataset group and a dataset in Personalize that will receive our data. In the following steps we will create an import job with Personalize that will import the dataset from our S3 bucket into the service." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect permissions\n", "\n", "By default, the Personalize service does not have permission to access the data we uploaded into the S3 bucket in our account. In order to grant the Personalize service access to read our CSV, we need to set a Bucket Policy and create an IAM role that the Amazon Personalize service will assume.\n", "\n", "The deployment process for the Retail Demo Store has already set up these resources for you. However, let's take a look at the bucket policy and IAM role to see the required permissions.\n", "\n", "We'll start by displaying the bucket policy on the S3 staging bucket where we uploaded the CSV. Note the service principal of `personalize.amazonaws.com` and the actions allowed on the staging bucket. The `s3:GetObject` action is needed for import jobs to allow Personalize to read objects from the bucket, and the `s3:PutObject` action is used for export jobs, batch inference jobs, and batch segment jobs to allow Personalize to write output files to the bucket. The `s3:ListBucket` action allows Personalize to list the contents of a folder." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client(\"s3\")\n", "\n", "response = s3.get_bucket_policy(Bucket = bucket)\n", "print(json.dumps(json.loads(response['Policy']), indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's look at the IAM role that Personalize will need to assume to access the S3 bucket. Again, this role was created for you during the Retail Demo Store deployment. We'll start by inspecting the role itself. Notice the same service principal as the bucket policy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iam = boto3.client(\"iam\")\n", "\n", "role_name = Uid+\"-PersonalizeS3\"\n", "\n", "response = iam.get_role(RoleName = role_name)\n", "role_arn = response['Role']['Arn']\n", "print(json.dumps(response['Role'], indent=2, default = str))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we'll get the inline policy named `BucketAccess` that has the same S3 permissions as the bucket policy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = iam.get_role_policy(RoleName = role_name, PolicyName = 'BucketAccess')\n", "print(json.dumps(response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Import Job\n", "\n", "With the permissions in place to allow Personalize to access our CSV file, let's create an import job to import the interactions file into its dataset. The import job can take several minutes to complete, so we'll create it and then wait for it to finish."
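] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before kicking off the import job, we can optionally confirm that the interactions dataset itself has finished creating; the wait loop above covered the dataset group but not the dataset. This is a small, optional sketch using the `describe_dataset` API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: confirm the interactions dataset is ACTIVE before importing data into it.\n", "# Dataset creation is usually quick, so this should report ACTIVE almost immediately.\n", "describe_dataset_response = personalize.describe_dataset(datasetArn=interactions_dataset_arn)\n", "print('Interactions dataset status: {}'.format(describe_dataset_response['dataset']['status']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the status is not yet ACTIVE, wait a few seconds and re-run this cell before creating the import job below."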
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Interactions Dataset Import Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interactions_create_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"retaildemooffers-dataset-interactions-import-job\",\n", " datasetArn = interactions_dataset_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket, interactions_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "interactions_dataset_import_job_arn = interactions_create_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(interactions_create_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait for Import Jobs to Complete\n", "\n", "It will take 10-15 minutes for the import jobs to complete, while you're waiting you can learn more about Datasets and Schemas here: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Items Import Job to Complete" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "job_arn = interactions_dataset_import_job_arn\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " import_job_response = personalize.describe_dataset_import_job(\n", " datasetImportJobArn = job_arn\n", " )\n", " status = import_job_response[\"datasetImportJob\"]['status']\n", "\n", " if status == \"ACTIVE\":\n", " print(f'Import job {job_arn} successfully completed')\n", " break\n", " elif status == \"CREATE FAILED\":\n", " print(f'Import job {job_arn} failed')\n", " if import_job_response.get('failureReason'):\n", " print(' Reason: ' + import_job_response['failureReason'])\n", " break\n", " \n", " time.sleep(15)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Solution\n", "\n", "We can now turn to training a model to predict offers for a user. This involves training a model involves creating a Solution and Solution Version.\n", "\n", "We have the choice between using the personalized recommendations and personalized ranking recipe - both of these return a set of scores, the main difference being that for the latter a list of items is provided. With these scores, it is possible to be more flexible with incorporating postprocessing logic. For more details: https://aws.amazon.com/blogs/machine-learning/introducing-recommendation-scores-in-amazon-personalize/\n", "\n", "These two recipes are as below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "recommend_recipe_arn = \"arn:aws:personalize:::recipe/aws-hrnn-metadata\"\n", "ranking_recipe_arn = \"arn:aws:personalize:::recipe/aws-personalized-ranking\"\n", "recommend_recipe_arn = \"arn:aws:personalize:::recipe/aws-user-personalization\"\n", "\n", "offer_recipe_arn = recommend_recipe_arn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These recipes also have different internal implementations and the way they generate scores can differ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Solutions and Solution Versions\n", "\n", "With our recipes defined, we can now create our solution (training config) and solution version (trained model)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Offer Recommendation Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_solution_response = personalize.create_solution(\n", " name = \"retaildemo-offers-personalization\",\n", " datasetGroupArn = dataset_group_arn,\n", " recipeArn = offer_recipe_arn\n", ")\n", "\n", "offers_solution_arn = create_solution_response['solutionArn']\n", "print(json.dumps(create_solution_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Offer Recommendation Solution Version" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_solution_version_response = personalize.create_solution_version(\n", " solutionArn = offers_solution_arn\n", ")\n", "\n", "offers_solution_version_arn = create_solution_version_response['solutionVersionArn']\n", "print(json.dumps(create_solution_version_response, indent=2))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wait for Solution Version to Complete\n", "\n", "Here we wait for the solution version to be created (training to finish)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Offers Solution Version to Have ACTIVE Status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "soln_ver_arn = offers_solution_version_arn\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " soln_ver_response = personalize.describe_solution_version(\n", " solutionVersionArn = soln_ver_arn\n", " )\n", " status = soln_ver_response[\"solutionVersion\"][\"status\"]\n", "\n", " if status == \"ACTIVE\":\n", " print(f'Solution version {soln_ver_arn} successfully completed')\n", " break\n", " elif status == \"CREATE FAILED\":\n", " print(f'Solution version {soln_ver_arn} failed')\n", " if soln_ver_response.get('failureReason'):\n", " print(' Reason: ' + soln_ver_response['failureReason'])\n", " break\n", " \n", " time.sleep(15)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Offline Metrics for Solution Version\n", "\n", "Let's retrieve the metrics for the solution version we just created.\n", "\n", "The easiest of these to understand is the \"mean reciprocal rank at 25\" (`mean_reciprocal_rank_at_25`). It is the mean (over test data) reciprocal of the rank of the actual next item (offer) a user interacted with from the 1st 25 items returned by the recommender.\n", "\n", "The metrics are described in more detail [here](https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_solution_metrics_response = personalize.get_solution_metrics(\n", " solutionVersionArn = offers_solution_version_arn\n", ")\n", "\n", "print(json.dumps(get_solution_metrics_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Campaigns\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get batch recommendations from our trained solution but we may want to grab recommendations from our trained model in real-time. Creating a campaign is equivalent to deploying our trained model so that it gives us a real-time endpoint." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_campaign_response = personalize.create_campaign(\n", " name = \"retaildemooffers-personalization\",\n", " solutionVersionArn = offers_solution_version_arn,\n", " minProvisionedTPS = 1\n", ")\n", "\n", "offers_campaign_arn = create_campaign_response['campaignArn']\n", "print(json.dumps(create_campaign_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for campaign to Have ACTIVE Status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "campaign_arn = offers_campaign_arn\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " campaign_response = personalize.describe_campaign(\n", " campaignArn = campaign_arn\n", " )\n", " status = campaign_response[\"campaign\"][\"status\"]\n", "\n", " if status == \"ACTIVE\":\n", " print(f'Campaign {campaign_arn} successfully completed')\n", " break\n", " elif status == \"CREATE FAILED\":\n", " print(f'Campaign {campaign_arn} failed')\n", " if campaign_response.get('failureReason'):\n", " print(' Reason: ' + campaign_response['failureReason'])\n", " break\n", " time.sleep(15)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test Campaigns\n", "\n", "Now that our campaign has been fully created, let's test it and evaluate it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Select a user\n", "\n", "We'll just pick a random user for simplicity. Feel free to change the `user_id` below and execute the following cells with a different user to get a sense for how the recommendations change. Let's print out the user details - in our situation it is only the user ID that might be relevant as we only used the user ID and no other metadata about the user to alter how interactions are generated." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "user_id = 256\n", "\n", "response = requests.get('http://{}/users/id/{}'.format(users_service_instance, user_id))\n", "user = response.json()\n", "print(json.dumps(user, indent=4, sort_keys=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get Offer Recommendations for User\n", "\n", "Now let's call Amazon Personalize to get recommendations for our user from the offers recommendations campaign." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_recommendations_response = personalize_runtime.get_recommendations(\n", " campaignArn = offers_campaign_arn,\n", " userId = str(user_id),\n", " numResults = 5\n", ")\n", "\n", "item_list = get_recommendations_response['itemList']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(json.dumps(item_list, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that in this response we have `score` field returned with each `itemId`.\n", "For all recipes except SIMS and Popularity-Count, Personalize\n", "[calculates a score](https://docs.aws.amazon.com/personalize/latest/dg/getting-real-time-recommendations.html)\n", "for each recommended item **across all items in your dataset**.\n", "Score values are between 0.0 and 1.0 and the sum of all scores across all items will total to 1.0.\n", "Therefore, the absolute value of scores will be smaller for larger item catalogs.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluate results distribution\n", "\n", "We are interested in seeing what distribution of scores items our recommender is returning\n", "considering the dataset we provided Personalize above. Let us call the real-time recommender a few times in order to get a feel for the distribution. We can plot the distribution of scores and of top-ranked recommendations.\n", "\n", "We can also provide a multiplier for each offer type that we multiply into the score. The multiplier can reflect business logic, such as relative expected return on each offer. Play with the multiplier calculated below to see different results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "num_users_to_get_offers_for = 100\n", "offers_df = offers_df.set_index('id', drop=False) # Let us index our offers dataframe by offer ID.\n", "# For example, we might like to weight inverse to ID. 
\n", "# You can play with these multipliers or try different schemes to get a feel for the range of possibilities here.\n", "adjustment_multipliers = 1 / offers_df.id.astype(float) \n", "\n", "# Depending on whether we chose to use the ranking or user personalization recipe, \n", "# we get slightly different data back from the real-time campaign.\n", "if offer_recipe_arn == recommend_recipe_arn:\n", " return_key = 'itemList'\n", "elif offer_recipe_arn == ranking_recipe_arn:\n", " return_key = 'personalizedRanking'\n", "\n", "# This is going to keep all the scores returned for further exploration.\n", "all_user_scores = []\n", "# After we apply our adjustment we get different scores - we can check how they look here.\n", "all_adjusted_scores = []\n", "\n", "sum_user_scores = defaultdict(float)\n", "sum_adjusted_scores = defaultdict(float)\n", "\n", "# We are going to plot as we go.\n", "plt.figure(figsize=[15,5])\n", "import time\n", "for user_id in users_df.id.sample(n=num_users_to_get_offers_for):\n", " if offer_recipe_arn == recommend_recipe_arn:\n", " get_recommendations_response = personalize_runtime.get_recommendations(\n", " campaignArn = offers_campaign_arn,\n", " userId = str(user_id),\n", " numResults = offers_df.shape[0]\n", " )\n", " elif offer_recipe_arn == ranking_recipe_arn:\n", " get_recommendations_response = personalize_runtime.get_personalized_ranking(\n", " campaignArn = offers_campaign_arn,\n", " userId = str(user_id),\n", " inputList=list(offers_df.id.astype(str))\n", " )\n", "\n", " time.sleep(0.01) # rate limit\n", " \n", " sum_scores = sum(float(item['score']) for item in get_recommendations_response[return_key])\n", " \n", " # Here we extract all the scores and put them into an offer: score dictionary.\n", " user_scores = {int(item['itemId']):float(item['score'])/sum_scores for item in get_recommendations_response[return_key]}\n", " \n", " # Here we see what would happen to our scores if we applied an adjustment multiplier.\n", " adusted_scores = {key: value * adjustment_multipliers[key] for key, value in user_scores.items()}\n", "\n", " for key in user_scores:\n", " sum_user_scores[key] += user_scores[key]\n", " sum_adjusted_scores[key] += adusted_scores[key]\n", " \n", " all_user_scores.append(user_scores)\n", " all_adjusted_scores.append(adusted_scores)\n", "\n", "# Let us plot every single score received.\n", "plt.plot(sum((list(user_scores.keys()) for user_scores in all_user_scores), []), \n", " sum((list(user_scores.values()) for user_scores in all_user_scores), []), \n", " 'r.', label='All Scores from Personalize')\n", "plt.plot(sum((list(adusted_scores.keys()) for adusted_scores in all_adjusted_scores), []), \n", " sum((list(adusted_scores.values()) for adusted_scores in all_adjusted_scores), []), \n", " 'b.', label='All Adjusted Scores from Personalize')\n", "\n", " \n", "# Turn the sums into averages so we can plot them against the individual score distribution.\n", "all_user_sum = sum(sum_user_scores.values())\n", "all_adjusted_sum = sum(sum_user_scores.values())\n", "for key in sum_user_scores:\n", " sum_user_scores[key] = sum_user_scores[key] / all_user_sum\n", " sum_adjusted_scores[key] = sum_adjusted_scores[key] / all_adjusted_sum\n", "\n", "# Let us plot the average scores, so we can see how well our adjustment technique is doing.\n", "plt.plot(list(sum_user_scores.keys()), list(sum_user_scores.values()), 'm_', markersize=35, label='Mean Score from Personalize')\n", "plt.plot(list(sum_adjusted_scores.keys()), list(sum_adjusted_scores.values()), 
'g_', markersize=35, label='Mean Adjusted Score from Personalize')\n", "\n", "plt.ylabel(\"Score\")\n", "plt.xlabel(\"Offer ID\")\n", "plt.legend()\n", "plt.ticklabel_format(style='plain', axis='x', useOffset=False)\n", "plt.title('Distribution of scores returned')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us examine the distribution of scores using a histogram." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "fig, ax = plt.subplots(5, constrained_layout=True, figsize=[5,5])\n", "plotdf = pd.DataFrame(all_user_scores)\n", "bins = np.arange(0, 0.4, 0.001)\n", "for offer_id in range(1, 6):\n", "    plotind = offer_id - 1\n", "    ax[plotind].set_xlim([0, 0.4])\n", "    plotdf[[offer_id]].plot.hist(bins=bins, alpha=0.5, ax=ax[plotind])\n", "    if offer_id != 5:\n", "        ax[plotind].set_xticklabels([])\n", "    if offer_id != 3:\n", "        ax[plotind].set_ylabel('')\n", "    ax[plotind].set_title(f'Offer {offer_id}')\n", "    ax[plotind].get_legend().remove()\n", "\n", "fig.suptitle('Histograms of scores for each offer ID')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Location Services demo\n", "\n", "A campaign similar to this personalized offers campaign is used in the Location Services demo; coupons are selected by Amazon Personalize and forwarded to relevant customers using Amazon Pinpoint's segmentation, campaign, SMS, and email capabilities, as well as web push via Amazon API Gateway. These coupons give users motivation to engage with the location-aware functionality in the Retail Demo Store and to get into the physical store itself for a truly local experience.\n", "\n", "For more information on that demo, see \"Retail Geofencing and Location-aware Personalization\" in the Demo Guide. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lab complete\n", "\n", "Congratulations! You have completed the Retail Demo Store personalized offers lab." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }