{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ReInvent 2019 Amazon Personalize Workshop\n", "\n", "![MainDiagram](static/imgs/image.png)\n", "\n", "## Agenda\n", "\n", "At a high level working with Amazon Personalize follows the steps in the diagram below:\n", "\n", "![FlowDiagram](static/imgs/personalize_process.png)\n", "\n", "The specific process that you will be following is:\n", "\n", "1. Imports and Setup\n", "1. Preparing Your Data\n", "1. Importing Your Data\n", "1. Selecting a Recipe\n", "1. Training a Solution\n", "1. Deploying a Campaign\n", "1. Getting Recommendations\n", "1. Real-Time Interactions\n", "1. Conclusion\n", "1. Bonus: Bulk Export of Recommendations\n", "\n", "In each of these steps you'll see and execute code snippets written in Python using our Boto3 SDK, these snippets can be modified to be components of your production integration with Personalize. \n", "\n", "This notebook will walk you through the steps to build a recommendation model for movies that are tailored to specific users, the goal is to recommend movies based on that users' history of positive interactions with movies.\n", "\n", "The data is provided via the [MovieLens Project](https://movielens.org), you can read more about it later if you are interested.\n", "\n", "The content below has been written to guide you through the process within the timeline of a workshop session. You can deploy the CloudFormation template you found on GitHub later to your own account and can walk through notebooks 1-3 to do the same exercise again. Additionally you can update the content in the cells to reflect your own data and it is an effective path to building a custom recommendaiton model for your use case.\n", "\n", "## How to Use the Notebook\n", "\n", "Code is broken up into cells like the one below. There's a triangular `Run` button at the top of this page you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` or `Shift` + `Return`(Mac Users) while in the cell to execute it and move onto the next one.\n", "\n", "\n", "As a cell is executing you'll notice a line to the side showcase an `*` while the cell is running or it will update to a number to indicate the last cell that completed executing after it has finished exectuting all the code within a cell.\n", "\n", "\n", "Simply follow the instructions below and execute the cells to get started with Amazon Personalize.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports and Setup \n", "\n", "Python ships with a broad collection of libraries and we need to import those as well as the ones installed to help us like boto3(The AWS SDK) and Pandas/Numpy which are core data science tools. The cell below will import them for use here. It will also update the boto3 library to the latest version. You may see a warning(yellow text) or an error(red text), this is perfectly fine, just continue running the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Update boto3\n", "!pip install --upgrade boto3\n", "# Imports\n", "import boto3\n", "import json\n", "import numpy as np\n", "import pandas as pd\n", "import time\n", "import datetime\n", "import uuid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next you will want to validate that your environment can communicate successfully with Amazon Personalize, the lines below do just that." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Configure the SDK to Personalize:\n", "personalize = boto3.client('personalize')\n", "personalize_runtime = boto3.client('personalize-runtime')\n", "personalize_events = boto3.client(service_name='personalize-events')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last bit of setup is to determine the region you are running this workshop in. The cell below will do that and assign it to the `region` variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n", " data = json.load(notebook_info)\n", " resource_arn = data['ResourceArn']\n", " region = resource_arn.split(':')[3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing Your Data\n", "\n", "To begin you will need a collection of data points where users have interacted with content in some way. Amazon Personalize assumes that if an interaction is recorded it is a positive one, you will see how we use that later. \n", "\n", "The cell below will download the data you need, extract the content from a Zip file, and then display a small portion of it to your screen." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -N http://files.grouplens.org/datasets/movielens/ml-100k.zip\n", "!unzip -o ml-100k.zip\n", "data = pd.read_csv('./ml-100k/u.data', sep='\\t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'], engine='python')\n", "pd.set_option('display.max_rows', 5)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see the data contains a UserID, ItemID, Rating, and Timestamp.\n", "\n", "We do not need the `Rating` column so it will be removed before we save the file.\n", "\n", "Once the data has been prepared you will save it to a CSV file locally with the last line." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data[['USER_ID', 'ITEM_ID', 'TIMESTAMP']] # select columns that match the columns in the schema below\n", "filename = \"movie-lens-ml-100k-prepared.csv\"\n", "data.to_csv(filename, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before you can upload the data into S3 you will need to create a bucket, the cell below will do that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(region)\n", "s3 = boto3.client('s3')\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "bucket_name = account_id + \"reinventpersonalizeworkshop\"\n", "print(bucket_name)\n", "if region != \"us-east-1\":\n", " s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})\n", "else:\n", " s3.create_bucket(Bucket=bucket_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can upload the file to S3 and it will be ready to import into Amazon Personalize." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket_name).Object(filename).upload_file(filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing Your Data\n", "\n", "The steps for importing your data are:\n", "\n", "1. Create a Dataset Group.\n", "1. Determine the schema for your dataset.\n", "1. 
"1. Create a Dataset using your schema.\n", "1. Run an ImportJob to load the data for use with Personalize.\n", "\n", "\n", "In Amazon Personalize, a Dataset Group is how your information is isolated from any other experiment. No information is shared between these groups at all. A Dataset Group can contain your interaction data, item metadata, user metadata, event trackers, solutions, and campaigns. To learn more about them: https://docs.aws.amazon.com/personalize/latest/dg/API_DatasetGroup.html \n", "\n", "Begin by creating a Dataset Group." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset Group" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_group_response = personalize.create_dataset_group(\n", " name = \"personalize-RI-demo\"\n", ")\n", "\n", "dataset_group_arn = create_dataset_group_response['datasetGroupArn']\n", "print(json.dumps(create_dataset_group_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dataset Group Create Loop\n", "\n", "It takes a few seconds to create a DatasetGroup, so the cell below will poll until the DatasetGroup is active before you continue with the workshop. Once it says `ACTIVE`, move to the `Create Schema` step.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_group_response = personalize.describe_dataset_group(\n", " datasetGroupArn = dataset_group_arn\n", " )\n", " status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n", " print(\"DatasetGroup: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Schema\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can determine the Schema for your Dataset. Time is limited in this workshop, so you will only be using user-item interaction data (interaction data for short). The cell below contains the required fields for this dataset and maps to the structure of the CSV that you uploaded earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Interactions\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"USER_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"TIMESTAMP\",\n", " \"type\": \"long\"\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below will create a schema inside Amazon Personalize that can be connected to a Dataset; this is how Personalize understands the content within your CSV file."
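, "\n", "\n", "As an optional sanity check before importing, a minimal sketch using the `data` and `schema` objects already defined in this notebook confirms that the CSV's column names and order line up with the schema's fields:\n", "\n", "```python\n", "# Optional: the CSV header should match the schema field names, in order.\n", "assert list(data.columns) == [field['name'] for field in schema['fields']]\n", "```"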
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_schema_response = personalize.create_schema(\n", " name = \"personalize-ri-demo-schema\",\n", " schema = json.dumps(schema)\n", ")\n", "\n", "schema_arn = create_schema_response['schemaArn']\n", "print(json.dumps(create_schema_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next create the Dataset for interactions and assign it the Schema provided and assign it to the Dataset Group you created earlier with this code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_type = \"INTERACTIONS\"\n", "create_dataset_response = personalize.create_dataset(\n", " name = \"personalize-ri-interactions\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_arn,\n", " schemaArn = schema_arn\n", ")\n", "\n", "dataset_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Attach Policy to S3 Bucket" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before you can execute the import job for your data you will need to attach a bucket policy to your S3 bucket to allow Personalize to communicate with it, as well as an IAM role for the service to use within this AWS account." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client(\"s3\")\n", "\n", "policy = {\n", " \"Version\": \"2012-10-17\",\n", " \"Id\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Statement\": [\n", " {\n", " \"Sid\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"personalize.amazonaws.com\"\n", " },\n", " \"Action\": [\n", " \"s3:*Object\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::{}\".format(bucket_name),\n", " \"arn:aws:s3:::{}/*\".format(bucket_name)\n", " ]\n", " }\n", " ]\n", "}\n", "\n", "s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just a note, IAM changes in the cell below take just a few seconds to be confirmed so this section will take 1 minute to complete. Execute the cell below next." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Personalize Role" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iam = boto3.client(\"iam\")\n", "\n", "role_name = \"PersonalizeRoleRIDemo\"\n", "assume_role_policy_document = {\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"personalize.amazonaws.com\"\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " }\n", " ]\n", "}\n", "\n", "create_role_response = iam.create_role(\n", " RoleName = role_name,\n", " AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)\n", ")\n", "\n", "# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes \"personalize\" or \"Personalize\" \n", "# if you would like to use a bucket with a different name, please consider creating and attaching a new policy\n", "# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role\n", "policy_arn = \"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess\"\n", "iam.attach_role_policy(\n", " RoleName = role_name,\n", " PolicyArn = policy_arn\n", ")\n", "\n", "# Now add S3 support\n", "iam.attach_role_policy(\n", " PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',\n", " RoleName=role_name\n", ")\n", "time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate\n", "\n", "role_arn = create_role_response[\"Role\"][\"Arn\"]\n", "print(role_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At last you are ready to import the data into Personalize, the first cell below will start the process and the second contains a while loop that will poll the service to determine when the import job has completed. \n", "\n", "During this workshop you are importing a relatively small dataset and it may seem like this takes a while for such a small file. The leading use of time here is provisioning dedicated resources that will actually run the task. This allows Personalize to provide HIPAA compliance for example, it also means that just because your files are larger that it won't take an extremely long time, once the resources are up and running the import is pretty quick.\n", "\n", "It may take up to 20 minutes for the import process to complete. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset Import Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"personalize-ri-import\",\n", " datasetArn = dataset_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dataset Import Loop\n", "\n", "The cell below will take a bit longer to complete, it is polling to learn when your Dataset has been fully imported into Personalize. This should take around 20 minutes to complete. Most of the time is spent provisioning infrastructure behind the scenes. 
"\n", "Run the cell below, then move on when it reaches an `ACTIVE` state.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_time = datetime.datetime.now()\n", "print(\"Import Started on: \", current_time.strftime(\"%I:%M:%S %p\"))\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", " datasetImportJobArn = dataset_import_job_arn\n", " )\n", " status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n", " print(\"DatasetImportJob: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)\n", " \n", "current_time = datetime.datetime.now()\n", "print(\"Import Completed on: \", current_time.strftime(\"%I:%M:%S %p\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting a Recipe\n", "\n", "Inside Personalize, the available algorithms are called recipes; the code below will give you a full list of them. Each has its own use cases, and detailed information can be found here: https://docs.aws.amazon.com/personalize/latest/dg/working-with-predefined-recipes.html\n", "\n", "In this workshop you will be using HRNN (Hierarchical Recurrent Neural Network). Quoting the docs:\n", "\n", "```\n", "HRNN is a hierarchical recurrent neural network, which can model the user-item interactions across a given timeframe. Use the HRNN recipe when user behavior changes over time, which is referred to as the evolving intent problem.\n", "\n", "To train a model, HRNN uses the Interactions dataset from a dataset group. A dataset group is a set of related datasets, which can include the Users, Items, and Interactions datasets.\n", "```\n", "\n", "A paper explaining it in more detail as it relates to recommendations: https://openreview.net/pdf?id=ByzxsrrkJ4 \n", "\n", "\n", "It is used here as an example of a recommendation system built using deep neural networks, and it can be trained using only our interactions data. This is a great place to start when running your own experiments later as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "personalize.list_recipes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below selects the `HRNN` recipe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recipe_arn = \"arn:aws:personalize:::recipe/aws-hrnn\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training a Solution\n", "\n", "Within Amazon Personalize, a model trained on customer data is called a Solution. Solutions are versioned, so you will first create a Solution, then a Solution Version; the versions are used to track model improvement over time as newer data becomes available.\n", "\n", "Creating a Solution itself is nearly instantaneous; however, the actual training to create a version can take a bit of time. This is usually the longest waiting period in the process. If you are doing this outside of a workshop, it is a great time to check your email, grab a coffee, etc."
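, "\n", "\n", "If you re-run this section later and want to see which Solutions already exist in the dataset group, a minimal sketch using the standard `list_solutions` call:\n", "\n", "```python\n", "# Optional: list any Solutions already created in this dataset group.\n", "for soln in personalize.list_solutions(datasetGroupArn=dataset_group_arn)['solutions']:\n", "    print(soln['name'], soln['status'])\n", "```"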
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_solution_response = personalize.create_solution(\n", " name = \"personalize-ri-soln-hrnn\",\n", " datasetGroupArn = dataset_group_arn,\n", " recipeArn = recipe_arn\n", ")\n", "\n", "solution_arn = create_solution_response['solutionArn']\n", "print(json.dumps(create_solution_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Solution Version" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_solution_version_response = personalize.create_solution_version(\n", " solutionArn = solution_arn\n", ")\n", "\n", "solution_version_arn = create_solution_version_response['solutionVersionArn']\n", "print(json.dumps(create_solution_version_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Solution Version Loop\n", "\n", "Training a model can take a bit of time, this will often take at least 20 minutes to complete. Run the cell below, again waiting for it to reach an `ACTIVE` state before moving to the next bit." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_time = datetime.datetime.now()\n", "print(\"Training Started on: \", current_time.strftime(\"%I:%M:%S %p\"))\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_solution_version_response = personalize.describe_solution_version(\n", " solutionVersionArn = solution_version_arn\n", " )\n", " status = describe_solution_version_response[\"solutionVersion\"][\"status\"]\n", " print(\"SolutionVersion: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)\n", " \n", "current_time = datetime.datetime.now()\n", "print(\"Training Completed on: \", current_time.strftime(\"%I:%M:%S %p\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get Metrics of Solution Version\n", "\n", "Now that your solution and version exists, you can obtain the metrics for it to judge its performance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_solution_metrics_response = personalize.get_solution_metrics(\n", " solutionVersionArn = solution_version_arn\n", ")\n", "\n", "print(json.dumps(get_solution_metrics_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create and Wait for the Campaign\n", "\n", "Now that you have a working solution version you will need to create a campaign to use it with your applications. A campaign is simply a hosted copy of your model. Again there will be a short wait so after executing you can take a quick break while the infrastructure is being provisioned." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Campaign" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_campaign_response = personalize.create_campaign(\n", " name = \"personalize-ri-camp\",\n", " solutionVersionArn = solution_version_arn,\n", " minProvisionedTPS = 1\n", ")\n", "\n", "campaign_arn = create_campaign_response['campaignArn']\n", "print(json.dumps(create_campaign_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Campaign to Have ACTIVE Status\n", "\n", "#### Create a Campaign Loop\n", "\n", "In this section Personalize is deploying your model and that takes a few minutes to complete. Execute the cell below to agian poll until the task has completed. It should take 10 to 15 minutes. Once the cell has reached an `ACTIVE` state, continue on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_time = datetime.datetime.now()\n", "print(\"Deploying Started on: \", current_time.strftime(\"%I:%M:%S %p\"))\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_campaign_response = personalize.describe_campaign(\n", " campaignArn = campaign_arn\n", " )\n", " status = describe_campaign_response[\"campaign\"][\"status\"]\n", " print(\"Campaign: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)\n", " \n", "current_time = datetime.datetime.now()\n", "print(\"Deploying Completed on: \", current_time.strftime(\"%I:%M:%S %p\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Sample Recommendations\n", "\n", "After the campaign is active you are ready to get recommendations. First we need to select a random user from the collection. Then we will create a few helper functions for getting movie information to show for recommendations instead of just IDs." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Getting a random user:\n", "user_id, item_id, _ = data.sample().values[0]\n", "print(\"USER: {}\".format(user_id))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "items = pd.read_csv('./ml-100k/u.item', sep='|', usecols=[0,1], encoding='latin-1', names=['ITEM_ID', 'TITLE'], index_col='ITEM_ID')\n", "\n", "def get_movie_title(movie_id):\n", " \"\"\"\n", " Takes in an ID, returns a title\n", " \"\"\"\n", " movie_id = int(movie_id)-1\n", " return items.iloc[movie_id]['TITLE']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Call GetRecommendations\n", "\n", "The code below will get recommendations for the random user selected, it then places the recommendations in a dataframe and renders it." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_recommendations_response = personalize_runtime.get_recommendations(\n", " campaignArn = campaign_arn,\n", " userId = str(user_id),\n", ")\n", "# Update DF rendering\n", "pd.set_option('display.max_rows', 30)\n", "\n", "print(\"Recommendations for user: \", user_id)\n", "\n", "item_list = get_recommendations_response['itemList']\n", "\n", "recommendation_list = []\n", "\n", "for item in item_list:\n", " title = get_movie_title(item['itemId'])\n", " recommendation_list.append(title)\n", " \n", "recommendations_df = pd.DataFrame(recommendation_list, columns = ['Original Recommendations'])\n", "recommendations_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating an Event Tracker\n", "\n", "Before your recommendation system can respond to real time events you will need an event tracker, the code below will generate one and can be used going forward with this lab. Feel free to name it something more clever." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = personalize.create_event_tracker(\n", " name='MovieClickTrackerRI',\n", " datasetGroupArn=dataset_group_arn\n", ")\n", "print(response['eventTrackerArn'])\n", "print(response['trackingId'])\n", "TRACKING_ID = response['trackingId']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "event_tracker_arn = response['eventTrackerArn']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simulating User Behavior\n", "\n", "The lines below provide a code sample that simulates a user interacting with a particular item, you will then get recommendations that differ from those when you started." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session_dict = {}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def send_movie_click(USER_ID, ITEM_ID):\n", " \"\"\"\n", " Simulates a click as an envent\n", " to send an event to Amazon Personalize's Event Tracker\n", " \"\"\"\n", " # Configure Session\n", " try:\n", " session_ID = session_dict[USER_ID]\n", " except:\n", " session_dict[USER_ID] = str(uuid.uuid1())\n", " session_ID = session_dict[USER_ID]\n", " \n", " # Configure Properties:\n", " event = {\n", " \"itemId\": str(ITEM_ID),\n", " }\n", " event_json = json.dumps(event)\n", " \n", " # Make Call\n", " personalize_events.put_events(\n", " trackingId = TRACKING_ID,\n", " userId= USER_ID,\n", " sessionId = session_ID,\n", " eventList = [{\n", " 'sentAt': int(time.time()),\n", " 'eventType': 'EVENT_TYPE',\n", " 'properties': event_json\n", " }]\n", " )\n", "\n", "def get_new_recommendations_df(recommendations_df, movie_ID):\n", " # Get the title of the movie for the header of the column\n", " movie_title_clicked = get_movie_title(movie_to_click)\n", " # Interact with the movie\n", " send_movie_click(USER_ID=str(user_id), ITEM_ID=movie_to_click)\n", " # Sleep for 2 seconds\n", " time.sleep(2)\n", " # Get new recommendations\n", " get_recommendations_response = personalize_runtime.get_recommendations(\n", " campaignArn = campaign_arn,\n", " userId = str(user_id),\n", " )\n", " # Build a new dataframe of recommendations\n", " item_list = get_recommendations_response['itemList']\n", " recommendation_list = []\n", " for item in item_list:\n", " title = get_movie_title(item['itemId'])\n", " recommendation_list.append(title)\n", " new_rec_DF = pd.DataFrame(recommendation_list, columns = [movie_title_clicked])\n", " # Add this dataframe to the old one\n", " recommendations_df = recommendations_df.join(new_rec_DF)\n", " return recommendations_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The 3 cells below will simulate adding a movie ot the dataset and then rendering the results after each interaction. The first column is the original recommendations, the subsequent columns are for each movie that your user has now interacted with. The column header is the name of the movie that the user interacted with, and you can then see how the recommendations are altered." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movie_to_click = 180\n", "recommendations_df = get_new_recommendations_df(recommendations_df, movie_to_click)\n", "recommendations_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movie_to_click = 210\n", "recommendations_df = get_new_recommendations_df(recommendations_df, movie_to_click)\n", "recommendations_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "movie_to_click = 415\n", "recommendations_df = get_new_recommendations_df(recommendations_df, movie_to_click)\n", "recommendations_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "In this workshop you successfully created a dataset group, a dataset based on interactions, trained a recommendation model based on the data, deployed a campaign to generate recommendations, evaluated the initial recommendations, leveraged real time event tracking for even better recommendations, and below is bonus material to show how to export recommendations in \n", "\n", "This content will stay public on GitHub and can be used within your organization in order to build custom recommendation models, take a look at the various other notebooks in the future.\n", "\n", "Thank you, and good luck in your next Machine Learning project!\n", "\n", "BONUS MATERIAL BELOW:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Batch Recommendations\n", "\n", "So far you have seen how to generate recommendations via an API call and interact with event trackers. That works well for many applications but you may find yourself wanting to cache all recommendations for users locally or even to study the recommendations for new ideas. To support that Amazon Personalize supports batch exporting of your recommendations to a file as well. The cells below will walk you through sending recommendations to a file in S3 and then will show its contents. The file can be downloaded from S3 to your local computer as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First you will need to create a JSON file of user IDS" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "user_IDs = ['561', '233', '579']\n", "\n", "json_input_filename = \"json_input.json\"\n", "with open(json_input_filename, 'w') as json_input:\n", " for user_id in user_IDs:\n", " json_input.write('{\"userId\": \"' + user_id + '\"}\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now upload this file to S3:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.Session().resource('s3').Bucket(bucket_name).Object(json_input_filename).upload_file(json_input_filename)\n", "s3_input_path = \"s3://\" + bucket_name + \"/\" + json_input_filename\n", "print(s3_input_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the input file is available in S3, you need to define where the output will go, and create a batch inference job." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_output_path = \"s3://\" + bucket_name + \"/\"\n", "print(s3_output_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cell below creates the batch job." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "personalize_rec = boto3.client(service_name='personalize')\n", "batchInferenceJobArn = personalize_rec.create_batch_inference_job (\n", " solutionVersionArn = solution_version_arn,\n", " jobName = \"RI-Workshop-Batch-Inference-Job\",\n", " roleArn = role_arn,\n", " jobInput = \n", " {\"s3DataSource\": {\"path\": s3_input_path}},\n", " jobOutput = \n", " {\"s3DataDestination\":{\"path\": s3_output_path}}\n", ")\n", "batchInferenceJobArn = batchInferenceJobArn['batchInferenceJobArn']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell will poll until the export has completed. This is the final waiting loop of the workshop! This will again poll until it reaches an `ACTIVE` state, you can continue on once this is reached." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "current_time = datetime.datetime.now()\n", "print(\"Import Started on: \", current_time.strftime(\"%I:%M:%S %p\"))\n", "\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_inference_job_response = personalize_rec.describe_batch_inference_job(\n", " batchInferenceJobArn = batchInferenceJobArn\n", " )\n", " status = describe_dataset_inference_job_response[\"batchInferenceJob\"]['status']\n", " print(\"DatasetInferenceJob: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)\n", " \n", "current_time = datetime.datetime.now()\n", "print(\"Import Completed on: \", current_time.strftime(\"%I:%M:%S %p\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the data successfully exported, grab the file and parse it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client('s3')\n", "export_name = json_input_filename + \".out\"\n", "s3.download_file(bucket_name, export_name, export_name)\n", "\n", "# Update DF rendering\n", "pd.set_option('display.max_rows', 30)\n", "with open(export_name) as json_file:\n", " # Get the first line and parse it\n", " line = json.loads(json_file.readline())\n", " # Do the same for the other lines\n", " while line:\n", " # extract the user ID \n", " col_header = \"User: \" + line['input']['userId']\n", " # Create a list for all the movies\n", " recommendation_list = []\n", " # Add all the entries\n", " for item in line['output']['recommendedItems']:\n", " title = get_movie_title(item)\n", " recommendation_list.append(title)\n", " if 'bulk_recommendations_df' in locals():\n", " new_rec_DF = pd.DataFrame(recommendation_list, columns = [col_header])\n", " bulk_recommendations_df = bulk_recommendations_df.join(new_rec_DF)\n", " else:\n", " bulk_recommendations_df = pd.DataFrame(recommendation_list, columns=[col_header])\n", " try:\n", " line = json.loads(json_file.readline())\n", " except:\n", " line = None\n", "bulk_recommendations_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell above you can see the various recommendations for the users provided, in a real scenario the list of users could be your entire userbase allowing you to quickly reference and compare results between them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extra Bonus: Cleanup\n", "\n", "The cells below are totally optional if you are using Event Engine as your account is deleted after the workshop. 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the campaign:\n", "personalize.delete_campaign(campaignArn=campaign_arn)\n", "time.sleep(60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the solution\n", "personalize.delete_solution(solutionArn=solution_arn)\n", "time.sleep(60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the event tracker\n", "personalize.delete_event_tracker(eventTrackerArn=event_tracker_arn)\n", "time.sleep(60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the interaction dataset\n", "personalize.delete_dataset(datasetArn=dataset_arn)\n", "time.sleep(60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the event dataset created by the event tracker\n", "event_interactions_dataset_arn = dataset_arn.replace(\"INTERACTIONS\", \"EVENT_INTERACTIONS\")\n", "personalize.delete_dataset(datasetArn=event_interactions_dataset_arn)\n", "time.sleep(60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the schema\n", "personalize.delete_schema(schemaArn=schema_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the Dataset Group\n", "personalize.delete_dataset_group(datasetGroupArn=dataset_group_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Empty the S3 Bucket\n", "s3 = boto3.resource('s3')\n", "bucket = s3.Bucket(bucket_name)\n", "bucket.objects.all().delete()\n", "time.sleep(60)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Delete the S3 Bucket\n", "bucket.delete()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Detach the IAM policies and delete the role\n", "iam = boto3.client(\"iam\")\n", "iam.detach_role_policy(PolicyArn=\"arn:aws:iam::aws:policy/AmazonS3FullAccess\", RoleName=role_name)\n", "iam.detach_role_policy(PolicyArn=\"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess\", RoleName=role_name)\n", "\n", "iam.delete_role(RoleName=role_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Last Step\n", "\n", "After cleaning up all of the resources you can close this window and go back to the GitHub page you started on. At the bottom of the Readme file are steps to delete the CloudFormation stack you created earlier. Once that is done you are 100% done with the workshop.\n", "\n", "Congratulations!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 4 }