{ "cells": [ { "cell_type": "markdown", "id": "d1ddf24d", "metadata": {}, "source": [ "# Building Your First Video On Demand Recommender\n", "\n", "This notebook will walk you through the steps to build a Domain dataset group and a\n", "recommender that returns movie recommendations based on data collected from the movielens data set. The goal is to recommend movies that are relevant based on a particular user.\n", "\n", "The data comes from the [MovieLens project](https://grouplens.org/datasets/movielens/). Follow the link to learn more about the data and potential uses.\n", "\n", "# How to Use the Notebook\n", "\n", "The code is broken up into cells like the one below. There's a triangular Run button at the top of this page that you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.\n", "\n", "As a cell is executing you'll notice a line to the side showcase an `*` while the cell is running or it will update to a number to indicate the last cell that completed executing after it has finished exectuting all the code within a cell.\n", "\n", "Simply follow the instructions below and execute the cells to get started with Amazon Personalize using case optimized recommenders." ] }, { "cell_type": "markdown", "id": "de3a07f7", "metadata": {}, "source": [ "## Imports\n", "Python ships with a broad collection of libraries and we need to import those as well as the ones installed to help us like [boto3](https://aws.amazon.com/sdk-for-python/) (AWS SDK for python) and [Pandas](https://pandas.pydata.org/)/[Numpy](https://numpy.org/) which are core data science tools." ] }, { "cell_type": "code", "execution_count": null, "id": "a5e0041c", "metadata": {}, "outputs": [], "source": [ "# Imports\n", "import boto3\n", "import json\n", "import numpy as np\n", "import pandas as pd\n", "import time\n", "import datetime" ] }, { "cell_type": "markdown", "id": "375d569e", "metadata": {}, "source": [ "Next you will want to validate that your environment can communicate successfully with Amazon Personalize, the lines below do just that." ] }, { "cell_type": "code", "execution_count": null, "id": "9ba2821e", "metadata": {}, "outputs": [], "source": [ "# Configure the SDK to Personalize:\n", "personalize = boto3.client('personalize')\n", "personalize_runtime = boto3.client('personalize-runtime')" ] }, { "cell_type": "markdown", "id": "76c6d93d", "metadata": {}, "source": [ "## Configure the data\n", "Data is imported into Amazon Personalize through Amazon S3, below we will specify a bucket that you have created within AWS for the purposes of this exercise.\n", "Below you will update the `bucket` variable to instead be set to the value that you created earlier in the CloudFormation steps, this should be in a text file from your earlier work. the `filename` does not need to be changed.\n", "### Specify a Bucket and Data Output Location\n", "Update the `bucket` name to a unique name." 
] }, { "cell_type": "code", "execution_count": null, "id": "372982ba", "metadata": {}, "outputs": [], "source": [ "filename = \"movie-lens-100k.csv\"" ] }, { "cell_type": "markdown", "id": "fa2418ac", "metadata": {}, "source": [ "## Download, Prepare, and Upload Training Data\n", "At present you do not have the MovieLens data loaded locally yet for examination, execute the lines below to download the latest copy and to examine it quickly.\n", "\n", "### Download and Explore the Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "55adbc5c", "metadata": {}, "outputs": [], "source": [ "!wget -N https://files.grouplens.org/datasets/movielens/ml-latest-small.zip\n", "!unzip -o ml-latest-small.zip" ] }, { "cell_type": "code", "execution_count": null, "id": "21345840", "metadata": {}, "outputs": [], "source": [ "!ls ml-latest-small" ] }, { "cell_type": "code", "execution_count": null, "id": "4a1f9fdc", "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pygmentize ml-latest-small/README.txt" ] }, { "cell_type": "code", "execution_count": null, "id": "1dbfefea", "metadata": {}, "outputs": [], "source": [ "interactions_data = pd.read_csv('./ml-latest-small/ratings.csv')\n", "pd.set_option('display.max_rows', 5)\n", "interactions_data" ] }, { "cell_type": "code", "execution_count": null, "id": "4e521323", "metadata": {}, "outputs": [], "source": [ "interactions_data.info()" ] }, { "cell_type": "markdown", "id": "b8ffb037", "metadata": {}, "source": [ "## Prepare the Data\n", "\n", "### Interactions Data\n", "As you can see the data contains a UserID, ItemID, Rating, and Timestamp.\n", "\n", "We are now going to remove the items with low rankings, and remove the Rating column before we build our model.\n", "\n", "We are also adding the column EVENT_TYPE to all interactions." ] }, { "cell_type": "code", "execution_count": null, "id": "2c5eca62", "metadata": {}, "outputs": [], "source": [ "interactions_data = interactions_data[interactions_data['rating'] > 3] # Keep only movies rated higher than 3 out of 5.\n", "interactions_data = interactions_data[['userId', 'movieId', 'timestamp']]\n", "interactions_data.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', \n", " 'timestamp':'TIMESTAMP'}, inplace = True)\n", "interactions_data['EVENT_TYPE']='watch' #Adding an EVENT_TYPE column that has the event type \"watched\" for all movies\n", "interactions_data.head()" ] }, { "cell_type": "markdown", "id": "6daf370c", "metadata": {}, "source": [ "### Item Metadata\n", "\n", "Open the item data file and take a look at the first rows." ] }, { "cell_type": "code", "execution_count": null, "id": "95981207", "metadata": {}, "outputs": [], "source": [ "items_data = pd.read_csv('./ml-latest-small/movies.csv')\n", "items_data.head(5)" ] }, { "cell_type": "code", "execution_count": null, "id": "69cfb3b5", "metadata": {}, "outputs": [], "source": [ "items_data.info()" ] }, { "cell_type": "code", "execution_count": null, "id": "965dae72", "metadata": {}, "outputs": [], "source": [ "items_data['year'] = items_data['title'].str.extract('.*\\((.*)\\).*',expand = False)\n", "items_data.head(5)" ] }, { "cell_type": "markdown", "id": "692f19df", "metadata": {}, "source": [ "Selecting a modern date as the creation timestamp for this example because the actual creation timestamp is unknown. In your use-case, please provide the appropriate creation timestamp." 
] }, { "cell_type": "code", "execution_count": null, "id": "9e248410", "metadata": {}, "outputs": [], "source": [ "ts= datetime.datetime(2022, 1, 1, 0, 0).strftime('%s')\n", "print(ts)" ] }, { "cell_type": "code", "execution_count": null, "id": "6f13e2a1", "metadata": {}, "outputs": [], "source": [ "items_data[\"CREATION_TIMESTAMP\"] = ts\n", "items_data" ] }, { "cell_type": "code", "execution_count": null, "id": "a931e102", "metadata": {}, "outputs": [], "source": [ "# removing the title\n", "items_data.drop(columns=\"title\", inplace = True)\n", "\n", "# renaming the columns to match schema\n", "items_data.rename(columns = { 'movieId':'ITEM_ID', 'genres':'GENRES',\n", " 'year':'YEAR'}, inplace = True)\n", "items_data" ] }, { "cell_type": "markdown", "id": "a5516355", "metadata": {}, "source": [ "# User Metadata\n", "\n", "The dataset doe not have any user metadata so we will create an fake metadata field." ] }, { "cell_type": "code", "execution_count": null, "id": "88c08e4f", "metadata": {}, "outputs": [], "source": [ "# get user ids from the interaction dataset\n", "\n", "user_ids = interactions_data['USER_ID'].unique()\n", "user_data = pd.DataFrame()\n", "user_data[\"USER_ID\"]=user_ids\n", "user_data" ] }, { "cell_type": "markdown", "id": "57b0d67f", "metadata": {}, "source": [ "## Adding Metadata\n", "The current dataset does not contain additiona user information. For this example, we'll randomly assign a gender to the users with equal probablity of male and female." ] }, { "cell_type": "code", "execution_count": null, "id": "7e93c22f", "metadata": {}, "outputs": [], "source": [ "possible_genders = ['female', 'male']\n", "random = np.random.choice(possible_genders, len(user_data.index), p=[0.5, 0.5])\n", "user_data[\"GENDER\"] = random\n", "user_data" ] }, { "cell_type": "markdown", "id": "74b4e6ed", "metadata": {}, "source": [ "## Configure an S3 bucket and an IAM role\n", "\n", "So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS instance attached to instance running this Jupyter notebook. However, Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing that bucket. Let's set all of that up.\n", "\n", "The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we have been creating so far. Simply define the region as a string below." 
] }, { "cell_type": "code", "execution_count": null, "id": "e07dc1ef", "metadata": {}, "outputs": [], "source": [ "# Sets the same region as current Amazon SageMaker Notebook\n", "with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n", " data = json.load(notebook_info)\n", " resource_arn = data['ResourceArn']\n", " region = resource_arn.split(':')[3]\n", "print('region:', region)\n", "\n", "# Or you can specify the region where your bucket and model will be domiciled this should be the same region as the Amazon Personalize resources\n", "# region = \"us-east-1\"\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5267ef25", "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client('s3')\n", "account_id = boto3.client('sts').get_caller_identity().get('Account')\n", "bucket_name = account_id + \"-\" + region + \"-\" + \"personalizemanagedvod\"\n", "print('bucket_name:', bucket_name)\n", "\n", "try:\n", " if region == \"us-east-1\":\n", " s3.create_bucket(Bucket=bucket_name)\n", " else:\n", " s3.create_bucket(\n", " Bucket=bucket_name,\n", " CreateBucketConfiguration={'LocationConstraint': region}\n", " )\n", "except s3.exceptions.BucketAlreadyOwnedByYou:\n", " print(\"Bucket already exists. Using bucket\", bucket_name)" ] }, { "cell_type": "markdown", "id": "0fb30175", "metadata": {}, "source": [ "### Upload data to S3\n", "Now that your Amazon S3 bucket has been created, upload the CSV file of our user-item-interaction data." ] }, { "cell_type": "code", "execution_count": null, "id": "1dfbe6a3", "metadata": {}, "outputs": [], "source": [ "interactions_filename = \"interactions.csv\"\n", "interactions_data.to_csv(interactions_filename, index=False)\n", "boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_filename)\n", "\n", "items_filename = \"items.csv\"\n", "items_data.to_csv(items_filename, index=False)\n", "boto3.Session().resource('s3').Bucket(bucket_name).Object(items_filename).upload_file(items_filename)\n", "\n", "user_filename = \"users.csv\"\n", "user_data.to_csv(user_filename, index=False)\n", "boto3.Session().resource('s3').Bucket(bucket_name).Object(user_filename).upload_file(user_filename)" ] }, { "cell_type": "markdown", "id": "16be37c5", "metadata": {}, "source": [ "## Set the S3 bucket policy\n", "Amazon Personalize needs to be able to read the contents of your S3 bucket. So add a bucket policy which allows that.\n", "\n", "Note: Make sure the role you are using to run the code in this notebook has the necessary permissions to modify the S3 bucket policy." 
] }, { "cell_type": "code", "execution_count": null, "id": "8b793feb", "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client(\"s3\")\n", "policy = {\n", " \"Version\": \"2012-10-17\",\n", " \"Id\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Statement\": [\n", " {\n", " \"Sid\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"personalize.amazonaws.com\"\n", " },\n", " \"Action\": [\n", " \"s3:GetObject\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::{}\".format(bucket_name),\n", " \"arn:aws:s3:::{}/*\".format(bucket_name)\n", " ]\n", " }\n", " ]\n", "}\n", "\n", "s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))" ] }, { "cell_type": "markdown", "id": "fc6a2505", "metadata": {}, "source": [ "## Create and Wait for Dataset Group\n", "The largest grouping in Personalize is a Dataset Group, this will isolate your data, event trackers, solutions, recommenders and campaigns. Grouping things together that share a common collection of data. Feel free to alter the name below if you'd like.\n", "\n", "### Create Dataset Group" ] }, { "cell_type": "code", "execution_count": null, "id": "b3a4535f", "metadata": { "scrolled": true }, "outputs": [], "source": [ "response = personalize.create_dataset_group(\n", " name='personalize-video-on-demand-ds-group',\n", " domain='VIDEO_ON_DEMAND'\n", ")\n", "\n", "dataset_group_arn = response['datasetGroupArn']\n", "print(json.dumps(response, indent=2))" ] }, { "cell_type": "markdown", "id": "b9a32cae", "metadata": {}, "source": [ "Wait for Dataset Group to Have ACTIVE Status\n", "Before we can use the Dataset Group in any items below it must be active, execute the cell below and wait for it to show active." ] }, { "cell_type": "code", "execution_count": null, "id": "c4c75cc6", "metadata": {}, "outputs": [], "source": [ "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_group_response = personalize.describe_dataset_group(\n", " datasetGroupArn = dataset_group_arn\n", " )\n", " status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n", " print(\"DatasetGroup: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)" ] }, { "cell_type": "markdown", "id": "d4de0c8f", "metadata": {}, "source": [ "## Create Interactions Schema\n", "A core component of how Personalize understands your data comes from the Schema that is defined below. This configuration tells the service how to digest the data provided via your CSV file. Note the columns and types align to what was in the file you created above." 
] }, { "cell_type": "code", "execution_count": null, "id": "9243b429", "metadata": {}, "outputs": [], "source": [ "schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Interactions\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"USER_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"EVENT_TYPE\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"TIMESTAMP\",\n", " \"type\": \"long\"\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", "\n", "create_interactions_schema_response = personalize.create_schema(\n", " name='personalize-demo-interactions-schema',\n", " schema=json.dumps(schema),\n", " domain='VIDEO_ON_DEMAND'\n", ")\n", "\n", "interactions_schema_arn = create_interactions_schema_response['schemaArn']\n", "print(json.dumps(create_interactions_schema_response, indent=2))" ] }, { "cell_type": "markdown", "id": "5bcf9ba6", "metadata": {}, "source": [ "# Create Items (movies) schema" ] }, { "cell_type": "code", "execution_count": null, "id": "1bf70e1a", "metadata": {}, "outputs": [], "source": [ "schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Items\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"GENRES\",\n", " \"type\": [\n", " \"string\"\n", " ],\n", " \"categorical\": True\n", " },\n", " {\n", " \"name\": \"YEAR\",\n", " \"type\": [\n", " \"string\"\n", " ],\n", " \"categorical\": True\n", " }, \n", " {\n", " \"name\": \"CREATION_TIMESTAMP\",\n", " \"type\": \"long\"\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", "create_items_schema_response = personalize.create_schema(\n", " name='personalize-demo-items-schema',\n", " schema=json.dumps(schema),\n", " domain='VIDEO_ON_DEMAND'\n", ")\n", "\n", "items_schema_arn = create_items_schema_response['schemaArn']\n", "print(json.dumps(create_items_schema_response, indent=2))" ] }, { "cell_type": "markdown", "id": "1f9b0129", "metadata": {}, "source": [ "# Create Users schema" ] }, { "cell_type": "code", "execution_count": null, "id": "2932e5e6", "metadata": {}, "outputs": [], "source": [ "schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Users\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"USER_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"GENDER\",\n", " \"type\": \"string\",\n", " \"categorical\": True\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", "create_users_schema_response = personalize.create_schema(\n", " name='personalize-demo-users-schema',\n", " schema=json.dumps(schema),\n", " domain='VIDEO_ON_DEMAND'\n", ")\n", "\n", "users_schema_arn = create_users_schema_response['schemaArn']\n", "print(json.dumps(create_users_schema_response, indent=2))" ] }, { "cell_type": "markdown", "id": "16efc901", "metadata": {}, "source": [ "## Create Datasets\n", "After the group, the next thing to create is the actual datasets." 
] }, { "cell_type": "markdown", "id": "7e648c0e", "metadata": {}, "source": [ "### Create Interactions Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "af2e8591", "metadata": {}, "outputs": [], "source": [ "dataset_type = \"INTERACTIONS\"\n", "\n", "create_dataset_response = personalize.create_dataset(\n", " name = \"personalize-demo-interactions\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_arn,\n", " schemaArn = interactions_schema_arn\n", ")\n", "\n", "interactions_dataset_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "id": "f03b5ed3", "metadata": {}, "source": [ "### Create Items Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "9c7deb20", "metadata": {}, "outputs": [], "source": [ "dataset_type = \"ITEMS\"\n", "create_dataset_response = personalize.create_dataset(\n", " name = \"personalize-demo-items\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_arn,\n", " schemaArn = items_schema_arn\n", ")\n", "\n", "items_dataset_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "id": "11b3a1cd", "metadata": {}, "source": [ "### Create Users Dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "3e76395c", "metadata": {}, "outputs": [], "source": [ "dataset_type = \"USERS\"\n", "create_dataset_response = personalize.create_dataset(\n", " name = \"personalize-demo-users\",\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_arn,\n", " schemaArn = users_schema_arn\n", ")\n", "\n", "users_dataset_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "id": "1eed25ad", "metadata": {}, "source": [ "## Create Personalize Role\n", "Also Amazon Personalize needs the ability to assume Roles in AWS in order to have the permissions to execute certain tasks, the lines below grant that.\n", "\n", "Note: Make sure the role you are using to run the code in this notebook has the necessary permissions to create a role." 
] }, { "cell_type": "code", "execution_count": null, "id": "0fcb13d4", "metadata": {}, "outputs": [], "source": [ "iam = boto3.client(\"iam\")\n", "\n", "role_name = \"PersonalizeRoleVODDemoRecommender\"\n", "assume_role_policy_document = {\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"personalize.amazonaws.com\"\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " }\n", " ]\n", "}\n", "\n", "create_role_response = iam.create_role(\n", " RoleName = role_name,\n", " AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)\n", ")\n", "\n", "# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes \"personalize\" or \"Personalize\" \n", "# if you would like to use a bucket with a different name, please consider creating and attaching a new policy\n", "# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role\n", "policy_arn = \"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess\"\n", "iam.attach_role_policy(\n", " RoleName = role_name,\n", " PolicyArn = policy_arn\n", ")\n", "\n", "# Now add S3 support\n", "iam.attach_role_policy(\n", " PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',\n", " RoleName=role_name\n", ")\n", "time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate\n", "\n", "role_arn = create_role_response[\"Role\"][\"Arn\"]\n", "print(role_arn)\n" ] }, { "cell_type": "markdown", "id": "05a673bd", "metadata": {}, "source": [ "## Import the data\n", "Earlier you created the DatasetGroup and Dataset to house your information, now you will execute an import job that will load the data from S3 into Amazon Personalize for usage building your model.\n", "### Create Interactions Dataset Import Job" ] }, { "cell_type": "code", "execution_count": null, "id": "a8abd39a", "metadata": {}, "outputs": [], "source": [ "create_interactions_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"personalize-demo-import-interactions\",\n", " datasetArn = interactions_dataset_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, interactions_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_interactions_import_job_arn = create_interactions_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_interactions_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "id": "ec625ec5", "metadata": {}, "source": [ "### Create Items Dataset Import Job" ] }, { "cell_type": "code", "execution_count": null, "id": "c0d80744", "metadata": {}, "outputs": [], "source": [ "create_items_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"personalize-demo-import-items\",\n", " datasetArn = items_dataset_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, items_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_items_import_job_arn = create_items_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_items_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "id": "df1fa4da", "metadata": {}, "source": [ "### Create Users Dataset Import Job" ] }, { "cell_type": "code", "execution_count": null, "id": "ba32c1a1", "metadata": {}, "outputs": [], "source": [ "create_users_dataset_import_job_response = personalize.create_dataset_import_job(\n", " 
jobName = \"personalize-demo-import-users\",\n", " datasetArn = users_dataset_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket_name, user_filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_users_import_job_arn = create_users_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_users_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "id": "eb595d71", "metadata": {}, "source": [ "Wait for Dataset Import Job to Have ACTIVE Status\n", "It can take a while before the import job completes, please wait until you see that it is active below." ] }, { "cell_type": "code", "execution_count": null, "id": "d6044280", "metadata": {}, "outputs": [], "source": [ "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", " datasetImportJobArn = dataset_interactions_import_job_arn\n", " )\n", " status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n", " print(\"Interactions DatasetImportJob: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)\n", " \n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", " datasetImportJobArn = dataset_items_import_job_arn\n", " )\n", " status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n", " print(\"Items DatasetImportJob: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)\n", " \n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", " datasetImportJobArn = dataset_users_import_job_arn\n", " )\n", " status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n", " print(\"Users DatasetImportJob: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)" ] }, { "cell_type": "markdown", "id": "5ac3aff1", "metadata": {}, "source": [ "## Choose a recommender use cases\n", "\n", "Each domain has different use cases. When you create a recommender you create it for a specific use case, and each use case has different requirements for getting recommendations.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "ee7e8012", "metadata": {}, "outputs": [], "source": [ "available_recipes = personalize.list_recipes(domain='VIDEO_ON_DEMAND') # See a list of recommenders for the domain. \n", "if (len(available_recipes[\"recipes\"])==0):\n", " # This is a workaround to get the recipes in case 'available_recipes[\"recipes\"]'does not retrieve them\n", " available_recipes = personalize.list_recipes(domain='VIDEO_ON_DEMAND', nextToken=available_recipes[\"nextToken\"])\n", "display(available_recipes[\"recipes\"])\n", " " ] }, { "cell_type": "markdown", "id": "d13be1bd", "metadata": {}, "source": [ "We are going to create a recommender of the type \"More like X\". This type of recommender offers recommendations for videos that are similar to a video a user watched. With this use case, Amazon Personalize automatically filters videos the user watched based on the userId specified in the `get_recommendations` call. 
For better performance, record Click events in addition to the required Watch events." ] }, { "cell_type": "code", "execution_count": null, "id": "8d4e1aca", "metadata": {}, "outputs": [], "source": [ "create_recommender_response = personalize.create_recommender(\n", "    name = 'more_like_x_demo',\n", "    recipeArn = 'arn:aws:personalize:::recipe/aws-vod-more-like-x',\n", "    datasetGroupArn = dataset_group_arn\n", ")\n", "recommender_more_like_x_arn = create_recommender_response[\"recommenderArn\"]\n", "print(json.dumps(create_recommender_response, indent=2))" ] }, { "cell_type": "markdown", "id": "d14fdf4c", "metadata": {}, "source": [ "We are going to create a second recommender of the type \"Top picks for you\". This type of recommender offers personalized streaming content recommendations for a user that you specify. With this use case, Amazon Personalize automatically filters videos the user watched based on the userId that you specify and `Watch` events.\n", "\n", "[More use cases per domain](https://docs.aws.amazon.com/personalize/latest/dg/domain-use-cases.html)" ] }, { "cell_type": "code", "execution_count": null, "id": "11f272f1", "metadata": {}, "outputs": [], "source": [ "create_recommender_response = personalize.create_recommender(\n", "    name = 'top_picks_for_you_demo',\n", "    recipeArn = 'arn:aws:personalize:::recipe/aws-vod-top-picks',\n", "    datasetGroupArn = dataset_group_arn\n", ")\n", "recommender_top_picks_arn = create_recommender_response[\"recommenderArn\"]\n", "print(json.dumps(create_recommender_response, indent=2))" ] }, { "cell_type": "markdown", "id": "06ab41e9", "metadata": {}, "source": [ "We wait until the recommenders have finished creating and have status `ACTIVE`, checking their status periodically." ] }, { "cell_type": "code", "execution_count": null, "id": "d8d2728c", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "max_time = time.time() + 10*60*60 # 10 hours\n", "\n", "recommender_arns = {\n", "    \"More Like X\": recommender_more_like_x_arn,\n", "    \"Top Picks for You\": recommender_top_picks_arn,\n", "}\n", "\n", "for recommender_name, recommender_arn in recommender_arns.items():\n", "    while time.time() < max_time:\n", "        version_response = personalize.describe_recommender(\n", "            recommenderArn = recommender_arn\n", "        )\n", "        status = version_response[\"recommender\"][\"status\"]\n", "\n", "        if status == \"ACTIVE\":\n", "            print(\"Build succeeded for {}\".format(recommender_arn))\n", "            break\n", "        elif status == \"CREATE FAILED\":\n", "            print(\"Build failed for {}\".format(recommender_arn))\n", "            break\n", "\n", "        print(\"The {} Recommender build is still in progress\".format(recommender_name))\n", "        time.sleep(60)" ] }, { "cell_type": "markdown", "id": "8b1a1c05", "metadata": {}, "source": [ "# Getting recommendations with a recommender\n", "Now that the recommenders have been trained, let's have a look at the recommendations we can get for our users!"
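] }, { "cell_type": "markdown", "id": "ad0c0013", "metadata": {}, "source": [ "Both recommenders are queried through the same `get_recommendations` API. As a convenience, here is a small sketch (the helper name is ours, not part of the service) that returns just the recommended item IDs:" ] }, { "cell_type": "code", "execution_count": null, "id": "ad0c0014", "metadata": {}, "outputs": [], "source": [ "def get_recommended_item_ids(recommender_arn, user_id, n=10, item_id=None):\n", "    \"\"\"Sketch: fetch up to n recommendations and return the item IDs.\n", "\n", "    item_id is only needed for item-based use cases such as More Like X.\n", "    \"\"\"\n", "    params = dict(recommenderArn=recommender_arn, userId=user_id, numResults=n)\n", "    if item_id is not None:\n", "        params['itemId'] = item_id\n", "    response = personalize_runtime.get_recommendations(**params)\n", "    return [item['itemId'] for item in response['itemList']]"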
] }, { "cell_type": "code", "execution_count": null, "id": "f45d85d3", "metadata": {}, "outputs": [], "source": [ "# reading the original data in order to have a dataframe that has both movie_ids \n", "# and the corresponding titles to make out recommendations easier to read.\n", "items_df = pd.read_csv('./ml-latest-small/movies.csv')\n", "items_df.sample(10)" ] }, { "cell_type": "code", "execution_count": null, "id": "e2fb09a0", "metadata": {}, "outputs": [], "source": [ "def get_movie_by_id(movie_id, movie_df):\n", " \"\"\"\n", " This takes in an movie_id from a recommendation in string format,\n", " converts it to an int, and then does a lookup in a specified\n", " dataframe.\n", " \n", " A really broad try/except clause was added in case anything goes wrong.\n", " \n", " Feel free to add more debugging or filtering here to improve results if\n", " you hit an error.\n", " \"\"\"\n", " try:\n", " return movie_df.loc[movie_df[\"movieId\"]==int(movie_id)]['title'].values[0]\n", " except:\n", " print (movie_id)\n", " return \"Error obtaining title\"" ] }, { "cell_type": "markdown", "id": "eac11c39", "metadata": {}, "source": [ "### Let us get some 'More Like X' recommendations:" ] }, { "cell_type": "code", "execution_count": null, "id": "49d365e2", "metadata": {}, "outputs": [], "source": [ "# First pick a user\n", "test_user_id = \"1\"\n", "\n", "# Select a random item\n", "test_item_id = \"81847\" #Iron Man 59315, Tangled: 81847\n", "\n", "# Get recommendations for the user for this item\n", "get_recommendations_response = personalize_runtime.get_recommendations(\n", " recommenderArn = recommender_more_like_x_arn,\n", " userId = test_user_id,\n", " itemId = test_item_id,\n", " numResults = 20\n", ")\n", "\n", "# Build a new dataframe for the recommendations\n", "item_list = get_recommendations_response['itemList']\n", "recommendation_list = []\n", "for item in item_list:\n", " movie = get_movie_by_id(item['itemId'], items_df)\n", " recommendation_list.append(movie)\n", "\n", "user_recommendations_df = pd.DataFrame(recommendation_list, columns = [get_movie_by_id(test_item_id, items_df)])\n", "\n", "pd.options.display.max_rows = 20\n", "display(user_recommendations_df)" ] }, { "cell_type": "markdown", "id": "75ea273f", "metadata": {}, "source": [ "### Get recommendations from the recommender returning \"Top picks for you\":" ] }, { "cell_type": "markdown", "id": "786c42dc", "metadata": {}, "source": [ "Adding the user's metadata to our sample user, you can use this type of metadata to get insights on your users." 
] }, { "cell_type": "code", "execution_count": null, "id": "4875b839", "metadata": {}, "outputs": [], "source": [ "users_data_df = pd.read_csv('./users.csv')\n", "\n", "def get_gender_by_id(user_id, user_df):\n", " \"\"\"\n", " This takes in a user_id and then does a lookup in a specified\n", " dataframe.\n", " \n", " A really broad try/except clause was added in case anything goes wrong.\n", " \n", " Feel free to add more debugging or filtering here to improve results if\n", " you hit an error.\n", " \"\"\"\n", " return user_df.loc[user_df[\"USER_ID\"]==int(user_id)]['GENDER'].values[0]\n", " try:\n", " return user_df.loc[user_df[\"USER_ID\"]==int(user_id)]['GENDER'].values[0]\n", " except:\n", " print (user_id)\n", " return \"Error obtaining title\"" ] }, { "cell_type": "code", "execution_count": null, "id": "eb5f89d1", "metadata": {}, "outputs": [], "source": [ "# First pick a user\n", "test_user_id = \"111\" # samples users: 55, 75, 76, 111\n", "\n", "# Get recommendations for the user\n", "get_recommendations_response = personalize_runtime.get_recommendations(\n", " recommenderArn = recommender_top_picks_arn,\n", " userId = test_user_id,\n", " numResults = 20\n", ")\n", "\n", "# Build a new dataframe for the recommendations\n", "item_list = get_recommendations_response['itemList']\n", "recommendation_list = []\n", "for item in item_list:\n", " movie = get_movie_by_id(item['itemId'], items_df)\n", " recommendation_list.append(movie)\n", "\n", "column_name = test_user_id+\" (\"+get_gender_by_id(test_user_id, users_data_df)+\")\"\n", "\n", "user_recommendations_df = pd.DataFrame(recommendation_list, columns = [column_name])\n", "\n", "pd.options.display.max_rows =20\n", "display(user_recommendations_df)" ] }, { "cell_type": "markdown", "id": "f8f5d5b0", "metadata": {}, "source": [ "## Review\n", "Using the codes above you have successfully trained a deep learning model to generate movie recommendations based on prior user behavior. You have created two recommenders for two foundational use cases. \n", "Going forward, you can adapt this code to create other recommenders." ] }, { "cell_type": "markdown", "id": "3af0fcaa", "metadata": {}, "source": [ "## Notes for the Next Notebook:\n", "There are a few values you will need for the next notebook, execute the cell below to store them so they can be used in the `Clean_Up_Resources.ipynb` notebook.\n", "\n", "This will overwite any data stored for those variables and set them to the values specified in this notebook. " ] }, { "cell_type": "code", "execution_count": null, "id": "daf39a92", "metadata": {}, "outputs": [], "source": [ "# store for cleanup\n", "%store dataset_group_arn\n", "%store role_name\n", "%store region" ] }, { "cell_type": "markdown", "id": "2929ed88", "metadata": {}, "source": [ "If you have run the `Building_Your_First_Recommender_Ecommerce.ipynb` notebook, please make sure you re-run the previous step in the `Building_Your_First_Recommender_Ecommerce.ipynb` notebook and re-run the `Clean_Up_Resources.ipynb` to remove the resources created in that notebook after you run the `Clean_Up_Resources.ipynb` with the resources created here." 
] }, { "cell_type": "code", "execution_count": null, "id": "da67af45", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }