{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Objective Optimizations\n", "\n", "This notebook demonstrates a use case for Amazon Personalize Objective Optimization. Oftentimes there are other factors that influence the recommendations, and the objective optimization can be used to provide additional data to drive the model.\n", "\n", "This example assumes we are providing recommendations for a video on demand (VOD) streaming subscription service. The service allows users unlimited access to the video catalog, however the titles available in their catalog have different licensing terms depending on the arrangement with the content owner. By providing recommendations that provide weighting to the content with lower licensing costs, the streaming service can lower licensing costs while still keeping customers happy.\n", "\n", "Again, the data is coming from the [MovieLens](https://movielens.org/) project, you can learn more about the data and potential uses by doing a web search during any of the waiting periods in the cells below. You will supplement the MovieLens data with an additional royalty field and generate the data that will provide the royalty costs to the dataset. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to Use the Notebook\n", "\n", "Code is broken up into cells like the one below. There's a triangular `Run` button at the top of this page you can click to execute each cell and move onto the next, or you can press `Shift` + `Enter` while in the cell to execute it and move onto the next one.\n", "\n", "As a cell is executing you'll notice a line to the side showcase an `*` while the cell is running or it will update to a number to indicate the last cell that completed executing after it has finished exectuting all the code within a cell.\n", "\n", "\n", "Simply follow the instructions below and execute the cells to get started with Amazon Personalize Objective Optimization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports \n", "\n", "Python ships with a broad collection of libraries and we need to import those as well as the ones installed to help us like [boto3](https://aws.amazon.com/sdk-for-python/) (AWS SDK for python) and [Pandas](https://pandas.pydata.org/)/[Numpy](https://numpy.org/) which are core data science tools." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Imports\n", "import boto3\n", "import json\n", "import numpy as np\n", "import pandas as pd\n", "import time\n", "from botocore.exceptions import ClientError\n", "\n", "!conda install -y -c conda-forge unzip\n", "\n", "pd.options.mode.chained_assignment = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next you will want to validate that your environment can communicate successfully with Amazon Personalize, the lines below do just that." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Configure the SDK to Personalize:\n", "personalize = boto3.client('personalize')\n", "personalize_runtime = boto3.client('personalize-runtime')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure the data\n", "\n", "Data is imported into Amazon Personalize through Amazon S3, below we will specify a filenames that you use to write the files locally before uploading them to S3.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filename = \"movie-lens-100k.csv\"\n", "items_filename = \"movie-lens-items.csv\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download, Prepare, and Upload Training Data\n", "\n", "At present you do not have the MovieLens data loaded locally yet for examination, execute the lines below to download the latest copy and to examine it quickly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Download and Explore the Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -N http://files.grouplens.org/datasets/movielens/ml-100k.zip\n", "!unzip -o ml-100k.zip\n", "data = pd.read_csv('./ml-100k/u.data', sep='\\t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'])\n", "pd.set_option('display.max_rows', 5)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare and Upload Data\n", "\n", "The code below loads the movie information from the MovieLens download. It provides information on the title, release date, IMDB link, and a list of columns to capture the genres that apply to the title.\n", "\n", "There isn't a royalty field since the workshop is generating a ficticious royalty, and will do that in the next step.\n", "\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "items = pd.read_csv('./ml-100k/u.item', sep='|',encoding='latin-1', names=['ITEM_ID', 'TITLE', 'RELEASE_DATE', 'VIDEO_RELEASE_DATE', 'IMDB_URL', 'MISC', 'ACTION_GENRE', 'ADVENTURE_GENRE', 'ANIMATION_GENRE', 'CHILDRENS_GENRE','COMEDY_GENRE', 'CRIME_GENRE', 'DOCUMENTARY_GENRE','DRAMA_GENRE','FANTASY_GENRE', 'FILMNOIR_GENRE','HORROR_GENRE', 'MUSICAL_GENRE','MYSTERY_GENRE', 'ROMANCE_GENRE', 'SCIFI_GENRE', 'THRILLER_GENRE', 'WAR_GENRE', 'WESTERN_GENRE' ])\n", "items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Add the Royalty Data and Genres\n", "We are assigning a value to the ROYALTY field with a even distribution of the values 0.0, 0.005, 0.01, 0.015, 0.02, 0.025, 0.05, 0.10. This places most of the movies with a relatively low or zero royalty and a small number of movies with a higher royalty. \n", "\n", "Note on the bar chart below the even distribution of titles with each royalty value.\n", "\n", "The MovieLens GENRE field result above is categorical with each column indicating whether the title belongs to the genre, but the items loaded into the personalization can accept arrays. 
The pandas code below pulls the genres together into one pipe-delimited column listing all the genres that apply.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', 10)\n", "\n", "# Cycle item IDs through the eight royalty values for an even distribution\n", "royaltyvalues = [0, 0.005, 0.01, 0.015, 0.02, 0.025, 0.05, 0.10]\n", "items.loc[:, 'ROYALTY'] = items['ITEM_ID'].map(lambda x: royaltyvalues[x % 8])\n", "\n", "# Collapse the one-hot genre columns into a single pipe-delimited GENRE column\n", "items.loc[:, 'GENRE'] = ''\n", "\n", "for col_name in items.columns:\n", "    if col_name.endswith('_GENRE'):\n", "        items.loc[items[col_name] == 1, 'GENRE'] = items['GENRE'] + '|' + col_name[:-6]\n", "\n", "items = items[['ITEM_ID', 'TITLE', 'ROYALTY', 'GENRE']]\n", "items.loc[:, 'GENRE'] = items['GENRE'].str[1:]  # drop the leading '|'\n", "\n", "items.loc[:, 'ROYALTY'].value_counts().plot.bar()\n", "items.head(10)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Adjust the royalty value to a negative number.\n", "\n", "Objective Optimization optimizes toward the highest values of a field. In this case, we want to balance the relevance of the movie against the royalty expenses that will be incurred by the streaming service.\n", "\n", "To boost the movies with the lowest royalties, those movies need the highest numerical values, so we convert the royalties to negative numbers by multiplying the absolute value of the royalty fee by -1.\n", "\n", "Note: The absolute value is used so this cell can be run twice with the same results, which is helpful in a workshop environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "items.loc[:,'ROYALTY'] = -1 * abs(items['ROYALTY'])\n", "items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure an S3 bucket and an IAM role \n", "So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS volume attached to the instance running this Jupyter notebook. However, Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing that bucket. Let's set all of that up.\n", "\n", "Use the metadata stored on the instance underlying this Amazon SageMaker notebook to determine the region it is operating in. If you are using a Jupyter notebook outside of Amazon SageMaker, simply define the region as a string below. The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we will be creating." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:\n", "    notebook_data = json.load(notebook_info)\n", "    resource_arn = notebook_data['ResourceArn']\n", "    region = resource_arn.split(':')[3]\n", "print(region)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazon S3 bucket names are globally unique. To create a unique bucket name, the code below appends a random numeric suffix to the string `personalize-objective-optimization-`. It then creates a bucket with this name in the region discovered in the previous cell."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client('s3')\n", "suffix = str(np.random.uniform())[4:9]\n", "bucket_name = \"personalize-objective-optimization-\" + suffix  # unique bucket name; the suffix is reused to name resources below\n", "print(bucket_name)\n", "if region != \"us-east-1\":\n", "    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})\n", "else:\n", "    s3.create_bucket(Bucket=bucket_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload data to S3\n", "\n", "Now that your Amazon S3 bucket has been created, upload the CSV file of the item data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "items = items[['ITEM_ID', 'TITLE', 'GENRE', 'ROYALTY']]\n", "items_dataset = items[['ITEM_ID', 'ROYALTY', 'GENRE']]  # TITLE is kept locally for display but is not uploaded\n", "\n", "items_dataset.to_csv(items_filename, index=False)\n", "boto3.Session().resource('s3').Bucket(bucket_name).Object(items_filename).upload_file(items_filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set the S3 bucket policy\n", "Amazon Personalize needs to be able to read the contents of your S3 bucket, so add a bucket policy that allows this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "policy = {\n", "    \"Version\": \"2012-10-17\",\n", "    \"Id\": \"PersonalizeS3BucketAccessPolicy\",\n", "    \"Statement\": [\n", "        {\n", "            \"Sid\": \"PersonalizeS3BucketAccessPolicy\",\n", "            \"Effect\": \"Allow\",\n", "            \"Principal\": {\n", "                \"Service\": \"personalize.amazonaws.com\"\n", "            },\n", "            \"Action\": [\n", "                \"s3:*Object\",\n", "                \"s3:ListBucket\"\n", "            ],\n", "            \"Resource\": [\n", "                \"arn:aws:s3:::{}\".format(bucket_name),\n", "                \"arn:aws:s3:::{}/*\".format(bucket_name)\n", "            ]\n", "        }\n", "    ]\n", "}\n", "\n", "s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create the Movie Title Item Schema\n", "We will need two schemas: one for the movie titles, which will be of type Items, and a second that defines the structure of the Interactions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {\n", "    \"type\": \"record\",\n", "    \"name\": \"Items\",\n", "    \"namespace\": \"com.amazonaws.personalize.schema\",\n", "    \"fields\": [\n", "        {\n", "            \"name\": \"ITEM_ID\",\n", "            \"type\": \"string\"\n", "        },\n", "        {\n", "            \"name\": \"ROYALTY\",\n", "            \"type\": \"float\"\n", "        },\n", "        {\n", "            \"name\": \"GENRE\",\n", "            \"type\": [\n", "                \"null\",\n", "                \"string\"\n", "            ],\n", "            \"categorical\": True\n", "        }\n", "    ],\n", "    \"version\": \"1.0\"\n", "}\n", "\n", "create_item_schema_response = personalize.create_schema(\n", "    name = \"personalize-objective-optimization-item-schema\"+suffix,\n", "    schema = json.dumps(schema)\n", ")\n", "\n", "item_schema_arn = create_item_schema_response['schemaArn']\n", "print(json.dumps(create_item_schema_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this point, we will follow the same process as the [first campaign](https://github.com/aws-samples/amazon-personalize-samples/blob/master/getting_started/notebooks/1.Building_Your_First_Campaign.ipynb), loading the interactions. We only include interactions with a rating higher than three, since we don't want to recommend movies viewers will not like."
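] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check on that threshold (a sketch using the `data` frame loaded earlier), the rating distribution shows how many interactions survive the `RATING > 3` filter applied in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count interactions per rating; only ratings of 4 and 5 will be kept below\n", "print(data['RATING'].value_counts().sort_index())\n", "print('fraction kept: {:.1%}'.format((data['RATING'] > 3).mean()))"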
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data[data['RATING'] > 3]  # keep only movies rated higher than 3 out of 5\n", "data = data[['USER_ID', 'ITEM_ID', 'TIMESTAMP']]  # select the columns that match the schema below\n", "data.to_csv(filename, index=False)\n", "boto3.Session().resource('s3').Bucket(bucket_name).Object(filename).upload_file(filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Interaction Schema\n", "\n", "A core component of how Personalize understands your data is the schema defined below. This configuration tells the service how to digest the data provided via your CSV file. Note that the columns and types align with the file you created above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {\n", "    \"type\": \"record\",\n", "    \"name\": \"Interactions\",\n", "    \"namespace\": \"com.amazonaws.personalize.schema\",\n", "    \"fields\": [\n", "        {\n", "            \"name\": \"USER_ID\",\n", "            \"type\": \"string\"\n", "        },\n", "        {\n", "            \"name\": \"ITEM_ID\",\n", "            \"type\": \"string\"\n", "        },\n", "        {\n", "            \"name\": \"TIMESTAMP\",\n", "            \"type\": \"long\"\n", "        }\n", "    ],\n", "    \"version\": \"1.0\"\n", "}\n", "\n", "create_schema_response = personalize.create_schema(\n", "    name = \"personalize-objective-optimization-schema\"+suffix,\n", "    schema = json.dumps(schema)\n", ")\n", "\n", "schema_arn = create_schema_response['schemaArn']\n", "print(json.dumps(create_schema_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Wait for Dataset Group\n", "\n", "The largest grouping in Personalize is a Dataset Group; it isolates your data, event trackers, solutions, and campaigns, grouping together resources that share a common collection of data. Feel free to alter the name below if you'd like." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset Group" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_group_response = personalize.create_dataset_group(\n", "    name = \"personalize-objective-optimization-demo-\"+suffix\n", ")\n", "\n", "dataset_group_arn = create_dataset_group_response['datasetGroupArn']\n", "print(json.dumps(create_dataset_group_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Dataset Group to Have ACTIVE Status\n", "\n", "Before the Dataset Group can be used below, it must be active. Execute the cell below and wait for it to show ACTIVE." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", "    describe_dataset_group_response = personalize.describe_dataset_group(\n", "        datasetGroupArn = dataset_group_arn\n", "    )\n", "    status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n", "    print(\"DatasetGroup: {}\".format(status))\n", "\n", "    if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", "        break\n", "\n", "    time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset\n", "\n", "After the group, the next things to create are the actual datasets; in this example we create one for the interactions data and another for the items data. Execute the cells below to create them."
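] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point the dataset group is ACTIVE. As a quick sketch before creating the datasets, you can confirm the group's name and status with `describe_dataset_group`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confirm the dataset group is ready to accept datasets\n", "dg = personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)['datasetGroup']\n", "print(dg['name'], dg['status'])"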
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_dataset(dataset_type, schema_arn, name):\n", "    create_dataset_response = personalize.create_dataset(\n", "        name = name,\n", "        datasetType = dataset_type,\n", "        datasetGroupArn = dataset_group_arn,\n", "        schemaArn = schema_arn\n", "    )\n", "    dataset_arn = create_dataset_response['datasetArn']\n", "\n", "    # Poll until the dataset is ACTIVE (or creation fails)\n", "    max_time = time.time() + 3*60*60 # 3 hours\n", "    while time.time() < max_time:\n", "        describe_dataset_response = personalize.describe_dataset(\n", "            datasetArn = dataset_arn\n", "        )\n", "        status = describe_dataset_response[\"dataset\"][\"status\"]\n", "        print(\"Dataset: {} {}\".format(name, status))\n", "\n", "        if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", "            break\n", "\n", "        time.sleep(10)\n", "    return dataset_arn\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interaction_dataset_arn = create_dataset(\"INTERACTIONS\", schema_arn, 'personalize-objective-optimization-interactions-'+suffix)\n", "print('interaction_dataset_arn: ' + interaction_dataset_arn)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "item_dataset_arn = create_dataset(\"ITEMS\", item_schema_arn, 'personalize-objective-optimization-items-'+suffix)\n", "print('item_dataset_arn: ' + item_dataset_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Personalize Role\n", "\n", "Amazon Personalize also needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks; the lines below grant that." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iam = boto3.client(\"iam\")\n", "\n", "role_name = \"PersonalizeRoleDemo\"+suffix\n", "assume_role_policy_document = {\n", "    \"Version\": \"2012-10-17\",\n", "    \"Statement\": [\n", "        {\n", "            \"Effect\": \"Allow\",\n", "            \"Principal\": {\n", "                \"Service\": \"personalize.amazonaws.com\"\n", "            },\n", "            \"Action\": \"sts:AssumeRole\"\n", "        }\n", "    ]\n", "}\n", "\n", "create_role_response = iam.create_role(\n", "    RoleName = role_name,\n", "    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)\n", ")\n", "\n", "# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes \"personalize\" or \"Personalize\";\n", "# if you would like to use a bucket with a different name, please consider creating and attaching a new policy\n", "# that provides read access to your bucket, or attaching the AmazonS3ReadOnlyAccess policy to the role\n", "policy_arn = \"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess\"\n", "iam.attach_role_policy(\n", "    RoleName = role_name,\n", "    PolicyArn = policy_arn\n", ")\n", "\n", "# Now add S3 support\n", "iam.attach_role_policy(\n", "    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',\n", "    RoleName=role_name\n", ")\n", "time.sleep(60) # wait for a minute to allow the IAM role policy attachments to propagate\n", "\n", "role_arn = create_role_response[\"Role\"][\"Arn\"]\n", "print(role_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import the data\n", "\n", "Earlier you created the Dataset Group and Datasets to house your information; now you will execute import jobs that load the data from S3 into Amazon Personalize for use in building your model. 
We will run multiple import jobs, so the function below starts an import job and then monitors it until completion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset Import Job" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_dataset_import_job(dataset_arn, dataLocation, name):\n", "    create_dataset_import_job_response = personalize.create_dataset_import_job(\n", "        jobName = name,\n", "        datasetArn = dataset_arn,\n", "        dataSource = {\n", "            \"dataLocation\": dataLocation\n", "        },\n", "        roleArn = role_arn\n", "    )\n", "\n", "    dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']\n", "\n", "    # Poll until the import job is ACTIVE (or creation fails)\n", "    max_time = time.time() + 3*60*60 # 3 hours\n", "    while time.time() < max_time:\n", "        describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", "            datasetImportJobArn = dataset_import_job_arn\n", "        )\n", "        status = describe_dataset_import_job_response[\"datasetImportJob\"]['status']\n", "        print(\"DatasetImportJob: {} {}\".format(name, status))\n", "\n", "        if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", "            break\n", "\n", "        time.sleep(60)\n", "    return dataset_import_job_arn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load the interactions\n", "The import job below will load the interaction dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_import_job_arn = create_dataset_import_job(interaction_dataset_arn, \"s3://{}/{}\".format(bucket_name, filename), \"personalize-objective-optimization-interaction-\"+suffix)\n", "print('dataset_import_job_arn: ' + dataset_import_job_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load the movie titles into the item dataset\n", "\n", "It can take a while for an import job to complete; please wait until you see that it is ACTIVE below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "item_dataset_import_job_arn = create_dataset_import_job(item_dataset_arn, \"s3://{}/{}\".format(bucket_name, items_filename), \"personalize-objective-optimization-item-\"+suffix)\n", "print('item_dataset_import_job_arn: ' + item_dataset_import_job_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the Solution and Version\n", "\n", "In Amazon Personalize a trained model is called a Solution; each Solution can have many versions, each tied to the volume of data available when the model was trained.\n", "\n", "To begin you will select one of the supported recipes; a recipe is an algorithm that has not been trained on your data yet. You will then use it to build your model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select Recipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### User Personalization\n", "The [User-Personalization](https://docs.aws.amazon.com/personalize/latest/dg/native-recipe-new-item-USER_PERSONALIZATION.html) (aws-user-personalization) recipe is optimized for all USER_PERSONALIZATION recommendation scenarios. When recommending items, it uses automatic item exploration.\n", "\n", "With automatic exploration, Amazon Personalize automatically tests different item recommendations, learns from how users interact with these recommended items, and boosts recommendations for items that drive better engagement and conversion. 
This improves item discovery and engagement when you have a fast-changing catalog, or when new items, such as news articles or promotions, are more relevant to users when fresh.\n", "\n", "You can balance how much to explore (recommending items with less interaction data or relevance more frequently) against how much to exploit (basing recommendations on what is already known to be relevant). Amazon Personalize automatically adjusts future recommendations based on implicit user feedback.\n", "\n", "First, select the recipe by specifying its ARN." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recipe_arn = \"arn:aws:personalize:::recipe/aws-user-personalization\" # aws-user-personalization selected for demo purposes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Wait for Solution\n", "\n", "First you will create the solution with the API, then you will create a version. It will take several minutes to train the model and thus create your version of a solution. Once it has started and you see the in-progress notifications, it is a good time to take a break, grab a coffee, etc.\n", "\n", "The function accepts the objective sensitivity setting, which can be OFF, LOW, MEDIUM, or HIGH. This adjusts the weighting that determines how strongly the objective influences the model.\n", "\n", "The function creates the solution and the initial solution version for that solution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_solution(name, objectiveSensitivity):\n", "    create_solution_response = personalize.create_solution(\n", "        name = name,\n", "        datasetGroupArn = dataset_group_arn,\n", "        recipeArn = recipe_arn,\n", "        solutionConfig = {\n", "            \"optimizationObjective\": {\n", "                \"itemAttribute\": \"ROYALTY\",\n", "                \"objectiveSensitivity\": objectiveSensitivity\n", "            }\n", "        }\n", "    )\n", "\n", "    solution_arn = create_solution_response['solutionArn']\n", "    print('solutionArn: ' + solution_arn)\n", "\n", "    create_solution_version_response = personalize.create_solution_version(\n", "        solutionArn = solution_arn\n", "    )\n", "\n", "    solution_version_arn = create_solution_version_response['solutionVersionArn']\n", "    print('solution_version_arn: ' + solution_version_arn)\n", "\n", "    return {\n", "        \"solution_arn\": solution_arn,\n", "        \"solution_version_arn\": solution_version_arn\n", "    }\n", "\n", "def waitForSolutionVersion(solution_version_arn):\n", "    max_time = time.time() + 3*60*60 # 3 hours\n", "    while time.time() < max_time:\n", "        describe_solution_version_response = personalize.describe_solution_version(\n", "            solutionVersionArn = solution_version_arn\n", "        )\n", "        status = describe_solution_version_response[\"solutionVersion\"][\"status\"]\n", "        print(\"SolutionVersion: {} {}\".format(solution_version_arn, status))\n", "\n", "        if status == \"ACTIVE\" or status == \"CREATE FAILED\" or status == \"CREATE STOPPING\":\n", "            break\n", "\n", "        time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create the Solution Versions\n", "Create 3 different solutions: one with objective optimization turned off, one with a LOW setting, and one set to HIGH. This kicks off 3 solutions in parallel; the next step will wait for them to complete."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "high_solution = create_solution('movie-recommendation-low-royalties-'+suffix, 'HIGH')\n", "low_solution = create_solution('movie-recommendation-medium-royalties-'+suffix, 'LOW')\n", "no_objective_optimization_solution = create_solution('movie-recommendation-max-relevance-'+suffix, 'OFF')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Solution Versions to Have ACTIVE Status\n", "\n", "This will take approximately 40-50 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "waitForSolutionVersion(low_solution['solution_version_arn'])\n", "waitForSolutionVersion(no_objective_optimization_solution['solution_version_arn'])\n", "waitForSolutionVersion(high_solution['solution_version_arn'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Get Metrics of Solution Versions\n", "\n", "Now that your solutions and versions exist, you can obtain their metrics to judge performance. These metrics are not particularly strong here because this is a small demo dataset, but with larger, more complex datasets you should see improvements.\n", "\n", "You can also see differences in model quality driven by the impact of the objective optimization." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_solution_metrics(solutions):\n", "    metricdata = {\"name\": []}\n", "\n", "    for key in solutions:\n", "        solution = solutions[key]\n", "        metricdata[\"name\"].append(key)\n", "\n", "        get_solution_metrics_response = personalize.get_solution_metrics(\n", "            solutionVersionArn = solution['solution_version_arn']\n", "        )\n", "\n", "        # Collect each metric into a column keyed by metric name\n", "        for metricname in get_solution_metrics_response['metrics']:\n", "            if metricname not in metricdata:\n", "                metricdata[metricname] = []\n", "            metricdata[metricname].append(get_solution_metrics_response['metrics'][metricname])\n", "\n", "    return pd.DataFrame.from_dict(metricdata)\n", "\n", "metrics = get_solution_metrics({\n", "    \"no-optimization\": no_objective_optimization_solution,\n", "    \"low-optimization\": low_solution,\n", "    \"high-optimization\": high_solution,\n", "})\n", "\n", "metrics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We recommend reading [the documentation](https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html) to understand the metrics, but we have also copied parts of the documentation below for convenience.\n", "\n", "You need to understand the following terms regarding evaluation in Personalize:\n", "\n", "- *Relevant recommendation* refers to a recommendation that matches a value in the testing data for the particular user.\n", "- *Rank* refers to the position of a recommended item in the list of recommendations. 
Position 1 (the top of the list) is presumed to be the most relevant to the user.\n", "- *Query* refers to the internal equivalent of a GetRecommendations call.\n", "\n", "The metrics produced by Personalize are:\n", "\n", "- coverage: The proportion of unique recommended items from all queries out of the total number of unique items in the training data (includes both the Items and Interactions datasets).\n", "- mean_reciprocal_rank_at_25: The [mean of the reciprocal ranks](https://en.wikipedia.org/wiki/Mean_reciprocal_rank) of the first relevant recommendation out of the top 25 recommendations over all queries. This metric is appropriate if you're interested in the single highest-ranked recommendation.\n", "- normalized_discounted_cumulative_gain_at_K: Discounted gain assumes that recommendations lower on a list of recommendations are less relevant than higher recommendations. Therefore, each recommendation is discounted (given a lower weight) by a factor dependent on its position. To produce the [cumulative discounted gain](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) (DCG) at K, each relevant discounted recommendation in the top K recommendations is summed together. The normalized discounted cumulative gain (NDCG) is the DCG divided by the ideal DCG, so that NDCG is between 0 and 1. (The ideal DCG is where the top K recommendations are sorted by relevance.) Amazon Personalize uses a weighting factor of 1/log(1 + position), where the top of the list is position 1. This metric rewards relevant items that appear near the top of the list, because the top of a list usually draws more attention.\n", "- precision_at_K: The number of relevant recommendations out of the top K recommendations divided by K. This metric rewards precise recommendation of the relevant items." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create and Wait for the Campaign\n", "\n", "Now that you have a working solution version you will need to create a campaign to use it with your applications. A campaign is a hosted solution version; an endpoint which you can query for recommendations. Pricing is set by estimating throughput capacity (requests from users for personalization per second). When deploying a campaign, you set a minimum transactions per second (TPS) value (`minProvisionedTPS`). This service, like many within AWS, will automatically scale based on demand, but if latency is critical, you may want to provision ahead for larger demand. For this demo, the minimum throughput threshold is set to 1. For more information, see the [pricing](https://aws.amazon.com/personalize/pricing/) page.\n", "\n", "As mentioned above, the user-personalization recipe used for our solution supports automatic exploration of \"cold\" items. You can control how much exploration is performed when creating your campaign. The `itemExplorationConfig` data type supports `explorationWeight` and `explorationItemAgeCutOff` parameters. Exploration weight determines how frequently recommendations include items with less interactions data or relevance. The closer the value is to 1.0, the more exploration. At zero, no exploration occurs and recommendations are based on current data (relevance). Exploration item age cut-off determines which items are eligible for exploration based on the time since their latest interaction. Provide the maximum item age, in days since the latest interaction, to define the scope of item exploration. The larger the value, the more items are considered during exploration. 
For our campaign below, we'll specify an exploration weight of 0.5." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Campaign" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_campaign(solution, name):\n", " create_campaign_response = personalize.create_campaign(\n", " name = \"personalize-demo-\" + name + '-' + suffix,\n", " solutionVersionArn = solution['solution_version_arn'],\n", " minProvisionedTPS = 1,\n", " campaignConfig = {\n", " \"itemExplorationConfig\": {\n", " \"explorationWeight\": \"0.5\"\n", " }\n", " }\n", " )\n", "\n", " campaign_arn = create_campaign_response['campaignArn']\n", " print('campaign_arn:' + campaign_arn)\n", " return campaign_arn\n", "\n", "def waitForCampaign(solution):\n", " max_time = time.time() + 3*60*60 # 3 hours\n", " while time.time() < max_time:\n", " describe_campaign_response = personalize.describe_campaign(\n", " campaignArn = solution['campaign_arn']\n", " )\n", " status = describe_campaign_response[\"campaign\"][\"status\"]\n", " print(\"Campaign: {} {}\".format(solution['campaign_arn'], status))\n", "\n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", "\n", " time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create 3 Campaigns\n", "Create a campaign for each of the objective optimizations, but keep all the other settings the same to demonstrate the impact of the objective optimization." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "high_solution['campaign_arn'] = create_campaign(high_solution, 'high')\n", "low_solution['campaign_arn'] = create_campaign(low_solution, 'low')\n", "no_objective_optimization_solution['campaign_arn'] = create_campaign(no_objective_optimization_solution, 'max_relevance')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Campaign to Have ACTIVE Status\n", "\n", "This should take about 10 minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "waitForCampaign(high_solution)\n", "waitForCampaign(low_solution)\n", "waitForCampaign(no_objective_optimization_solution)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Sample Recommendations\n", "\n", "After the campaign is active you are ready to get recommendations. First we need to select a random user from the collection. Then we will create a few helper functions for getting movie information to show for recommendations instead of just IDs." 
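] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before building those helpers, a raw call (a sketch that assumes the campaigns above reached ACTIVE status) shows the shape of a `GetRecommendations` response: each entry in `itemList` carries an `itemId`, and the user-personalization recipe also returns a relevance `score`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Peek at the raw response for an arbitrary MovieLens user\n", "response = personalize_runtime.get_recommendations(\n", "    campaignArn = high_solution['campaign_arn'],\n", "    userId = '1'\n", ")\n", "print(json.dumps(response['itemList'][:3], indent=2))"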
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Getting a random user:\n", "user_id, item_id, _ = data.sample().values[0]\n", "print(\"USER: {}\".format(user_id))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_movie_title(movie_id):\n", "    \"\"\"\n", "    Takes in an ID, returns the title with its royalty fee in parentheses\n", "    \"\"\"\n", "    movie_id = int(movie_id)-1\n", "    return items.iloc[movie_id]['TITLE'] + '(' + f'{-1*items.iloc[movie_id][\"ROYALTY\"]:.2f}' + ')'\n", "\n", "def get_movie_royalty(movie_id):\n", "    \"\"\"\n", "    Takes in an ID, returns the (positive) royalty fee for the title\n", "    \"\"\"\n", "    movie_id = int(movie_id)-1\n", "    return -1*items.iloc[movie_id][\"ROYALTY\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Call GetRecommendations\n", "\n", "Using the user that you obtained above, the lines below will get recommendations for you and return the list of movies that are recommended.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_recommendations(solution):\n", "    get_recommendations_response = personalize_runtime.get_recommendations(\n", "        campaignArn = solution['campaign_arn'],\n", "        userId = str(user_id),\n", "    )\n", "    # Update DF rendering\n", "    pd.set_option('display.max_rows', 30)\n", "\n", "    item_list = get_recommendations_response['itemList']\n", "\n", "    recommendation_list = []\n", "    total_royalties = 0.0\n", "\n", "    for item in item_list:\n", "        title = get_movie_title(item['itemId'])\n", "        total_royalties = total_royalties + get_movie_royalty(item['itemId'])\n", "        recommendation_list.append(title)\n", "\n", "    # Append the running total as the final row of the column\n", "    recommendation_list.append('TOTAL ROYALTIES: ' + f'{total_royalties:.2f}')\n", "    return recommendation_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare Recommendations\n", "Create a set of recommendations for the same user to compare the impact of the objective optimization.\n", "\n", "Notice the impact of the royalty value shown in parentheses after the title and year. Highly rated, high-royalty movies that appear with objective optimization off tend to show up lower in the list, if at all, when objective optimization is on.\n", "\n", "Also note the total royalties for all the titles in each set of recommendations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recommendations_df = pd.DataFrame(get_recommendations(no_objective_optimization_solution), columns = ['ObjectiveOff'])\n", "recommendations_df['LowObjective'] = get_recommendations(low_solution)\n", "recommendations_df['HighObjective'] = get_recommendations(high_solution)\n", "recommendations_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "The notebook walked through an example where the movie recommendation engine took into account the royalties the streaming service must pay for a given title. By including this weighting in the recommendation algorithm, the streaming service can provide good recommendations for a user while also minimizing the royalties paid out to content owners.\n", "\n", "Notice in the chart above that with objective optimization off, the royalty fees in parentheses are fairly evenly distributed, as one would expect since they aren't taken into account. In the LowObjective column the values skew lower, and the sum of the royalties is lowest for the highest objective optimization setting."
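] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick numeric summary (a sketch using the `recommendations_df` built above), the last row of each column holds that campaign's royalty total, so the savings can be compared directly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The TOTAL ROYALTIES line was appended as the final row of each column\n", "recommendations_df.tail(1)"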
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup\n", "\n", "Clean up the resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def delete_campaign(campaign_arn):\n", " delete_campaign_result = personalize.delete_campaign(campaignArn=campaign_arn )\n", " \n", "\n", "def wait_for_delete_campaign(campaign_arn):\n", " max_time = time.time() + 3*60*60 # 3 hours\n", " while time.time() < max_time:\n", " try:\n", " describe_campaign_response = personalize.describe_campaign(\n", " campaignArn = campaign_arn\n", " )\n", " status = describe_campaign_response[\"campaign\"][\"status\"]\n", " print(\"campaign: {}\".format(status))\n", "\n", " except ClientError as e:\n", " print(e)\n", " break\n", "\n", " time.sleep(10)\n", " print('campaign ' + campaign_arn + ' deleted')\n", " \n", "def delete_solution(solution_arn):\n", " delete_solution_result = personalize.delete_solution(solutionArn=solution_arn )\n", " \n", " max_time = time.time() + 3*60*60 # 3 hours\n", "\n", "def wait_for_delete_solution(solution_arn):\n", " while time.time() < max_time:\n", " \n", " try:\n", " describe_solution_response = personalize.describe_solution(\n", " solutionArn = solution_arn\n", " )\n", " status = describe_solution_response[\"solution\"][\"status\"]\n", " print(\"Solution: {}\".format(status))\n", "\n", " except ClientError:\n", " break\n", " time.sleep(10)\n", " print('Solution ' + solution_arn + ' deleted')\n", " \n", "def delete_dataset(dataset_arn):\n", " delete_dataset_result = personalize.delete_dataset(datasetArn=dataset_arn )\n", " \n", " max_time = time.time() + 3*60*60 # 3 hours\n", " while time.time() < max_time:\n", " try:\n", " describe_dataset_response = personalize.describe_dataset(\n", " datasetArn = dataset_arn\n", " )\n", " status = describe_dataset_response[\"dataset\"][\"status\"]\n", " print(\"dataset: {}\".format(status))\n", "\n", " except ClientError:\n", " break\n", " time.sleep(10)\n", " print('dataset ' + dataset_arn + ' deleted')\n", " \n", "def delete_schema(schema_arn):\n", " delete_schema_result = personalize.delete_schema(schemaArn=schema_arn )\n", " \n", "\n", " print('schema ' + schema_arn + ' deleted')\n", " \n", "def delete_dataset_group(dataset_group_arn):\n", " delete_dataset_group_result = personalize.delete_dataset_group(datasetGroupArn=dataset_group_arn )\n", " \n", " max_time = time.time() + 3*60*60 # 3 hours\n", " while time.time() < max_time:\n", " try:\n", " describe_dataset_group_response = personalize.describe_dataset_group(\n", " datasetGroupArn = dataset_group_arn\n", " )\n", " status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n", " print(\"dataset_group: {}\".format(status))\n", "\n", " except ClientError:\n", " break\n", " time.sleep(10)\n", " print('dataset_group ' + dataset_group_arn + ' deleted')\n", " \n", "def delete_all(solutions):\n", " for solution in solutions:\n", " delete_campaign(solution['campaign_arn'])\n", " \n", " for solution in solutions:\n", " wait_for_delete_campaign(solution['campaign_arn'])\n", " for solution in solutions:\n", " delete_solution(solution['solution_arn'])\n", " for solution in solutions:\n", " wait_for_delete_solution(solution['solution_arn'])\n", "\n", "\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delete_all([no_objective_optimization_solution, low_solution, high_solution] )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"delete_dataset(item_dataset_arn)\n", "delete_dataset(interaction_dataset_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delete_schema(item_schema_arn)\n", "delete_schema(schema_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "delete_dataset_group(dataset_group_arn)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iam = boto3.client(\"iam\")\n", "\n", "iam.detach_role_policy(RoleName=role_name, PolicyArn='arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess')\n", "iam.detach_role_policy(RoleName=role_name, PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess')\n", "time.sleep(10) # propogation time\n", "\n", "iam.delete_role(RoleName=role_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! aws s3 rm --recursive s3://$bucket_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3.delete_bucket(Bucket=bucket_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }