{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Get the Personalize boto3 Client\n", "\n", "All of the code that we are using in this lab is Python, but any langugage supported by SageMaker could be used. In this initial piece of code we are loading in the library dependencies that we need for the rest of the lab:\n", "\n", "- **boto3** - standard Python SDK that wraps the AWS CLI\n", "- **json** - used to manipulate JSON structures used by our API calls\n", "- **numpy** and **pandas** - standard libraries used by Data Scientists everywhere\n", "- **time** - used for some time manipulation calls" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import boto3\n", "\n", "import json\n", "import numpy as np\n", "import pandas as pd\n", "import time\n", "\n", "personalize = boto3.client(service_name='personalize')\n", "personalize_runtime = boto3.client(service_name='personalize-runtime')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify a Bucket and Data Output Location\n", "\n", "For this demo, we'll choose to upload data to Amazon Personalize directly from an S3 bucket. Hence, you need to create a new bucket - please name your bucket before running this Code cell, overriding what is shown in the code cell, and you need to ensure that the bucket name is globally unique; for this lab we recommend using your name or initials, followed by *-jnb-personalize-lab*, as that is likely to be unique.\n", "\n", "If the bucket already exists - such as if you execute this code cell a second time - then it will not create a new bucket, and will not make any changes to the exsting bucket. Whilst we're here, we also define the name of the file in S3 that will eventually hold our model training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bucket = \"{your-prefix}-summit-personalize-lab\" # replace with the name of your S3 bucket\n", "filename = \"DEMO-movie-lens-100k.csv\"\n", "\n", "s3 = boto3.client('s3')\n", "if boto3.resource('s3').Bucket(bucket).creation_date is None:\n", " s3.create_bucket(ACL = \"private\", Bucket = bucket)\n", " print(\"Creating bucket: {}\".format(bucket))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download, Prepare, and Upload Training Data\n", "\n", "#### Download and Explore the Dataset\n", "\n", "In this step we download the entire Movie Lens data set zip-file and unzip it - it will go in the same location as the physical notebook *.ipynb* file and the *u.item* file that you downloaded earlier. We use the **pandas** library to read in the *u.data* file, which contains all of the movie reviews; this file consists of a movie ID, a user ID, a timestamp and a rating of between 1 and 5, and there are 100,000 unique reviews." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -N http://files.grouplens.org/datasets/movielens/ml-100k.zip\n", "!unzip -o ml-100k.zip\n", "data = pd.read_csv('./ml-100k/u.data', sep='\\t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'])\n", "pd.set_option('display.max_rows', 5)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Prepare and Upload Data\n", "\n", "We don't actually need all of the review data. We would like to recommend movies that a user would actually watch based upon what they and others have liked - we don't want to provide a rating between 1 and 5 for every movie in the data set! 
Hence, we're going to use **pandas** to drop any reviews that are not > 3.6, and once we've done that also drop the review column - we're left with a subset of the original that contain 4- and 5-star ratings, which is basically indicating movies that the user really liked. Of course, we could use all of the data, but this would lead to far greater data import and model training time.\n", "\n", "Additionally, the reviews are quite old - they are from August 1997 to April 1998. Some of the Amazon Personalize recipies react differently depending upon the age of the interactions - for instance, the _Similar Items_ recipe has several hyperparameters around how to handle 'decaying' interactions. In order to make this lab easier, and not have to worry about these hyperparameters, we are shifting all of the review timestamps to be from August 2018 up until April 2019.\n", "\n", "We then write that out to a file named as per that defined two steps previously, and upload it into our S3 bucket. \n", "\n", "This is the minimum amount of data that Amazon Personalize needs to train a model on - you need just 1000 rows of user/item/timestamp interactions, but we still have many 10s of thousands of entries left from our original 100,000 review dataset. This file is known in Amazon Personalize as an **Interactions** data file. Other data files are usable, such as ones that define additional metadata about the movies (such as year and genre) and another that defines demographic data about the user (such as age, gender and location). In this lab we do not need them, but this information is available in the Movie Lens data set files that you have downloaded - you can create your own models based upon those at a later date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data[data['RATING'] > 3.6] # keep only movies rated 3.6 and above\n", "data = data[['USER_ID', 'ITEM_ID', 'TIMESTAMP']] # select columns that match the columns in the schema below\n", "data['TIMESTAMP'] = data['TIMESTAMP'] + 660833618 # make reviews end 1st April 2019 rather than 23rd April 1998\n", "data.to_csv(filename, index=False)\n", "\n", "boto3.Session().resource('s3').Bucket(bucket).Object(filename).upload_file(filename)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Schema\n", "\n", "Amazon Personalize uses *Schemas* to tell it how to interpret your data files. This step defines the schema for our Interations file, which consists solely of a `USER_ID`, `ITEM_ID` and `TIMESTAMP`. Once defined we pass it into Personalize for use." 
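, "\n", "\n", "Before defining the schema it can be worth a quick sanity check that the dataframe we exported really does contain just those three columns, with a whole-number timestamp, since the column headers in the CSV have to line up with the schema fields. A minimal sketch, assuming the `data` dataframe from the earlier cells is still in memory:\n", "\n", "```python\n", "# Optional sanity check: the columns written to the CSV must match the schema below\n", "assert list(data.columns) == ['USER_ID', 'ITEM_ID', 'TIMESTAMP']\n", "assert data['TIMESTAMP'].dtype.kind == 'i'  # TIMESTAMP maps to the schema's 'long' type\n", "print(data.dtypes)\n", "```"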
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {\n", " \"type\": \"record\",\n", " \"name\": \"Interactions\",\n", " \"namespace\": \"com.amazonaws.personalize.schema\",\n", " \"fields\": [\n", " {\n", " \"name\": \"USER_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"ITEM_ID\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"name\": \"TIMESTAMP\",\n", " \"type\": \"long\"\n", " }\n", " ],\n", " \"version\": \"1.0\"\n", "}\n", "\n", "create_schema_response = personalize.create_schema(\n", " name = \"summit-lab-recs-schema\",\n", " schema = json.dumps(schema)\n", ")\n", "\n", "schema_arn = create_schema_response['schemaArn']\n", "print(json.dumps(create_schema_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Wait for Dataset Group\n", "\n", "Now that we have defined a schema, and we have our Interactions data file, we can import the data into Personalize. But first we have to define a *Dataset Group*, which is essentially a collection of imported data files, trained models and campaigns - each Dataset Group can contain one, and only one, Interaction, Item Metadata and User Demographic file. When you train a model Personalize will use **all** data files present within its Dataset Group.\n", "\n", "#### Create Dataset Group" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_group_response = personalize.create_dataset_group(\n", " name = \"summit-recs-dataset-group\"\n", ")\n", "\n", "dataset_group_arn = create_dataset_group_response['datasetGroupArn']\n", "print(json.dumps(create_dataset_group_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Dataset Group to Have ACTIVE Status\n", "\n", "A number of Personalize API calls do take time, hence the calls are asynchronous. Before we can continue with the next stage we need to poll the status of the `create_dataset_group()` call from the previous code cell - once the Dataset Group is active then we can continue. **NOTE: this step should not take more than 1-2 minutes to complete**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status = None\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_group_response = personalize.describe_dataset_group(\n", " datasetGroupArn = dataset_group_arn\n", " )\n", " status = describe_dataset_group_response[\"datasetGroup\"][\"status\"]\n", " print(\"DatasetGroup: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Dataset\n", "\n", "We now have to create our dataset for the Interactions file. This step does not actually import any data, rather it creates an internal structure for the data to be imported into." 
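, "\n", "\n", "If you want to confirm what now exists inside the Dataset Group, once the cell below has run you can list its datasets with the standard `list_datasets` call - a minimal sketch:\n", "\n", "```python\n", "# Optional: after running the cell below, list the datasets that now exist in the group\n", "response = personalize.list_datasets(datasetGroupArn=dataset_group_arn)\n", "for ds in response['datasets']:\n", "    print(ds['name'], ds['datasetType'], ds['status'])\n", "```"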
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_type = \"INTERACTIONS\"\n", "create_dataset_response = personalize.create_dataset(\n", " datasetType = dataset_type,\n", " datasetGroupArn = dataset_group_arn,\n", " schemaArn = schema_arn,\n", " name=\"summit-recs-dataset\"\n", ")\n", "\n", "dataset_arn = create_dataset_response['datasetArn']\n", "print(json.dumps(create_dataset_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare, Create, and Wait for Dataset Import Job" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Attach policy to S3 bucket\n", "\n", "Whilst we have created an S3 bucket, and our Interactions data file is sat there waiting to be imported, we have a problem - you may have full access to the bucket via the AWS console or APIs, but the Amazon Personalize service does not. Hence, you have to create an S3 bucket policy that explicitly grants the service access to the `GetObject` and `ListBucket` commands in S3. This code step creates such a policy and attaches it to your S3 bucket.\n", "\n", "Note, any Personalize API calls that need to access you S3 bucket need to be done using an IAM role that gives it permission - this step simply allows the service to access the bucket if, and only if, roles with appropriate permissions are used." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3 = boto3.client(\"s3\")\n", "\n", "policy = {\n", " \"Version\": \"2012-10-17\",\n", " \"Id\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Statement\": [\n", " {\n", " \"Sid\": \"PersonalizeS3BucketAccessPolicy\",\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": \"personalize.amazonaws.com\"\n", " },\n", " \"Action\": [\n", " \"s3:GetObject\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::{}\".format(bucket),\n", " \"arn:aws:s3:::{}/*\".format(bucket)\n", " ]\n", " }\n", " ]\n", "}\n", "\n", "s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Find Personalize S3 Role ARN\n", "\n", "As part of the AWS Event Engine process we have defined an IAM role that gives Personalize the ability to access S3 buckets - as mentioned this is needed as well as the S3 bucket policy. As the Event Engine creates the IAM role via CloudFormation it will always have an essentially random numeric suffix, so we cannot hard-code it into the lab. This code cell looks for any any service role that has the name _PersonalizeS3RoleForLab_ in it and selects it as the ARN that we need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iam = boto3.client(\"iam\")\n", "\n", "role_name = \"PersonalizeS3RoleForLab\"\n", "prefix_name = \"/service-role\"\n", "role_list = iam.list_roles(PathPrefix=prefix_name)\n", "\n", "for role in role_list['Roles']:\n", " if role_name in (role['Arn']):\n", " role_arn = (role['Arn'])\n", " \n", "role_arn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Dataset Import Job\n", "\n", "This pulls together the information that we have on our Dataset, on our S3 bucket, on our Interactions file and a suitable role for Personalize, and then triggers the actual data import process." 
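, "\n", "\n", "Import jobs fail if the service cannot read the training file, so before triggering the job you may want a quick pre-flight check that the CSV really is in the bucket and that the role lookup above found an ARN - a minimal sketch, reusing the `s3`, `bucket`, `filename` and `role_arn` variables from the earlier cells:\n", "\n", "```python\n", "# Optional pre-flight check: confirm the CSV is in S3 and that we found a role ARN\n", "s3.head_object(Bucket=bucket, Key=filename)  # raises an error if the object is missing\n", "print(\"Importing s3://{}/{} using role {}\".format(bucket, filename, role_arn))\n", "```"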
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_dataset_import_job_response = personalize.create_dataset_import_job(\n", " jobName = \"summit-recs-dataset-import-job\",\n", " datasetArn = dataset_arn,\n", " dataSource = {\n", " \"dataLocation\": \"s3://{}/{}\".format(bucket, filename)\n", " },\n", " roleArn = role_arn\n", ")\n", "\n", "dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']\n", "print(json.dumps(create_dataset_import_job_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Dataset Import Job and Dataset Import Job Run to Have ACTIVE Status\n", "\n", "We now poll the status of Interactions file import job, as until it is complete we cannot continue. **Note: this can take anything between 12-25 minutes to complete**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status = None\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_dataset_import_job_response = personalize.describe_dataset_import_job(\n", " datasetImportJobArn = dataset_import_job_arn\n", " )\n", " \n", " dataset_import_job = describe_dataset_import_job_response[\"datasetImportJob\"]\n", " if \"latestDatasetImportJobRun\" not in dataset_import_job:\n", " status = dataset_import_job[\"status\"]\n", " print(\"DatasetImportJob: {}\".format(status))\n", " else:\n", " status = dataset_import_job[\"latestDatasetImportJobRun\"][\"status\"]\n", " print(\"LatestDatasetImportJobRun: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select Recipe\n", "\n", "There are many different algorithm recipes available within Personalize, this is a list of all supported algorithms at the time of the workshop. We are going to select the standard HRNN recipe, which only needs the Interactions file and not the Item metadata or User demographic files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recipe_list = [\n", " \"arn:aws:personalize:::recipe/aws-hrnn\",\n", " \"arn:aws:personalize:::recipe/aws-hrnn-coldstart\",\n", " \"arn:aws:personalize:::recipe/aws-hrnn-metadata\",\n", " \"arn:aws:personalize:::recipe/aws-personalized-ranking\",\n", " \"arn:aws:personalize:::recipe/aws-popularity-count\",\n", " \"arn:aws:personalize:::recipe/aws-sims\"\n", "]\n", "\n", "recipe_arn = recipe_list[0]\n", "print(recipe_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Wait for Solution\n", "\n", "With our data imported we can now train our ML solution. 
This consists of just a single API to Personalize, where we specify the Dataset to use.\n", "\n", "#### Create Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_solution_response = personalize.create_solution(\n", " name = \"summit-lab-recs-solution\",\n", " datasetGroupArn = dataset_group_arn,\n", " recipeArn = recipe_arn\n", ")\n", "\n", "solution_arn = create_solution_response['solutionArn']\n", "print(json.dumps(create_solution_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Solution Version" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_solution_version_response = personalize.create_solution_version(\n", " solutionArn = solution_arn\n", ")\n", "\n", "solution_version_arn = create_solution_version_response['solutionVersionArn']\n", "print(json.dumps(create_solution_version_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Solution to Have ACTIVE Status\n", "\n", "We now poll the status of solution creation job, as until it is complete we cannot continue. **Note: this can take anything between 25-50 minutes to complete**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status = None\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_solution_version_response = personalize.describe_solution_version(\n", " solutionVersionArn = solution_version_arn\n", " )\n", " status = describe_solution_version_response[\"solutionVersion\"][\"status\"]\n", " print(\"SolutionVersion: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create Additional Solutions in the Console\n", "\n", "Whilst you're waiting for this to complete, jump back into the original Lab Guidebook - there we will walk you through creating two additional solutions in parallel using the same dataset; one for Personalized Rankings and one for Item-to-Item Similarities (or SIMS), both of which can be used in the final application. Once you've begun to create both additional solutions you can come back here and continue.\n", "\n", "#### Get Metrics of Solution\n", "\n", "Once the soluition is built you can look up the various metrics that Personalize provides - this allows you to see how well a model has been trained. If you are re-training models after the acquisition of new data then these metrics can tell you if the models are training equally as well as before, better than before or worse than before, giving you the information that you need in order to decide whether or not to push a new model into Production. You can also compare results across multiple different algorithm recipes, helping you choose the best performing one for you particular dataset.\n", "\n", "You can find details on each of the metrics in our [documentation](https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html)." 
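, "\n", "\n", "If you just want a headline number to compare against your other solutions, you can pull individual metrics out of the response returned by the cell below rather than reading the raw JSON - a small sketch that assumes the usual metric names returned by the service:\n", "\n", "```python\n", "# Optional: pick out a couple of headline metrics for easy side-by-side comparison\n", "metrics = get_solution_metrics_response['metrics']\n", "for name in ('precision_at_25', 'normalized_discounted_cumulative_gain_at_25'):\n", "    if name in metrics:\n", "        print(\"{}: {:.4f}\".format(name, metrics[name]))\n", "```"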
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "get_solution_metrics_response = personalize.get_solution_metrics(\n", " solutionVersionArn = solution_version_arn\n", ")\n", "\n", "print(json.dumps(get_solution_metrics_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and Wait for Campaign\n", "\n", "A trained model is exactly that - just a model. In order to use it you need to create an API endpoint, and you do this by creating a *Campaign*. A Campaign simply provides the endpoint for a specific version of your model, and as such you are able to host endpoint for multiple versions of your models simultaneously, allowing you to do things like A/B testing of new models.\n", "\n", "At the campaign level we specify the the minimum deployed size of the inference engine in terms of transactions per second - whilst this engine can scale up and down dynamically it will never scale below this level, but please note that pricing for Personalize is heavily based around the number of TPS currently deployed.\n", "\n", "#### Create campaign" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_campaign_response = personalize.create_campaign(\n", " name = \"summit-lab-recs-campaign\",\n", " solutionVersionArn = solution_version_arn,\n", " minProvisionedTPS = 1\n", ")\n", "\n", "campaign_arn = create_campaign_response['campaignArn']\n", "print(json.dumps(create_campaign_response, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Wait for Campaign to Have ACTIVE Status\n", "\n", "We now poll the status of Campaign creation job, as until it is complete we cannot continue. **Note: this can take anything between 3-15 minutes to complete**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "status = None\n", "max_time = time.time() + 3*60*60 # 3 hours\n", "while time.time() < max_time:\n", " describe_campaign_response = personalize.describe_campaign(\n", " campaignArn = campaign_arn\n", " )\n", " status = describe_campaign_response[\"campaign\"][\"status\"]\n", " print(\"Campaign: {}\".format(status))\n", " \n", " if status == \"ACTIVE\" or status == \"CREATE FAILED\":\n", " break\n", " \n", " time.sleep(60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get Recommendations\n", "\n", "Finally, we have a deployed Campaign end-point, which hosts a specific version of our trained model - we are now able to make recommendation inference requests against it. However, the Personalize recommendation calls just return itemID values - it returns no context around the title of the movie, Hence, we use our pre-loaded version of the *u.item* file that contains the movie titles - we load in the file via **pandas** library calls and pick out a random userID and movieID from our training date. 
This info is displayed, along with the name of the movie.\n", "\n", "#### Select a User and an Item" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "items = pd.read_csv('./u.item', sep='\\t', usecols=[0,1], header=None)\n", "items.columns = ['ITEM_ID', 'TITLE']\n", "\n", "user_id, item_id, _ = data.sample().values[0]\n", "item_title = items.loc[items['ITEM_ID'] == item_id].values[0][-1]\n", "print(\"USER: {}\".format(user_id))\n", "print(\"ITEM: {}\".format(item_title))\n", "\n", "items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Call GetRecommendations\n", "\n", "The last thing we do is actually make a Personalize recommendations inference call - as you can see from the Code cell this is literally a single line of code with a userID and itemID as input variables (although, strictly speaking, you only need the userID for the datasets that we have).\n", "\n", "The inference call returns a list of up to 25 itemIDs from the training set - we take that and look up the corresponding movie titles from the *u.item* file and display them; this is far more useful than just a list of ID values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "get_recommendations_response = personalize_runtime.get_recommendations(\n", " campaignArn = campaign_arn,\n", " userId = str(user_id)\n", ")\n", "\n", "item_list = get_recommendations_response['itemList']\n", "title_list = [items.loc[items['ITEM_ID'] == np.int(item['itemId'])].values[0][-1] for item in item_list]\n", "\n", "print(\"Recommendations: {}\".format(json.dumps(title_list, indent=2)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }