--- title: "Personalize preprocessing" date: 2020-02-26T00:12:50-05:00 draft: false algo: [personalize] --- ### Import data Amazon Personalize imports data only from files that are in the comma-separated values (CSV) format. Amazon Personalize requires the first row of your CSV file to contain column headers. The column headers in your CSV file need to map to the schema to create the dataset. Don't enclose headers in quotation marks ("). Upload your file into an Amazon Simple Storage Service (Amazon S3) bucket that [Amazon Personalize can access](https://docs.aws.amazon.com/personalize/latest/dg/data-prep-upload-s3.html) Let's assume your dataset contains file named "ratings.csv" with following data. ```html userId,movieId,rating,timestamp 1,2,3.5,1112486027 1,29,3.5,1112484676 1,32,3.5,1112484819 1,47,3.5,1112484727 1,50,3.5,1112484580 ``` It reads like this: user 1 gave movie 2 a 3.5 rating. Same for movies 29, 32, 47, 50 and so on! Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. Name the file as "my-personalize-schema.json" ```JSON {"type": "record", "name": "Interactions", "namespace": "com.amazonaws.personalize.schema", "fields":[ {"name": "ITEM_ID", "type": "string"}, {"name": "USER_ID", "type": "string"}, {"name": "TIMESTAMP", "type": "long"} ], "version": "1.0"} ``` ### Prepare dataset Essentially, what the following python code does: - Shuffle reviews. - Keep only movies rated 4 and above, and drop the ratings columns: Purpose for this is to recommend movies that users should really like. - Rename columns to the names used in the schema. - Keep only 100,000 interactions to minimize training time ```python import pandas, boto3 from sklearn.utils import shuffle ratings = pandas.read_csv('ratings.csv') ratings = shuffle(ratings) ratings = ratings[ratings['rating']>4] ratings = ratings.drop(columns='rating') ratings.columns = ['USER_ID','ITEM_ID','TIMESTAMP'] ratings = ratings[:100000] ratings.to_csv('ratings.processed.csv',index=False) s3 = boto3.client('s3') s3.upload_file('ratings.processed.csv','my-personalize-bucket','ratings.processed.csv') ``` ### Create dataset group, create dataset and create import job - using Python, CLI and Console #### Python [Click here](https://docs.aws.amazon.com/personalize/latest/dg/getting-started-python.html) ```python import boto3 personalize = boto3.client('personalize') dsg_response = personalize.create_dataset_group(name = 'my-personalize-dataset-group') dsg_arn = dsg_response['datasetGroupArn'] with open('my-personalize-schema.json') as f: createSchemaResponse = personalize.create_schema( name = 'MyPersonalizeSchema', schema = f.read() ) schema_arn = createSchemaResponse['schemaArn'] ds_response = personalize.create_dataset( name = 'MyPersonalizeDataset', schemaArn = schema_arn, datasetGroupArn = dsg_arn, datasetType = 'Interactions') ds_arn = ds_response['datasetArn'] dsij_response = personalize.create_dataset_import_job( jobName = 'YourImportJob', datasetArn = ds_arn, dataSource = {'dataLocation':'s3://my-personalize-bucket/ratings.processed.csv'}, roleArn = 'role_arn') dsij_arn = dsij_response['datasetImportJobArn'] ``` #### CLI [Click here](https://docs.aws.amazon.com/personalize/latest/dg/getting-started-cli.html) ```html aws personalize create-dataset-group --name my-personalize-dataset-group aws personalize create-schema --name my-personalize-schema \ --schema file://my-personalize-schema.json aws personalize create-dataset --schema-arn $SCHEMA_ARN \ --dataset-group-arn $DATASET_GROUP_ARN \ 
#### CLI

[Click here](https://docs.aws.amazon.com/personalize/latest/dg/getting-started-cli.html)

```bash
aws personalize create-dataset-group --name my-personalize-dataset-group

aws personalize create-schema --name my-personalize-schema \
    --schema file://my-personalize-schema.json

aws personalize create-dataset --schema-arn $SCHEMA_ARN \
    --dataset-group-arn $DATASET_GROUP_ARN \
    --dataset-type INTERACTIONS

aws personalize create-dataset-import-job --job-name my-personalize-import-job \
    --role-arn $ROLE_ARN --dataset-arn $DATASET_ARN \
    --data-source dataLocation=s3://my-personalize-bucket/ratings.processed.csv
```

#### Console

[Click here](https://docs.aws.amazon.com/personalize/latest/dg/getting-started-console.html)
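Whichever route you take, the data must finish importing before you can train a solution. From the CLI, for example, you can check on the job like this (a sketch, assuming `$DATASET_IMPORT_JOB_ARN` holds the ARN returned by `create-dataset-import-job`):

```bash
# Prints the job status; wait for ACTIVE before training.
aws personalize describe-dataset-import-job \
    --dataset-import-job-arn $DATASET_IMPORT_JOB_ARN \
    --query 'datasetImportJob.status'
```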