{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Product Success When Review Data Is Available\n", "_**Using XGBoost to Predict Whether Sales will Exceed the \"Hit\" Threshold**_\n", "\n", "---\n", "\n", "---\n", "\n", "## Contents\n", "\n", "1. [Background](#Background)\n", "1. [Setup](#Setup)\n", "1. [Data](#Data)\n", "1. [Train](#Train)\n", "1. [Host](#Host)\n", "1. [Evaluation](#Evaluation)\n", "1. [Extensions](#Extensions)\n", "\n", "\n", "## Background\n", "\n", "Word of mouth in the form of user reviews, critic reviews, social media comments, etc. often can provide insights about whether a product ultimately will be a success. In the video game industry in particular, reviews and ratings can have a large impact on a game's success. However, not all games with bad reviews fail, and not all games with good reviews turn out to be hits. To predict hit games, machine learning algorithms potentially can take advantage of various relevant data attributes in addition to reviews. \n", "\n", "For this notebook, we will work with the data set [Video Game Sales with Ratings](https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings) from Kaggle. This [Metacritic](http://www.metacritic.com/browse/games/release-date/available) data includes attributes for user reviews as well as critic reviews, sales, ESRB ratings, among others. Both user reviews and critic reviews are in the form of ratings scores, on a scale of 0 to 10 or 0 to 100. Although this is convenient, a significant issue with the data set is that it is relatively small. \n", "\n", "Dealing with a small data set such as this one is a common problem in machine learning. This problem often is compounded by imbalances between the classes in the small data set. In such situations, using an ensemble learner can be a good choice. This notebook will focus on using XGBoost, a popular ensemble learner, to build a classifier to determine whether a game will be a hit. \n", "\n", "## Setup\n", "\n", "Let's start by specifying:\n", "\n", "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n", "- The IAM role arn used to give training and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the `get_execution_role()` call with the appropriate full IAM role arn string(s).\n", "\n", "### Lab time\n", "This lab takes around 10 to 15 minutes to complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import timeit\n", "start_time = timeit.default_timer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "isConfigCell": true }, "outputs": [], "source": [ "import numpy as np \n", "import pandas as pd \n", "import matplotlib.pyplot as plt \n", "from IPython.display import Image \n", "from IPython.display import display \n", "from sklearn.datasets import dump_svmlight_file \n", "from time import gmtime, strftime \n", "import sys \n", "import math \n", "import json\n", "import boto3\n", "import sagemaker\n", "\n", "role = sagemaker.get_execution_role()\n", "\n", "bucket = '' # replace with an existing bucket if needed\n", "prefix = 'sagemaker/videogames_xgboost'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll import the Python libraries we'll need." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Data\n", "\n", "Before proceeding further, let's download the data set from a public S3 bucket to this notebook instance's local storage. Then we'll take an initial look at the data." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_data_filename = 'Video_Games_Sales_as_at_22_Dec_2016.csv'\n", "data_bucket = 'sagemaker-workshop-pdx'\n", "\n", "s3 = boto3.resource('s3')\n", "s3.Bucket(data_bucket).download_file(raw_data_filename, '/tmp/raw_data.csv')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv('/tmp/raw_data.csv')\n", "pd.set_option('display.max_rows', 20)\n", "data" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to decide on a target to predict. Video game development budgets can run into the tens of millions of dollars, so it is critical for game publishers to publish \"hit\" games to recoup their costs and make a profit. As a proxy for what constitutes a \"hit\" game, we will set a target of more than 1 million units in global sales." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['y'] = (data['Global_Sales'] > 1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With our target now defined, let's take a look at the imbalance between the \"hit\" and \"not a hit\" classes:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.bar(['not a hit', 'hit'], data['y'].value_counts())\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Not surprisingly, only a small fraction of games can be considered \"hits\" under our metric. Next, we'll choose features that have predictive power for our target. We'll begin by plotting review scores versus global sales to check our hunch that such scores have an impact on sales. A logarithmic scale is used for clarity." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viz = data.filter(['User_Score', 'Critic_Score', 'Global_Sales'], axis=1)\n", "# User_Score contains non-numeric strings such as 'tbd', so coerce them to NaN,\n", "# then fill the NaNs with Critic_Score scaled from 0-100 down to 0-10\n", "viz['User_Score'] = viz['User_Score'].apply(pd.to_numeric, errors='coerce')\n", "viz['User_Score'] = viz['User_Score'].mask(np.isnan(viz['User_Score']), viz['Critic_Score'] / 10.0)\n", "viz.plot(kind='scatter', logx=True, logy=True, x='Critic_Score', y='Global_Sales')\n", "viz.plot(kind='scatter', logx=True, logy=True, x='User_Score', y='Global_Sales')\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Our intuition about the relationship between review scores and sales seems justified. We also note in passing that other relevant features can be extracted from the data set. For example, the ESRB rating has an impact: games rated \"E\" for Everyone typically reach a wider audience than games with an age-restricted \"M\" for Mature rating, though, depending on the genre (such as shooter or action), M-rated games can also be huge hits. Our model hopefully will learn these relationships and others.\n"
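] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can spot-check the ESRB intuition directly. The tabulation below is an optional exploration, not part of the original lab; it shows the fraction of hits within each `Rating` category (exact values depend on the data snapshot)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional exploration: fraction of games that are hits, by ESRB rating\n", "data.groupby('Rating')['y'].mean().sort_values(ascending=False)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Whatever the exact fractions, keep in mind that they reflect this particular data snapshot rather than the industry as a whole.\n"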
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, looking at the feature columns of this data set, we can identify several that should be excluded. For example, there are five columns that specify sales numbers: these numbers are directly related to the target we're trying to predict, so these columns must be dropped. Other features may simply be irrelevant, such as the name of the game, and we'll also drop a few columns, such as `Developer`, that would be hard to use without further feature engineering." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data.drop(['Name', 'Year_of_Release', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Count', 'User_Count', 'Developer'], axis=1)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "With the number of columns reduced, now is a good time to check how many values are missing from each column:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.isnull().sum()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As noted in Kaggle's overview of this data set, many review ratings are missing. Since those are crucial features for our predictions, and there is no reliable way to impute so many of them, we'll need to drop the rows that are missing them." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = data.dropna()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to resolve a problem in the User_Score column: it contains some 'tbd' string values, so it is not purely numeric. User_Score is properly a numeric rather than categorical feature, so we'll convert it from string type to numeric, which temporarily fills in NaNs for the 'tbd' entries. Next, we must decide what to do with those new NaNs. We've already thrown out a large number of rows, so if we can salvage these, we should. As a first approximation, we'll take the value in the Critic_Score column and divide it by 10, since user scores tend to track critic scores (though on a scale of 0 to 10 instead of 0 to 100)." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Coerce 'tbd' strings to NaN, then fill the NaNs with the scaled critic score\n", "data['User_Score'] = data['User_Score'].apply(pd.to_numeric, errors='coerce')\n", "data['User_Score'] = data['User_Score'].mask(np.isnan(data['User_Score']), data['Critic_Score'] / 10.0)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's do some final preprocessing of the data, converting the categorical features into numeric features using the one-hot encoding method." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['y'] = data['y'].apply(lambda y: 'yes' if y == True else 'no')\n", "model_data = pd.get_dummies(data)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "To help us detect and avoid overfitting, we'll randomly split the data into three groups. Specifically, the model will be trained on 70% of the data. It will then be evaluated on 20% of the data to give us an estimate of the accuracy we hope to see on \"new\" data. As a final testing data set, the remaining 10% will be held out until the end." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Shuffle, then split into 70% train / 20% validation / 10% test\n", "train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The data is now split into training, validation, and test sets.\n"
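] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (an optional step, not part of the original lab), we can confirm the 70/20/10 proportions and verify that each subset preserves a similar \"hit\" rate. The `y_yes` column referenced below is one of the indicator columns produced by `pd.get_dummies` above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check: row counts and hit rates of the three subsets\n", "for name, frame in [('train', train_data), ('validation', validation_data), ('test', test_data)]:\n", "    print('%-10s rows: %5d  hit rate: %.3f' % (name, len(frame), frame['y_yes'].mean()))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "A similar hit rate across the three subsets suggests the random split is reasonable for evaluating this imbalanced problem.\n"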
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SageMaker's XGBoost algorithm operates on data in libSVM format, which `dump_svmlight_file` writes with the features and the target variable provided as separate arguments. To avoid any misalignment due to random reordering, this feature/target separation is done only after the train/validation/test split in the cell above. As a last step before training, we'll copy the training and validation files to S3 as input for SageMaker's managed training; the test file stays on the notebook instance for local evaluation later." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dump_svmlight_file(X=train_data.drop(['y_no', 'y_yes'], axis=1), y=train_data['y_yes'], f='train.libsvm')\n", "dump_svmlight_file(X=validation_data.drop(['y_no', 'y_yes'], axis=1), y=validation_data['y_yes'], f='validation.libsvm')\n", "dump_svmlight_file(X=test_data.drop(['y_no', 'y_yes'], axis=1), y=test_data['y_yes'], f='test.libsvm')\n", "\n", "boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/train/train.libsvm').upload_file('train.libsvm')\n", "boto3.Session().resource('s3').Bucket(bucket).Object(prefix + '/validation/validation.libsvm').upload_file('validation.libsvm')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The training and validation channels are now in S3.\n"
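] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the libSVM format concrete, we can peek at the first lines of the file we just wrote. This is an optional, illustrative step, not part of the original lab." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: show the start of the first two lines of the converted training file\n", "with open('train.libsvm') as f:\n", "    for _ in range(2):\n", "        print(f.readline().strip()[:100], '...')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Each line begins with the 0/1 label, followed by sparse `index:value` feature pairs.\n"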
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "job_name = 'videogames-xgboost-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "print(\"Training job\", job_name)\n", "\n", "from sagemaker.amazon.amazon_estimator import get_image_uri\n", "container = get_image_uri(boto3.Session().region_name, 'xgboost', \"latest\")\n", "\n", "create_training_params = \\\n", "{\n", " \"RoleArn\": role,\n", " \"TrainingJobName\": job_name,\n", " \"AlgorithmSpecification\": {\n", " \"TrainingImage\": container,\n", " \"TrainingInputMode\": \"File\"\n", " },\n", " \"ResourceConfig\": {\n", " \"InstanceCount\": 1,\n", " \"InstanceType\": \"ml.c4.xlarge\",\n", " \"VolumeSizeInGB\": 10\n", " },\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"train\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": \"s3://{}/{}/train\".format(bucket, prefix),\n", " \"S3DataDistributionType\": \"FullyReplicated\"\n", " }\n", " },\n", " \"ContentType\": \"libsvm\",\n", " \"CompressionType\": \"None\"\n", " },\n", " {\n", " \"ChannelName\": \"validation\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": \"s3://{}/{}/validation\".format(bucket, prefix),\n", " \"S3DataDistributionType\": \"FullyReplicated\"\n", " }\n", " },\n", " \"ContentType\": \"libsvm\",\n", " \"CompressionType\": \"None\"\n", " }\n", " ],\n", " \"OutputDataConfig\": {\n", " \"S3OutputPath\": \"s3://{}/{}/xgboost-video-games/output\".format(bucket, prefix)\n", " },\n", " \"HyperParameters\": {\n", " \"max_depth\":\"3\",\n", " \"eta\":\"0.1\",\n", " \"eval_metric\":\"auc\",\n", " \"scale_pos_weight\":\"2.0\",\n", " \"subsample\":\"0.5\",\n", " \"objective\":\"binary:logistic\",\n", " \"num_round\":\"100\"\n", " },\n", " \"StoppingCondition\": {\n", " \"MaxRuntimeInSeconds\": 60 * 60\n", " }\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "sm = boto3.client('sagemaker')\n", "sm.create_training_job(**create_training_params)\n", "\n", "status = sm.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']\n", "print(status)\n", "\n", "try:\n", " sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)\n", "finally:\n", " status = sm.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']\n", " print(\"Training job ended with status: \" + status)\n", " if status == 'Failed':\n", " message = sm.describe_training_job(TrainingJobName=job_name)['FailureReason']\n", " print('Training failed with the following error: {}'.format(message))\n", " raise Exception('Training job failed')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Host\n", "\n", "Now that we've trained the XGBoost algorithm on our data, let's prepare the model for hosting on a SageMaker serverless endpoint. We will:\n", "\n", "1. Point to the scoring container\n", "1. Point to the model.tar.gz that came from training\n", "1. 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Host\n", "\n", "Now that we've trained the XGBoost algorithm on our data, let's prepare the model for hosting on a SageMaker real-time endpoint. We will:\n", "\n", "1. Point to the scoring container\n", "1. Point to the model.tar.gz that came from training\n", "1. Create the hosting model" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "create_model_response = sm.create_model(\n", "    ModelName=job_name,\n", "    ExecutionRoleArn=role,\n", "    PrimaryContainer={\n", "        'Image': container,\n", "        'ModelDataUrl': sm.describe_training_job(TrainingJobName=job_name)['ModelArtifacts']['S3ModelArtifacts']})\n", "\n", "print(create_model_response['ModelArn'])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll configure our hosting endpoint. Here we specify:\n", "\n", "1. The EC2 instance type to use for hosting\n", "1. The initial number of instances\n", "1. Our hosting model name\n", "\n", "After the endpoint has been configured, we'll create the endpoint itself." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgboost_endpoint_config = 'videogames-xgboost-endpoint-config-' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "print(xgboost_endpoint_config)\n", "create_endpoint_config_response = sm.create_endpoint_config(\n", "    EndpointConfigName=xgboost_endpoint_config,\n", "    ProductionVariants=[{\n", "        'InstanceType': 'ml.t2.medium',\n", "        'InitialInstanceCount': 1,\n", "        'ModelName': job_name,\n", "        'VariantName': 'AllTraffic'}])\n", "\n", "print(\"Endpoint Config Arn: \" + create_endpoint_config_response['EndpointConfigArn'])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "xgboost_endpoint = 'EXAMPLE-videogames-xgb-endpoint-' + strftime(\"%Y%m%d%H%M\", gmtime())\n", "print(xgboost_endpoint)\n", "create_endpoint_response = sm.create_endpoint(\n", "    EndpointName=xgboost_endpoint,\n", "    EndpointConfigName=xgboost_endpoint_config)\n", "print(create_endpoint_response['EndpointArn'])\n", "\n", "resp = sm.describe_endpoint(EndpointName=xgboost_endpoint)\n", "status = resp['EndpointStatus']\n", "print(\"Status: \" + status)\n", "\n", "try:\n", "    sm.get_waiter('endpoint_in_service').wait(EndpointName=xgboost_endpoint)\n", "finally:\n", "    resp = sm.describe_endpoint(EndpointName=xgboost_endpoint)\n", "    status = resp['EndpointStatus']\n", "    print(\"Arn: \" + resp['EndpointArn'])\n", "    print(\"Status: \" + status)\n", "\n", "    if status != 'InService':\n", "        message = sm.describe_endpoint(EndpointName=xgboost_endpoint)['FailureReason']\n", "        print('Endpoint creation failed with the following error: {}'.format(message))\n", "        raise Exception('Endpoint creation did not succeed')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The endpoint is now `InService` and ready to receive requests.\n"
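] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before scoring the whole test set, a single-record smoke test confirms the endpoint responds. This is an optional sketch, not part of the original lab; it reuses the same runtime API the evaluation below uses, and simply sends the first line of the local test file." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional smoke test: send one libSVM record to the endpoint\n", "smoke_runtime = boto3.client('runtime.sagemaker')\n", "with open('test.libsvm') as f:\n", "    sample = f.readline().strip()\n", "response = smoke_runtime.invoke_endpoint(EndpointName=xgboost_endpoint,\n", "                                         ContentType='text/x-libsvm',\n", "                                         Body=sample)\n", "print('predicted hit probability:', response['Body'].read().decode('utf-8'))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The response is a raw probability of the game being a hit; in the evaluation below we round it to a 0/1 prediction.\n"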
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Evaluation\n", "\n", "Now that we have our hosted endpoint, we can generate predictions from it. More specifically, let's generate predictions from our test data set to understand how well our model generalizes to data it has not seen yet.\n", "\n", "There are many ways to compare the performance of a machine learning model. We'll start simply by comparing actual and predicted values: whether the game was a \"hit\" (`1`) or not (`0`). Then we'll produce a confusion matrix, which shows how many test data points the model predicted in each category versus how many actually belonged in each category." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "runtime = boto3.client('runtime.sagemaker')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def do_predict(data, endpoint_name, content_type):\n", "    # Send a batch of libSVM rows to the endpoint and return rounded 0/1 predictions\n", "    payload = '\\n'.join(data)\n", "    response = runtime.invoke_endpoint(EndpointName=endpoint_name,\n", "                                       ContentType=content_type,\n", "                                       Body=payload)\n", "    result = response['Body'].read()\n", "    result = result.decode('utf-8')\n", "    result = result.split(',')\n", "    preds = [float(num) for num in result]\n", "    preds = [round(num) for num in preds]\n", "    return preds\n", "\n", "def batch_predict(data, batch_size, endpoint_name, content_type):\n", "    # Score the rows in batches, printing a dot per batch as a progress indicator\n", "    items = len(data)\n", "    arrs = []\n", "\n", "    for offset in range(0, items, batch_size):\n", "        if offset + batch_size < items:\n", "            results = do_predict(data[offset:(offset + batch_size)], endpoint_name, content_type)\n", "            arrs.extend(results)\n", "        else:\n", "            arrs.extend(do_predict(data[offset:items], endpoint_name, content_type))\n", "        sys.stdout.write('.')\n", "    return arrs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "with open('test.libsvm', 'r') as f:\n", "    payload = f.read().strip()\n", "\n", "# Use a new name for the list of libSVM rows so we don't clobber the test_data DataFrame\n", "test_rows = payload.split('\\n')\n", "labels = [int(row.split(' ')[0]) for row in test_rows]\n", "preds = batch_predict(test_rows, 100, xgboost_endpoint, 'text/x-libsvm')\n", "\n", "error_rate = sum(1 for i in range(len(preds)) if preds[i] != labels[i]) / float(len(preds))\n", "print('\\nerror rate = %f' % error_rate)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.crosstab(index=np.array(labels), columns=np.array(preds))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Of the 132 games in the test set that actually are \"hits\" by our metric, the model correctly identified 73, while the overall error rate is 13%. The number of false negatives versus true positives can be shifted substantially in favor of true positives by increasing the `scale_pos_weight` hyperparameter. Of course, this increase comes at the expense of reduced accuracy/increased error rate and more false positives. How to make this trade-off ultimately is a business decision based on the relative costs of false positives, false negatives, etc.\n"
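] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confusion matrix can be summarized with precision and recall for the \"hit\" class. This quick calculation (an optional addition, not part of the original lab) makes the trade-off discussion concrete, using the `labels` and `preds` computed above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: precision and recall for the \"hit\" class, from the same predictions\n", "labels_arr = np.array(labels)\n", "preds_arr = np.array(preds)\n", "tp = int(((preds_arr == 1) & (labels_arr == 1)).sum())\n", "fp = int(((preds_arr == 1) & (labels_arr == 0)).sum())\n", "fn = int(((preds_arr == 0) & (labels_arr == 1)).sum())\n", "print('precision: %.3f' % (tp / (tp + fp)))\n", "print('recall:    %.3f' % (tp / (tp + fn)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Raising `scale_pos_weight` should raise recall at the cost of precision; rerunning training with a larger value is an easy experiment.\n"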
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm.delete_endpoint(EndpointName=xgboost_endpoint)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# code you want to evaluate\n", "elapsed = timeit.default_timer() - start_time\n", "print(elapsed/60)" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" }, "notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." }, "nbformat": 4, "nbformat_minor": 2 }