{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Build a Customer Churn Model for Music Streaming App Users: Model Selection and Model Explainability\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Background\n", "\n", "This notebook is one of a sequence of notebooks that show you how to use various SageMaker functionalities to build, train, and deploy a model end to end, covering data pre-processing (ingestion, cleaning, and processing), feature engineering, training and hyperparameter tuning, model explainability, and finally deployment. The demo has two parts: \n", "\n", "1. Build a Customer Churn Model for Music Streaming App Users: Overview and Data Preparation - you will process the data with the help of Data Wrangler, then create features from the cleaned data. By the end of part 1, you will have a complete feature data set that contains all attributes built for each user and is ready for modeling.\n", "1. Build a Customer Churn Model for Music Streaming App Users: Model Selection and Model Explainability (current notebook) - you will use the data set built in part 1 to find an optimal model for the use case, then test the model's predictive performance on the test data. \n", "\n", "To set up the SageMaker Studio Notebook environment, see the [onboarding video](https://www.youtube.com/watch?v=wiDHCWVrjCU&feature=youtu.be). 
For a list of services covered in the use case demo, see the documentation linked in each section.\n", "\n", "\n", "## Content\n", "* [Model Selection](#Model-Selection)\n", "* [Training with SageMaker Estimator and Experiment](#Training-with-SageMaker-Estimator-and-Experiment)\n", "* [Hyperparameter Tuning with SageMaker Hyperparameter Tuning Job](#Hyperparameter-Tuning-with-SageMaker-Hyperparameter-Tuning-Job)\n", "* [Deploy the model with SageMaker Batch-transform](#Deploy-the-model-with-SageMaker-Batch-transform)\n", "* [Model Explainability with SageMaker Clarify](#Model-Explainability-with-SageMaker-Clarify)\n", "* [Optional: Automate your training and model selection with SageMaker Autopilot (Console)](#Optional:-Automate-your-training-and-model-selection-with-SageMaker-Autopilot-(Console))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "### What is Customer Churn and why is it important for businesses?\n", "\n", "Customer churn, also called customer attrition (the opposite of retention), means a customer leaves and stops paying for a business. It is one of the primary metrics companies track to gauge customer satisfaction, especially under a subscription-based business model. A company can track churn rate (the percentage of customers who churned during a period) as a health indicator for the business, but it is more useful to identify at-risk customers before they churn and offer appropriate treatment to keep them, and this is where machine learning comes into play.\n", "\n", "### Use Cases for Customer Churn\n", "\n", "Any subscription-based business would track customer churn as one of the most critical Key Performance Indicators (KPIs). 
Such companies and industries include telecom companies (cable, cell phone, internet, etc.), digital media subscriptions (news, forums, blogging platforms, etc.), music and video streaming services, and other Software as a Service (SaaS) providers (e-commerce, CRM, Mar-Tech, cloud computing, video conferencing, visualization and data science tools, etc.).\n", "\n", "### Define Business problem\n", "\n", "To start with, here are some common business problems to consider, depending on your specific use case and focus:\n", "\n", " * Will this customer churn (cancel the plan, cancel the subscription)?\n", " * Will this customer downgrade a pricing plan?\n", " * For a subscription business model, will a customer renew their subscription?\n", "\n", "### Machine learning problem formulation\n", "\n", "#### Classification: will this customer churn?\n", "\n", "The goal of classification is to identify at-risk customers and sometimes their unusual behavior, for example: will this customer churn or downgrade their plan? Is there any unusual behavior for a customer? The latter question can be formulated as an anomaly detection problem.\n", "\n", "#### Time Series: will this customer churn in the next X months? 
When will this customer churn?\n", "\n", "You can further explore your users by formulating the problem as a time series problem and predicting when the customer will churn.\n", "\n", "### Data Requirements\n", "\n", "#### Data collection Sources\n", "\n", "Some of the most common data sources used to construct a data set for churn analysis are:\n", "* Customer Relationship Management platform (CRM), \n", "* engagement and usage data (analytics services), \n", "* passive feedback (ratings based on your request), and active feedback (customer support requests, feedback on social media and review platforms).\n", "\n", "#### Construct a Data Set for Churn Analysis\n", "\n", "Raw data collected from the sources mentioned above is usually large and often needs a lot of cleaning and pre-processing. For example, usage data is usually event-based log data and can exceed a few gigabytes per day; you can aggregate it to daily user-level records for further analysis. Feedback and review data are mostly text, so you need to clean and pre-process the natural language into normalized, machine-readable data. If you are joining multiple data sources (especially from different platforms), make sure all data points are consistent and that user identities can be matched across platforms.\n", " \n", "#### Challenges with Customer Churn\n", "\n", "* Business related\n", "    * Importance of domain knowledge: this is critical when you start building features for the machine learning model. 
It is important to understand the business well enough to decide which features would drive retention.\n", "* Data issues\n", "    * Less churn data available (imbalanced classes): data for churn analysis is often very imbalanced because most customers of a business are (usually) happy customers.\n", "    * User identity mapping problem: if you are joining data from different platforms (CRM, email, feedback, mobile app, and website usage data), you need to make sure user A is recognized as the same user across all of them. There are third-party solutions that help you tackle this problem.\n", "    * Not collecting the right data for the use case, or lacking enough data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model Selection\n", "\n", "You can experiment with all your model choices and see which one gives better results. A few things to note when you choose algorithms:\n", "* **Start with simple ones**: For tabular data classification that does not involve complex unstructured data (text, audio, image, etc.), you can usually start with logistic regression to see how it performs on your data, as sometimes the simplest model gives great results if your data have a strong linear pattern.\n", "\n", "* **Think about your data structure**: For imbalanced class data like churn analysis, you can experiment with tree-based models like random forest, gradient boosting, or XGBoost, since they tend to be less sensitive to class imbalance.\n", "\n", "* **Interpretability**: Logistic regression models generally have better interpretability because of their linearity. You can also use feature importance from tree-based models or Support Vector Machines as an overall observation, but not at the level of individual predictions. 
For instance-level explanations, you can use tools like SHAP or SageMaker Clarify to see which features contribute more to your prediction results.\n", "\n", "In this use case, the tree-based model XGBoost is chosen because of the class imbalance; within the family of tree-based models, XGBoost often gives strong results, as it is built for model performance and computational speed. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training with SageMaker Estimator and Experiment\n", "\n", "Once you decide on a range of models you want to experiment with, you can start training and comparing model results to choose the best one. A few decisions remain:\n", "* SageMaker estimator configuration\n", "    * To initialize your training job, you need to configure your SageMaker estimator and SageMaker training image by specifying the model choice and the instance type and size.\n", "* Choose evaluation methods\n", "    * You can check the [model parameter documentation page](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters) for all the evaluation metrics you can choose for a model. For an imbalanced classification problem, you can choose F1 as your evaluation metric, especially for comparing different models; area under the curve (`auc`) is also a good choice when your output is a probability.\n", "* Hyper-parameters\n", "    * You can look at the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) for a complete list of hyper-parameters tunable for the model (XGBoost is used here as an example). 
For best performance, you can experiment with a range of combinations for the hyper-parameters and compare the validation results.\n", " \n", "### How to create a training job as a trial in SageMaker Experiment \n", "\n", "#### Get ECR image URIs for pre-built SageMaker Docker images" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install sagemaker-experiments" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker\n", "import json\n", "import pandas as pd\n", "import glob\n", "import s3fs\n", "import boto3\n", "from datetime import datetime\n", "import os" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker_session = sagemaker.Session()\n", "s3 = sagemaker_session.boto_session.resource(\"s3\")\n", "\n", "region = boto3.Session().region_name\n", "role = sagemaker.get_execution_role()\n", "smclient = boto3.Session().client(\"sagemaker\")\n", "bucket = sagemaker_session.default_bucket()\n", "prefix = \"music-streaming\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Download Data and Upload to S3\n", "\n", "We ingest the simulated data from a public SageMaker example S3 bucket. 
If you want to see how the train, test, and validation datasets are created in detail, look at [Build a Customer Churn Model for Music Streaming App Users: Overview and Data Preparation](0_cust_churn_overview_dw.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##### Alternative: copy data from a public S3 bucket to your own bucket\n", "##### data file should include full_data.csv and sample.json\n", "#### cell 5 - 7 is not needed; the processing job before data wrangler screenshots is not needed\n", "!mkdir -p data/raw\n", "s3 = boto3.client(\"s3\")\n", "s3.download_file(\n", "    f\"sagemaker-example-files-prod-{region}\",\n", "    \"datasets/tabular/customer-churn/customer-churn-data-v2.zip\",\n", "    \"data/raw/customer-churn-data.zip\",\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!unzip -o ./data/raw/customer-churn-data.zip -d ./data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# unzip the partitioned data files into the same folder\n", "!unzip -o data/simu-1.zip -d data/raw\n", "!unzip -o data/simu-2.zip -d data/raw\n", "!unzip -o data/simu-3.zip -d data/raw\n", "!unzip -o data/simu-4.zip -d data/raw" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!rm ./data/raw/*.zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!unzip -o data/sample.zip -d data/raw" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!aws s3 cp ./data/raw s3://$bucket/$prefix/data/json/ --recursive" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_input_train = (\n", "    boto3.Session()\n", "    .resource(\"s3\")\n", "    .Bucket(bucket)\n", "    .Object(os.path.join(prefix, \"train/train.csv\"))\n", "    .upload_file(\"data/train_updated.csv\")\n", 
")\n", "s3_input_validation = (\n", "    boto3.Session()\n", "    .resource(\"s3\")\n", "    .Bucket(bucket)\n", "    .Object(os.path.join(prefix, \"validation/validation.csv\"))\n", "    .upload_file(\"data/validation_updated.csv\")\n", ")\n", "# the labeled test set gets its own name so it does not overwrite s3_input_validation\n", "s3_input_test = (\n", "    boto3.Session()\n", "    .resource(\"s3\")\n", "    .Bucket(bucket)\n", "    .Object(os.path.join(prefix, \"test/test_labeled.csv\"))\n", "    .upload_file(\"data/test_updated.csv\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Initialize Model Hyperparameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hyperparameters = {\n", "    \"max_depth\": \"12\",\n", "    \"eta\": \"0.08\",\n", "    \"gamma\": \"4\",\n", "    \"min_child_weight\": \"7\",\n", "    \"subsample\": \"0.7\",\n", "    \"eval_metric\": \"auc\",\n", "    \"objective\": \"binary:logistic\",\n", "    \"num_round\": \"800\",\n", "    \"early_stopping_rounds\": \"50\",\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define SageMaker estimator" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from time import gmtime, strftime\n", "\n", "container = sagemaker.image_uris.retrieve(\n", "    \"xgboost\", region, version=\"1.0-1\", instance_type=\"ml.m4.xlarge\"\n", ")\n", "\n", "\n", "xgb = sagemaker.estimator.Estimator(\n", "    container,\n", "    role,\n", "    instance_count=1,\n", "    instance_type=\"ml.m4.xlarge\",\n", "    output_path=\"s3://{}/{}/output\".format(bucket, prefix),\n", "    sagemaker_session=sagemaker_session,\n", ")\n", "\n", "xgb.set_hyperparameters(**hyperparameters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import TrainingInput\n", "\n", "content_type = \"csv\"\n", "train_input = TrainingInput(\n", "    \"s3://{}/{}/{}/\".format(bucket, prefix, \"train\"), content_type=content_type\n", ")\n", "validation_input = TrainingInput(\n", 
"    \"s3://{}/{}/{}/\".format(bucket, prefix, \"validation\"), content_type=content_type\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "%%time\n", "xgb.fit(inputs={\"train\": train_input, \"validation\": validation_input}, wait=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Define SageMaker Experiment and Trial" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# custom experiment and trial names\n", "experiment_name = \"music-streaming-churn-exp-{}\".format(datetime.now().strftime(\"%Y%m%d-%H%M%S\"))\n", "trial_name_xgb = \"xgboost-{}\".format(datetime.now().strftime(\"%Y%m%d-%H%M%S\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from smexperiments import experiment, trial\n", "from sagemaker import analytics\n", "\n", "# create experiment if it doesn't exist\n", "try:\n", "    my_experiment = experiment.Experiment.load(experiment_name=experiment_name)\n", "    print(f\"Experiment loaded {experiment_name}: SUCCESS\")\n", "except Exception as ex:\n", "    if \"ResourceNotFound\" in str(ex):\n", "        my_experiment = experiment.Experiment.create(experiment_name=experiment_name)\n", "        print(f\"Experiment creation {experiment_name}: SUCCESS\")\n", "    else:\n", "        raise  # surface unexpected errors instead of leaving my_experiment undefined\n", "\n", "# create the trial if it doesn't exist\n", "try:\n", "    my_trial = trial.Trial.load(trial_name=trial_name_xgb)\n", "    print(f\"Trial loaded {trial_name_xgb}: SUCCESS\")\n", "except Exception as ex:\n", "    if \"ResourceNotFound\" in str(ex):\n", "        my_trial = trial.Trial.create(experiment_name=experiment_name, trial_name=trial_name_xgb)\n", "        print(f\"Create trial {my_trial.trial_name}: SUCCESSFUL\")\n", "    else:\n", "        raise  # surface unexpected errors instead of leaving my_trial undefined\n", "\n", "\n", "xgb.fit(\n", "    inputs={\"train\": train_input, \"validation\": validation_input},\n", "    wait=True,\n", "    experiment_config={\n", "        \"ExperimentName\": my_experiment.experiment_name,\n", "        \"TrialName\": my_trial.trial_name,\n", 
"        \"TrialComponentDisplayName\": \"churn-xgboost\",\n", "    },\n", "    logs=True,\n", ")\n", "\n", "trial_component_analytics = analytics.ExperimentAnalytics(\n", "    experiment_name=my_experiment.experiment_name\n", ")\n", "\n", "analytic_table = trial_component_analytics.dataframe()\n", "analytic_table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hyperparameter Tuning with SageMaker Hyperparameter Tuning Job\n", "\n", "Now that you understand how training one model works and how to create a SageMaker experiment, and have selected XGBoost as the final model, you need to tune the hyperparameters for the best model performance. For an XGBoost model, you can start by defining ranges for eta, alpha, min_child_weight, and max_depth. You can check the [documentation when considering which hyperparameters to tune](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-considerations.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify the Hyperparameter Tuning Job Settings\n", "\n", "To specify settings for the hyperparameter tuning job, you define a JSON object. You pass the object as the value of the HyperParameterTuningJobConfig parameter to CreateHyperParameterTuningJob when you create the tuning job." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuning_job_config = {\n", "    \"ParameterRanges\": {\n", "        \"CategoricalParameterRanges\": [],\n", "        \"ContinuousParameterRanges\": [\n", "            {\"MaxValue\": \"1\", \"MinValue\": \"0\", \"Name\": \"eta\"},\n", "            {\"MaxValue\": \"2\", \"MinValue\": \"0\", \"Name\": \"alpha\"},\n", "            {\"MaxValue\": \"10\", \"MinValue\": \"1\", \"Name\": \"min_child_weight\"},\n", "        ],\n", "        \"IntegerParameterRanges\": [{\"MaxValue\": \"10\", \"MinValue\": \"1\", \"Name\": \"max_depth\"}],\n", "    },\n", "    \"ResourceLimits\": {\"MaxNumberOfTrainingJobs\": 20, \"MaxParallelTrainingJobs\": 3},\n", "    \"Strategy\": \"Bayesian\",\n", "    \"TrainingJobEarlyStoppingType\": \"Auto\",\n", "    \"HyperParameterTuningJobObjective\": {\"MetricName\": \"validation:auc\", \"Type\": \"Maximize\"},\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure the Training Jobs\n", "\n", "To configure the training jobs that the tuning job launches, define a JSON object that you pass as the value of the TrainingJobDefinition parameter of the CreateHyperParameterTuningJob call." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_input_train = \"s3://{}/{}/train\".format(bucket, prefix)\n", "s3_input_validation = \"s3://{}/{}/validation\".format(bucket, prefix)\n", "\n", "training_job_definition = {\n", "    \"AlgorithmSpecification\": {\"TrainingImage\": container, \"TrainingInputMode\": \"File\"},\n", "    \"InputDataConfig\": [\n", "        {\n", "            \"ChannelName\": \"train\",\n", "            \"CompressionType\": \"None\",\n", "            \"ContentType\": \"csv\",\n", "            \"DataSource\": {\n", "                \"S3DataSource\": {\n", "                    \"S3DataDistributionType\": \"FullyReplicated\",\n", "                    \"S3DataType\": \"S3Prefix\",\n", "                    \"S3Uri\": s3_input_train,\n", "                }\n", "            },\n", "        },\n", "        {\n", "            \"ChannelName\": \"validation\",\n", "            \"CompressionType\": \"None\",\n", "            \"ContentType\": \"csv\",\n", "            \"DataSource\": {\n", "                \"S3DataSource\": {\n", "                    \"S3DataDistributionType\": \"FullyReplicated\",\n", "                    \"S3DataType\": \"S3Prefix\",\n", "                    \"S3Uri\": s3_input_validation,\n", "                }\n", "            },\n", "        },\n", "    ],\n", "    \"OutputDataConfig\": {\"S3OutputPath\": \"s3://{}/{}/output\".format(bucket, prefix)},\n", "    \"ResourceConfig\": {\"InstanceCount\": 2, \"InstanceType\": \"ml.c4.2xlarge\", \"VolumeSizeInGB\": 10},\n", "    \"RoleArn\": role,\n", "    \"StaticHyperParameters\": {\n", "        \"eval_metric\": \"auc\",\n", "        \"num_round\": \"100\",\n", "        \"objective\": \"binary:logistic\",\n", "        \"rate_drop\": \"0.3\",\n", "        \"tweedie_variance_power\": \"1.4\",\n", "    },\n", "    \"StoppingCondition\": {\"MaxRuntimeInSeconds\": 43200},\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Name and Launch the Hyperparameter Tuning Job\n", "\n", "Now you can provide a name for the hyperparameter tuning job and then launch it by calling the CreateHyperParameterTuningJob API. Pass the tuning_job_config and training_job_definition objects that you created in the previous steps as the values of its parameters." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# custom tuning job name\n", "tuning_job_name = \"ChurnPredictTune-{}\".format(datetime.now().strftime(\"%Y%m%d-%H%M%S\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if the tuning job has already been created\n", "list_tuning_job = smclient.list_hyper_parameter_tuning_jobs(NameContains=tuning_job_name)\n", "job_results = [[i for i in list_tuning_job[x]] for x in list_tuning_job.keys()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "smclient.list_hyper_parameter_tuning_jobs(NameContains=tuning_job_name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import HyperparameterTuner\n", "\n", "# create the tuning job if it doesn't exist\n", "try:\n", "    if tuning_job_name == job_results[0][0][\"HyperParameterTuningJobName\"]:\n", "        print(\"Tuning job exists\")\n", "except Exception as ex:\n", "    # no matching job was listed, so create the hyperparameter tuning job\n", "    smclient.create_hyper_parameter_tuning_job(\n", "        HyperParameterTuningJobName=tuning_job_name,\n", "        HyperParameterTuningJobConfig=tuning_job_config,\n", "        TrainingJobDefinition=training_job_definition,\n", "    )\n", "    print(f\"Create tuning job {tuning_job_name}: SUCCESSFUL\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Monitor the Progress of a Hyperparameter Tuning Job\n", "\n", "To monitor the progress of a hyperparameter tuning job and the training jobs that it launches, you can use the Amazon SageMaker console.\n", "\n", "