{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predict Hospital Spending Per Patient with SageMaker Autopilot\n", "In this lab we'll get started with SageMaker using Autopilot! In particular we will download the Medicare dataset, clean it, and plug it into a framework for SageMaker Autopilot.\n", "\n", "You'll see the notebooks generated for you, the hundreds of models trained, in addition to your very own inference pipeline, deployable to a SageMaker endpoint or batch transform job!\n", "\n", "At the end, we'll set up a SHAP explainer to analyze local feature importance for a set of predictions. Let's get started!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the Mediare dataset as csv file to the notebook\n", "!wget -O Medicare_Hospital_Spending_by_Claim.csv https://data.medicare.gov/api/views/nrth-mfg3/rows.csv?accessType=DOWNLOAD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Preprocessing on the Raw Dataset\n", "In this section we read the raw csv data set into a pandas data frame. We inspect the data using pandas head() function. We do data pre-processing using feature encoding, feature engineering, column renaming, dropping some columns that have no relevance to the prediction of `Avg_Hosp` cost and examining there are no missing values in the data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the CSV file into panda dataframe and save it to another table so we can keep a copy of the original dataset\n", "# In our example we use the dataframe called table1 for all pre-processing, while the dataframe table\n", "# maintains a copy of the original data\n", "\n", "import pandas as pd\n", "table = pd.read_csv('Medicare_Hospital_Spending_by_Claim.csv')\n", "table1 = table.copy()\n", "table1.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Encode column \"State\"\n", "\n", "replace_map = {'State': {'AK': 1, 'AL': 2, 'AR': 3, 'AZ': 4, 'CA': 5, 'CO': 6, 'CT': 7, \n", " 'DC': 8, 'DE': 9, 'FL': 10, 'GA': 11, 'HI': 12, \n", " 'IA': 13, 'ID': 14, 'IL': 15, 'IN': 16, 'KS': 17, \n", " 'KY': 18, 'LA': 19, 'MA': 20, 'ME': 21, 'MI': 22, \n", " 'MN': 23, 'MO': 24, 'MS': 25, 'MT': 26, 'NC': 27, \n", " 'ND': 28, 'NE': 29, 'NH': 30, 'NJ': 31, 'NM': 32, \n", " 'NV': 33, 'NY': 34, 'OH': 35, 'OK': 36, 'OR': 37, \n", " 'PA': 38, 'RI': 39, 'SC': 40, 'SD': 41, 'TN': 42, \n", " 'TX': 43, 'UT': 44, 'VA': 45, 'VT': 46, 'WA': 47, \n", " 'WI': 48, 'WV': 49, 'WY': 50}}\n", "table1.replace(replace_map,inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Encode column \"Period\"\n", "\n", "replace_map = {'Period': {'1 to 3 days Prior to Index Hospital Admission': 1, \n", " 'During Index Hospital Admission': 2, \n", " '1 through 30 days After Discharge from Index Hospital Admission': 3, \n", " 'Complete Episode': 4}}\n", "table1.replace(replace_map,inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Encode column \"Claim Type\"\n", "\n", "replace_map = {'Claim Type': {'Home Health Agency': 1, \n", " 'Hospice': 2, \n", " 'Inpatient': 3, \n", " 'Outpatient': 4, \n", " 'Skilled Nursing Facility': 5, \n", " 'Durable Medical Equipment': 6, \n", " 'Carrier': 7, \n", " 'Total': 8}}\n", "table1.replace(replace_map,inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert the column \"Percent of Spending Hospital\tPercent of Spending\" to float, remove the percent sign and \n", "# divide by 100 to normalize for percentage\n", "\n", "table1['Percent of Spending Hospital'] = table1['Percent of Spending Hospital'].str.rstrip('%').astype('float')\n", "table1['Percent of Spending Hospital'] = table1['Percent of Spending Hospital']/100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert the column \"Percent of Spending State\" to float, remove the percent sign and \n", "# divide by 100 to normalize for percentage\n", "\n", "table1['Percent of Spending State'] = table1['Percent of Spending State'].str.rstrip('%').astype('float')\n", "table1['Percent of Spending State'] = table1['Percent of Spending State']/100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert the column \"Percent of Spending Nation\" to float, remove the percent sign and \n", "# divide by 100 to normalize for percentage\n", "\n", "table1['Percent of Spending Nation'] = table1['Percent of Spending Nation'].str.rstrip('%').astype('float')\n", "table1['Percent of Spending Nation'] = table1['Percent of Spending Nation']/100" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop Column \"Facility Name\", Facility Id related to the facility, hence facility name is not\n", "# relevant for the model\n", "\n", "table1.drop(['Facility Name'], axis=1, inplace = True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Move the \"Avg Spending Per Episode Hospital\" column to the beginning, since the\n", "# algorithm requires the prediction column at the beginning\n", "\n", "col_name='Avg Spending Per Episode Hospital'\n", "first_col = table1.pop(col_name)\n", "table1.insert(0, col_name, first_col)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert integer values to float in the columns \"Avg Spending Per Episode Hospital\", \n", "# \"Avg Spending Per Episode State\" and \"Avg Spending Per Episode Nation\"\n", "# Columns with integer values are interpreted as categorical values. Changing to float avoids any mis-interpretetaion\n", "\n", "table1['Avg Spending Per Episode Hospital'] = table1['Avg Spending Per Episode Hospital'].astype('float')\n", "table1['Avg Spending Per Episode State'] = table1['Avg Spending Per Episode State'].astype('float')\n", "table1['Avg Spending Per Episode Nation'] = table1['Avg Spending Per Episode Nation'].astype('float')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rename long column names for costs and percentage costs on the hospital, state and nation,\n", "# so they are easily referenced in the rest of this discussion\n", "\n", "table1.rename(columns={'Avg Spending Per Episode Hospital':'Avg_Hosp',\n", " 'Avg Spending Per Episode State':'Avg_State',\n", " 'Avg Spending Per Episode Nation':'Avg_Nation',\n", " 'Percent of Spending Hospital':'Percent_Hosp',\n", " 'Percent of Spending State':'Percent_State',\n", " 'Percent of Spending Nation':'Percent_Nation'}, \n", " inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert Start Date and End Date to datetime objects, then convert them to integers. First the data is converted\n", "# to Pandas datetime object. Then the year, month and days are extracted from the datetime object and \n", "# multipled with some weights to convert into final integer values.\n", "\n", "table1['Start Date'] = pd.to_datetime(table1['Start Date'])\n", "table1['End Date'] = pd.to_datetime(table1['End Date'])\n", "table1['Start Date'] = 1000*table1['Start Date'].dt.year + 100*table1['Start Date'].dt.month + table1['Start Date'].dt.day\n", "table1['End Date'] = 1000*table1['End Date'].dt.year + 100*table1['End Date'].dt.month + table1['End Date'].dt.day" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# See the first 5 rows in the dataframe to see how the changed data looks\n", "\n", "table1.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop Columns \"Start Date\" and \"End Date\". The dataset is only for 2018, hence all start and end dates\n", "# are same in each row and does not impact the model\n", "\n", "table1.drop(['Start Date'], axis=1, inplace = True)\n", "table1.drop(['End Date'], axis=1, inplace = True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make sure the table do not have missing values. The following code line shows there are no missing values\n", "# in the table\n", "\n", "table1.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = table1.sample(frac=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fraction_train = .85\n", "test_row = round(df.shape[0] * fraction_train)\n", "test_set = df.iloc[test_row:]\n", "train_set = df.iloc[:test_row]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_train_file = 'train_set.csv'\n", "\n", "train_set.to_csv(local_train_file, index=False, header=True)\n", "test_set.to_csv('test_set.csv', index=False, header=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# optionally run some of your own plots here to analyze the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SageMaker Autopilot\n", "Next, let's run this dataset on SageMaker Autopilot! " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker import AutoML\n", "from time import gmtime, strftime, sleep\n", "import numpy as np\n", "import sagemaker\n", "\n", "sess = sagemaker.Session()\n", "\n", "role = sagemaker.get_execution_role()\n", "\n", "timestamp_suffix = strftime('%d-%H-%M-%S', gmtime())\n", "base_job_name = 'cost-prediction-' + timestamp_suffix\n", "\n", "target_attribute_name = 'Avg_Hosp'\n", "target_attribute_values = np.unique(train_set[target_attribute_name])\n", "target_attribute_true_value = target_attribute_values[1] # 'True.'\n", "\n", "automl = AutoML(role=role,\n", " target_attribute_name=target_attribute_name,\n", " base_job_name=base_job_name,\n", " sagemaker_session=sess,\n", " max_candidates=20,\n", " problem_type = 'Regression',\n", " job_objective = {'MetricName':'MSE'})\n", "\n", "automl.fit(local_train_file, job_name=base_job_name, wait=True, logs=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you run this cell, open up the Experiments tab on SageMaker Studio, right click on your new `cost-prediction` job, and view the AutoML job details! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](../../Images/Autopilot.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the state of the job has moved into `Feature Engineering`, you should be able to open the data exploration notebook, in addition to the candidate generation notebook. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Spend some time stepping through these notebooks. You can also download the data transformation code base. Remember, all of this was generated for your specific dataset!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Analyze Autopilot Modeling Performance\n", "Your AutoML job will take some time to complete. Feel free to use that time to step through the generated notebooks and learn about all the feature engineering strategies they are using! \n", "\n", "Once your job has finished, it's time to analyze that performance. Luckily for us we can simply deploy that entire artifact onto an endpoint, using the same `model.deploy()` that we saw earlier. Let's do that here.\n", "\n", "We'll attach the name of your job to an AutoML estimator, so please make sure to paste in the name of your job below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "from sagemaker import AutoML\n", "import sagemaker\n", "import numpy as np\n", "\n", "sess = sagemaker.Session()\n", "\n", "# if you needed to restart you kernel, you can attach your AutoML job here\n", "automl_job_name = 'COST-PREDICTION-28-02-12-32' #<== REPLACE THIS WITH YOUR OWN AUTOML JOB NAME\n", "automl = AutoML.attach(automl_job_name, sagemaker_session=sess)\n", "\n", "ep_name = 'automl-endpoint-' + datetime.now().strftime('%S')\n", "\n", "inference_response_keys = ['predicted_label', 'probability']\n", "\n", "# Create the inference endpoint\n", "automl.deploy(1, 'ml.m5.xlarge', endpoint_name = ep_name) #inference_response_keys=inference_response_keys)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.predictor import RealTimePredictor\n", "class AutomlEstimator:\n", " def __init__(self, endpoint_name, sagemaker_session):\n", " self.predictor = RealTimePredictor(\n", " endpoint_name=endpoint_name,\n", " sagemaker_session=sagemaker_session,\n", " serializer=sagemaker.serializers.CSVSerializer(),\n", " content_type='text/csv',\n", " accept='text/csv'\n", " )\n", " # Prediction function for regression\n", " def predict(self, x):\n", " response = self.predictor.predict(x)\n", " return np.array([float(x) for x in response.decode('utf-8').split(',')])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make sure this is pointing to the right endpoint name - if you reran that cell above you may have overwitten the variable in memory\n", "automl_estimator = AutomlEstimator(endpoint_name=ep_name, sagemaker_session=sess)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "test_data = pd.read_csv('test_set.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Explain Global and Local Modeling Performance with SHAP\n", "A key question that many stakeholders will have is how your model came to its predictions, both for the entire dataset and for individual predictions. In this lab we'll set up a SHAP model explainer to view feature importances. Feature importances can be understood both in terms of \"local,\" or per-prediction, and \"global,\" or for the entire datset.\n", "\n", "We will actually wrap your model endpoint to provide these." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!conda update -n base -c defaults conda -y" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!conda install -c conda-forge -y shap" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shap\n", "\n", "from shap import KernelExplainer\n", "from shap import sample\n", "from scipy.special import expit\n", "\n", "# Initialize plugin to make plots interactive.\n", "shap.initjs()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_without_target = test_data.drop(columns=['Avg_Hosp'])\n", "\n", "background_data = sample(data_without_target, 50)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Derive link function \n", "problem_type = automl.describe_auto_ml_job(job_name=automl_job_name)['ResolvedAttributes']['ProblemType'] \n", "link = \"identity\" if problem_type == 'Regression' else \"logit\" \n", "\n", "# the handle to predict_proba is passed to KernelExplainerWrapper since KernelSHAP requires the class probability\n", "explainer = KernelExplainer(automl_estimator.predict, background_data, link=link)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Since expected_value is given in the log-odds space we convert it back to probability using expit which is the inverse function to logit\n", "print('expected value =', explainer.expected_value)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile managed_endpoint.py\n", "\n", "import boto3\n", "region = boto3.Session().region_name\n", "\n", "sm = boto3.Session().client(service_name='sagemaker',region_name=region)\n", "\n", "class ManagedEndpoint:\n", " def __init__(self, ep_name, auto_delete=False):\n", " self.name = ep_name\n", " self.auto_delete = auto_delete\n", " \n", " def __enter__(self):\n", " endpoint_description = sm.describe_endpoint(EndpointName=self.name)\n", " if endpoint_description['EndpointStatus'] == 'InService':\n", " self.in_service = True \n", "\n", " def __exit__(self, type, value, traceback):\n", " if self.in_service and self.auto_delete:\n", " print(\"Deleting the endpoint: {}\".format(self.name)) \n", " sm.delete_endpoint(EndpointName=self.name)\n", " sm.get_waiter('endpoint_deleted').wait(EndpointName=self.name)\n", " self.in_service = False" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the first sample\n", "x = data_without_target.iloc[0:1]\n", "\n", "# ManagedEndpoint can optionally auto delete the endpoint after calculating the SHAP values. To enable auto delete, use ManagedEndpoint(ep_name, auto_delete=True)\n", "from managed_endpoint import ManagedEndpoint\n", "with ManagedEndpoint(ep_name) as mep:\n", " shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='aic')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualize SHAP Values\n", "Now, let's see which features are more strongly influencing the predictions from our model!\n", "\n", "![](images/shap_1.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Since shap_values are provided in the log-odds space, we convert them back to the probability space by using LogitLink\n", "shap.force_plot(explainer.expected_value, shap_values, x, link=link)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/shap_2.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with ManagedEndpoint(ep_name) as mep:\n", " shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='num_features(5)')\n", "shap.force_plot(explainer.expected_value, shap_values, x, link=link)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sample 50 random samples\n", "X = sample(data_without_target, 50)\n", "\n", "# Calculate SHAP values for these samples, and delete the endpoint\n", "with ManagedEndpoint(ep_name, auto_delete=True) as mep:\n", " shap_values = explainer.shap_values(X, nsamples='auto', l1_reg='aic')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/shap_3.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shap.force_plot(explainer.expected_value, shap_values, X, link=link)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/shap_4.png)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shap.summary_plot(shap_values, X, plot_type=\"bar\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# Optional - Extend Autopilot with your own feature engineering code\n", "If you have extra time after getting to the local inference explanations, why not take a look at bringing your own feature engineering code into SageMaker Autopilot? Remember that this notebook started with ~10 basic ETL steps in Python to convert the raw Medicare data into something our models could even start to loook at. Look at the following example to see how to port your own ETL scripts into SageMaker Autopilot for custom feature engineering.\n", "\n", "Remember, once you get the entire pipeline deployed onto an endpoint, it means you can send the raw data up to the endpoint, and it will perform both feature engineering and model infereing for you, all in real time!\n", "\n", "- https://github.com/aws/amazon-sagemaker-examples/tree/master/autopilot/custom-feature-selection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }