{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SageMaker Built-In Linear Learner Algorithm - Model for Predicting Medicare Average Hospital Spending:\n",
"In this notebook, we show how to build a Model to predict Medicare spending per beneficiary at a hospital based on state and national average costs using Amazon SageMaker. This notebook contains code instructions for all the steps for building, training and deploying a machine learning model using SageMaker built-in Linear Learner algorithm. In this example, the notebook is created on a \"ml.c4.xlarge\" instance. Use the 'conda_python3' environment for the notebook.\n",
"\n",
"## Learning Objectives:\n",
"This workshop covers both the preprocessing using SageMaker algorithms and other Python libraries. The major learnings are:\n",
"\n",
"1. Load data into SageMaker Notebooks\n",
"2. Perform basic preprocessing including: feature cleaning, normalization and basic feature engineering.\n",
"3. Perform basic feature selection/subsampling.\n",
"4. Perform exploratory data analysis.\n",
"5. Build, train, and deploy Linear Learner regression model.\n",
"\n",
"## Business Problem:\n",
"Medicare is a national health insurance program, administered by the Center for Medicare and Medicaid Services (CMS). This is a primary health insurance for Americans who are aged 65 and older. Medicare has published historical data showing hospital’s average spending for Medicare Part A and Part B claims based on different claim types and claim periods covering 1 to 3 days prior to hospital admission up to 30 days after discharge from hospital admission. These hospital spending are price standardized and non-risk adjusted, since risk adjustment is done at the episode level of the claims spanning the entire period during the episode. The hospital average costs are listed against the corresponding state level average cost and national level average cost.\n",
"\n",
"In this notebook, the data is used to build a machine learning model using Amazon SageMaker built-in Linear Learner algorithm, which predicts average hospital spending cost based on the average state level spending and average national level spending. The predicted cost can be used for purposes of budget and for negotiating pricing with the hospitals. From the hospital’s perspective, the predicted average hospital spending provides visibility to claim financials that can be used by the hospitals to increase their efficiency and level of care.\n",
"\n",
"## Public Dataset Used:\n",
"Medicare has published dataset showing average hospital spending on Medicare Part A and Part B claims. Both the links below refer to the same data set, one is listed in the healthdata.gov site and the other is listed at the data.medicare.gov site. The data dictionary is described in the link marked as #2 below. The dataset has hospital spending data from the year 2018 and has 67,826 data rows spanning across 13 columns. For the purposes of our analysis and machine learning, we use the dataset in csv (Comma Separated Values) format.\n",
"1.\thttps://healthdata.gov/dataset/medicare-hospital-spending-claim\n",
"2.\thttps://data.medicare.gov/Hospital-Compare/Medicare-Hospital-Spending-by-Claim/nrth-mfg3\n",
"\n",
"A direct link to download the dataset to local computer can be accessed at this link - https://data.medicare.gov/api/views/nrth-mfg3/rows.csv?accessType=DOWNLOAD"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import os\n",
"import pandas as pd\n",
"from pandas.plotting import scatter_matrix\n",
"import sagemaker\n",
"from sagemaker import get_execution_role\n",
"from sagemaker import image_uris\n",
"from sagemaker.deserializers import JSONDeserializer\n",
"from sagemaker.s3 import S3Uploader, S3Downloader\n",
"from sagemaker.serializers import CSVSerializer\n",
"import seaborn as sn\n",
"from sklearn.feature_selection import chi2\n",
"from sklearn.feature_selection import SelectKBest\n",
"from sklearn.model_selection import StratifiedShuffleSplit\n",
"from sklearn.preprocessing import scale"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"sagemaker_session = sagemaker.Session()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"S3Downloader.download(\n",
" s3_uri=\"s3://aws-hcls-ml/workshop/immersion_day_workshop_data_DO_NOT_DELETE/data/medicare_data_07_13_2021/Medicare_Hospital_Spending_by_Claim.csv\",\n",
" local_path=\"data\",\n",
" sagemaker_session=sagemaker_session,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Preprocessing on the Raw Dataset:\n",
"In this section we read the raw csv data set into a pandas data frame. We inspect the data using pandas head() function. We do data pre-processing using feature encoding, feature engineering, column renaming, dropping some columns that have no relevance to the prediction of `Avg_Hosp` cost and examining there are no missing values in the data set.\n",
"\n",
"**Note**: Many of these transformations and preprocessing steps are for demonstration purposes only and may not correspond to the optimal transformations for a specific column."
]
},
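{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding and percent-conversion steps described above can be sketched on a toy frame before they are applied to the full dataset. This is a minimal illustration, not the notebook's exact code: the `toy` frame, its values, and the two-entry `state_codes` mapping are assumptions made up for the example, and `Series.map` is used here in place of the `DataFrame.replace` call in the cells that follow.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy rows that mimic two of the raw columns; the values are illustrative only.\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        'State': ['TX', 'AK', 'TX'],\n",
"        'Percent of Spending Hospital': ['0.17%', '0.00%', '12.50%'],\n",
"    }\n",
")\n",
"\n",
"# Hypothetical two-entry subset of the state-to-integer mapping.\n",
"state_codes = {'AK': 1, 'TX': 43}\n",
"\n",
"# Replace each state label with its integer code.\n",
"toy['State'] = toy['State'].map(state_codes)\n",
"\n",
"# Strip the percent sign, convert to float, and rescale to a fraction.\n",
"toy['Percent of Spending Hospital'] = (\n",
"    toy['Percent of Spending Hospital'].str.rstrip('%').astype(float) / 100\n",
")\n",
"\n",
"print(toy)\n",
"```\n",
"\n",
"Unlike `DataFrame.replace`, `Series.map` turns any label missing from the mapping into `NaN`, which makes an incomplete mapping easy to detect with `isna()`."
]
},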
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Facility ID
\n",
"
Facility Name
\n",
"
State
\n",
"
Period
\n",
"
Claim Type
\n",
"
Avg Spending Per Episode Hospital
\n",
"
Avg Spending Per Episode State
\n",
"
Avg Spending Per Episode Nation
\n",
"
Percent of Spending Hospital
\n",
"
Percent of Spending State
\n",
"
Percent of Spending Nation
\n",
"
Start Date
\n",
"
End Date
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
670102
\n",
"
AD HOSPITAL EAST LLC
\n",
"
TX
\n",
"
1 to 3 days Prior to Index Hospital Admission
\n",
"
Hospice
\n",
"
0
\n",
"
1
\n",
"
1
\n",
"
0.00%
\n",
"
0.00%
\n",
"
0.00%
\n",
"
01/01/2018
\n",
"
12/31/2018
\n",
"
\n",
"
\n",
"
1
\n",
"
670102
\n",
"
AD HOSPITAL EAST LLC
\n",
"
TX
\n",
"
1 to 3 days Prior to Index Hospital Admission
\n",
"
Inpatient
\n",
"
0
\n",
"
7
\n",
"
7
\n",
"
0.00%
\n",
"
0.03%
\n",
"
0.03%
\n",
"
01/01/2018
\n",
"
12/31/2018
\n",
"
\n",
"
\n",
"
2
\n",
"
670102
\n",
"
AD HOSPITAL EAST LLC
\n",
"
TX
\n",
"
1 to 3 days Prior to Index Hospital Admission
\n",
"
Outpatient
\n",
"
48
\n",
"
140
\n",
"
152
\n",
"
0.17%
\n",
"
0.60%
\n",
"
0.70%
\n",
"
01/01/2018
\n",
"
12/31/2018
\n",
"
\n",
"
\n",
"
3
\n",
"
670102
\n",
"
AD HOSPITAL EAST LLC
\n",
"
TX
\n",
"
1 to 3 days Prior to Index Hospital Admission
\n",
"
Skilled Nursing Facility
\n",
"
0
\n",
"
3
\n",
"
2
\n",
"
0.00%
\n",
"
0.01%
\n",
"
0.01%
\n",
"
01/01/2018
\n",
"
12/31/2018
\n",
"
\n",
"
\n",
"
4
\n",
"
670102
\n",
"
AD HOSPITAL EAST LLC
\n",
"
TX
\n",
"
1 to 3 days Prior to Index Hospital Admission
\n",
"
Durable Medical Equipment
\n",
"
0
\n",
"
8
\n",
"
8
\n",
"
0.00%
\n",
"
0.03%
\n",
"
0.04%
\n",
"
01/01/2018
\n",
"
12/31/2018
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Facility ID Facility Name State \\\n",
"0 670102 AD HOSPITAL EAST LLC TX \n",
"1 670102 AD HOSPITAL EAST LLC TX \n",
"2 670102 AD HOSPITAL EAST LLC TX \n",
"3 670102 AD HOSPITAL EAST LLC TX \n",
"4 670102 AD HOSPITAL EAST LLC TX \n",
"\n",
" Period Claim Type \\\n",
"0 1 to 3 days Prior to Index Hospital Admission Hospice \n",
"1 1 to 3 days Prior to Index Hospital Admission Inpatient \n",
"2 1 to 3 days Prior to Index Hospital Admission Outpatient \n",
"3 1 to 3 days Prior to Index Hospital Admission Skilled Nursing Facility \n",
"4 1 to 3 days Prior to Index Hospital Admission Durable Medical Equipment \n",
"\n",
" Avg Spending Per Episode Hospital Avg Spending Per Episode State \\\n",
"0 0 1 \n",
"1 0 7 \n",
"2 48 140 \n",
"3 0 3 \n",
"4 0 8 \n",
"\n",
" Avg Spending Per Episode Nation Percent of Spending Hospital \\\n",
"0 1 0.00% \n",
"1 7 0.00% \n",
"2 152 0.17% \n",
"3 2 0.00% \n",
"4 8 0.00% \n",
"\n",
" Percent of Spending State Percent of Spending Nation Start Date End Date \n",
"0 0.00% 0.00% 01/01/2018 12/31/2018 \n",
"1 0.03% 0.03% 01/01/2018 12/31/2018 \n",
"2 0.60% 0.70% 01/01/2018 12/31/2018 \n",
"3 0.01% 0.01% 01/01/2018 12/31/2018 \n",
"4 0.03% 0.04% 01/01/2018 12/31/2018 "
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read the CSV file into panda dataframe and save it to another table so we can keep a copy of the original dataset\n",
"# In our example we use the dataframe called table1 for all pre-processing, while the dataframe table\n",
"# maintains a copy of the original data\n",
"\n",
"table = pd.read_csv(\"data/Medicare_Hospital_Spending_by_Claim.csv\")\n",
"table1 = table.copy()\n",
"table1.head()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"# Encode column \"State\"\n",
"\n",
"replace_map = {\n",
" \"State\": {\n",
" \"AK\": 1,\n",
" \"AL\": 2,\n",
" \"AR\": 3,\n",
" \"AZ\": 4,\n",
" \"CA\": 5,\n",
" \"CO\": 6,\n",
" \"CT\": 7,\n",
" \"DC\": 8,\n",
" \"DE\": 9,\n",
" \"FL\": 10,\n",
" \"GA\": 11,\n",
" \"HI\": 12,\n",
" \"IA\": 13,\n",
" \"ID\": 14,\n",
" \"IL\": 15,\n",
" \"IN\": 16,\n",
" \"KS\": 17,\n",
" \"KY\": 18,\n",
" \"LA\": 19,\n",
" \"MA\": 20,\n",
" \"ME\": 21,\n",
" \"MI\": 22,\n",
" \"MN\": 23,\n",
" \"MO\": 24,\n",
" \"MS\": 25,\n",
" \"MT\": 26,\n",
" \"NC\": 27,\n",
" \"ND\": 28,\n",
" \"NE\": 29,\n",
" \"NH\": 30,\n",
" \"NJ\": 31,\n",
" \"NM\": 32,\n",
" \"NV\": 33,\n",
" \"NY\": 34,\n",
" \"OH\": 35,\n",
" \"OK\": 36,\n",
" \"OR\": 37,\n",
" \"PA\": 38,\n",
" \"RI\": 39,\n",
" \"SC\": 40,\n",
" \"SD\": 41,\n",
" \"TN\": 42,\n",
" \"TX\": 43,\n",
" \"UT\": 44,\n",
" \"VA\": 45,\n",
" \"VT\": 46,\n",
" \"WA\": 47,\n",
" \"WI\": 48,\n",
" \"WV\": 49,\n",
" \"WY\": 50,\n",
" }\n",
"}\n",
"table1.replace(replace_map, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# Encode column \"Period\"\n",
"\n",
"replace_map = {\n",
" \"Period\": {\n",
" \"1 to 3 days Prior to Index Hospital Admission\": 1,\n",
" \"During Index Hospital Admission\": 2,\n",
" \"1 through 30 days After Discharge from Index Hospital Admission\": 3,\n",
" \"Complete Episode\": 4,\n",
" }\n",
"}\n",
"table1.replace(replace_map, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"# Encode column \"Claim Type\"\n",
"\n",
"replace_map = {\n",
" \"Claim Type\": {\n",
" \"Home Health Agency\": 1,\n",
" \"Hospice\": 2,\n",
" \"Inpatient\": 3,\n",
" \"Outpatient\": 4,\n",
" \"Skilled Nursing Facility\": 5,\n",
" \"Durable Medical Equipment\": 6,\n",
" \"Carrier\": 7,\n",
" \"Total\": 8,\n",
" }\n",
"}\n",
"table1.replace(replace_map, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"# Convert the column \"Percent of Spending Hospital\tPercent of Spending\" to float, remove the percent sign and\n",
"# divide by 100 to normalize for percentage\n",
"\n",
"table1[\"Percent of Spending Hospital\"] = (\n",
" table1[\"Percent of Spending Hospital\"].str.rstrip(\"%\").astype(\"float\")\n",
")\n",
"table1[\"Percent of Spending Hospital\"] = table1[\"Percent of Spending Hospital\"] / 100"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [],
"source": [
"# Convert the column \"Percent of Spending State\" to float, remove the percent sign and\n",
"# divide by 100 to normalize for percentage\n",
"\n",
"table1[\"Percent of Spending State\"] = (\n",
" table1[\"Percent of Spending State\"].str.rstrip(\"%\").astype(\"float\")\n",
")\n",
"table1[\"Percent of Spending State\"] = table1[\"Percent of Spending State\"] / 100"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"# Convert the column \"Percent of Spending Nation\" to float, remove the percent sign and\n",
"# divide by 100 to normalize for percentage\n",
"\n",
"table1[\"Percent of Spending Nation\"] = (\n",
" table1[\"Percent of Spending Nation\"].str.rstrip(\"%\").astype(\"float\")\n",
")\n",
"table1[\"Percent of Spending Nation\"] = table1[\"Percent of Spending Nation\"] / 100"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"# Drop Column \"Facility Name\", Facility Id related to the facility, hence facility name is not\n",
"# relevant for the model\n",
"\n",
"table1.drop([\"Facility Name\"], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# Move the \"Avg Spending Per Episode Hospital\" column to the beginning, since the\n",
"# algorithm requires the prediction column at the beginning\n",
"\n",
"col_name = \"Avg Spending Per Episode Hospital\"\n",
"first_col = table1.pop(col_name)\n",
"table1.insert(0, col_name, first_col)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"# Convert integer values to float in the columns \"Avg Spending Per Episode Hospital\",\n",
"# \"Avg Spending Per Episode State\" and \"Avg Spending Per Episode Nation\"\n",
"# Columns with integer values are interpreted as categorical values. Changing to float avoids any mis-interpretetaion\n",
"\n",
"table1[\"Avg Spending Per Episode Hospital\"] = table1[\n",
" \"Avg Spending Per Episode Hospital\"\n",
"].astype(\"float\")\n",
"table1[\"Avg Spending Per Episode State\"] = table1[\n",
" \"Avg Spending Per Episode State\"\n",
"].astype(\"float\")\n",
"table1[\"Avg Spending Per Episode Nation\"] = table1[\n",
" \"Avg Spending Per Episode Nation\"\n",
"].astype(\"float\")"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"# Rename long column names for costs and percentage costs on the hospital, state and nation,\n",
"# so they are easily referenced in the rest of this discussion\n",
"\n",
"table1.rename(\n",
" columns={\n",
" \"Avg Spending Per Episode Hospital\": \"Avg_Hosp\",\n",
" \"Avg Spending Per Episode State\": \"Avg_State\",\n",
" \"Avg Spending Per Episode Nation\": \"Avg_Nation\",\n",
" \"Percent of Spending Hospital\": \"Percent_Hosp\",\n",
" \"Percent of Spending State\": \"Percent_State\",\n",
" \"Percent of Spending Nation\": \"Percent_Nation\",\n",
" },\n",
" inplace=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"# Convert Start Date and End Date to datetime objects, then convert them to integers. First the data is converted\n",
"# to Pandas datetime object. Then the year, month and days are extracted from the datetime object and\n",
"# multipled with some weights to convert into final integer values.\n",
"\n",
"table1[\"Start Date\"] = pd.to_datetime(table1[\"Start Date\"])\n",
"table1[\"End Date\"] = pd.to_datetime(table1[\"End Date\"])\n",
"table1[\"Start Date\"] = (\n",
" 1000 * table1[\"Start Date\"].dt.year\n",
" + 100 * table1[\"Start Date\"].dt.month\n",
" + table1[\"Start Date\"].dt.day\n",
")\n",
"table1[\"End Date\"] = (\n",
" 1000 * table1[\"End Date\"].dt.year\n",
" + 100 * table1[\"End Date\"].dt.month\n",
" + table1[\"End Date\"].dt.day\n",
")"
]
},
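{
"cell_type": "markdown",
"metadata": {},
"source": [
"The date encoding in the previous cell weights the year by 1000, so the month digits spill into the year digits: 12/31/2018 is encoded as 2019231, as the output of the next cell shows. The sketch below is one possible alternative, not part of the original pipeline; it assumes the dates use the MM/DD/YYYY format seen in the raw data and weights the year by 10000 to obtain YYYYMMDD integers. The `dates` and `as_int` names are made up for the example.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Two sample dates in the MM/DD/YYYY format used by the raw dataset.\n",
"dates = pd.to_datetime(pd.Series(['01/01/2018', '12/31/2018']), format='%m/%d/%Y')\n",
"\n",
"# Weighting the year by 10000 keeps year, month and day in separate digit positions.\n",
"as_int = 10000 * dates.dt.year + 100 * dates.dt.month + dates.dt.day\n",
"\n",
"print(as_int.tolist())\n",
"```\n",
"\n",
"Because both date columns are dropped later in the notebook, the choice of weights does not change the trained model."
]
},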
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Avg_Hosp
\n",
"
Facility ID
\n",
"
State
\n",
"
Period
\n",
"
Claim Type
\n",
"
Avg_State
\n",
"
Avg_Nation
\n",
"
Percent_Hosp
\n",
"
Percent_State
\n",
"
Percent_Nation
\n",
"
Start Date
\n",
"
End Date
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0.0
\n",
"
670102
\n",
"
43
\n",
"
1
\n",
"
2
\n",
"
1.0
\n",
"
1.0
\n",
"
0.0000
\n",
"
0.0000
\n",
"
0.0000
\n",
"
2018101
\n",
"
2019231
\n",
"
\n",
"
\n",
"
1
\n",
"
0.0
\n",
"
670102
\n",
"
43
\n",
"
1
\n",
"
3
\n",
"
7.0
\n",
"
7.0
\n",
"
0.0000
\n",
"
0.0003
\n",
"
0.0003
\n",
"
2018101
\n",
"
2019231
\n",
"
\n",
"
\n",
"
2
\n",
"
48.0
\n",
"
670102
\n",
"
43
\n",
"
1
\n",
"
4
\n",
"
140.0
\n",
"
152.0
\n",
"
0.0017
\n",
"
0.0060
\n",
"
0.0070
\n",
"
2018101
\n",
"
2019231
\n",
"
\n",
"
\n",
"
3
\n",
"
0.0
\n",
"
670102
\n",
"
43
\n",
"
1
\n",
"
5
\n",
"
3.0
\n",
"
2.0
\n",
"
0.0000
\n",
"
0.0001
\n",
"
0.0001
\n",
"
2018101
\n",
"
2019231
\n",
"
\n",
"
\n",
"
4
\n",
"
0.0
\n",
"
670102
\n",
"
43
\n",
"
1
\n",
"
6
\n",
"
8.0
\n",
"
8.0
\n",
"
0.0000
\n",
"
0.0003
\n",
"
0.0004
\n",
"
2018101
\n",
"
2019231
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Avg_Hosp Facility ID State Period Claim Type Avg_State Avg_Nation \\\n",
"0 0.0 670102 43 1 2 1.0 1.0 \n",
"1 0.0 670102 43 1 3 7.0 7.0 \n",
"2 48.0 670102 43 1 4 140.0 152.0 \n",
"3 0.0 670102 43 1 5 3.0 2.0 \n",
"4 0.0 670102 43 1 6 8.0 8.0 \n",
"\n",
" Percent_Hosp Percent_State Percent_Nation Start Date End Date \n",
"0 0.0000 0.0000 0.0000 2018101 2019231 \n",
"1 0.0000 0.0003 0.0003 2018101 2019231 \n",
"2 0.0017 0.0060 0.0070 2018101 2019231 \n",
"3 0.0000 0.0001 0.0001 2018101 2019231 \n",
"4 0.0000 0.0003 0.0004 2018101 2019231 "
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# See the first 5 rows in the dataframe to see how the changed data looks\n",
"\n",
"table1.head()"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"# Drop Columns \"Start Date\" and \"End Date\". The dataset is only for 2018, hence all start and end dates\n",
"# are same in each row and does not impact the model\n",
"\n",
"table1.drop([\"Start Date\"], axis=1, inplace=True)\n",
"table1.drop([\"End Date\"], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Avg_Hosp 0\n",
"Facility ID 0\n",
"State 0\n",
"Period 0\n",
"Claim Type 0\n",
"Avg_State 0\n",
"Avg_Nation 0\n",
"Percent_Hosp 0\n",
"Percent_State 0\n",
"Percent_Nation 0\n",
"dtype: int64"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Make sure the table do not have missing values. The following code line shows there are no missing values\n",
"# in the table\n",
"\n",
"table1.isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploratory Data Analysis (EDA):\n",
"In this section, we perform **Exploratory Data Analysis** of the data set and use various techniques for feature selection\n",
"\n",
"First, we see the scatter_matrix plot of the feature variables in the data frame as they relate to the prediction variable `Avg_Hosp` cost. For this we use the scatter_matrix function from pandas.plotting library.\n",
"\n",
"The entire dataset has 67826 data rows. For analysis, we take a random sample of 400 data rows for the scatter_matrix. Before selecting the 400 random data rows, we use the scale function from sklearn.preprocessing library to appropriately scale the values of the data columns. This helps the scatter_matrix plot label decorations fit properly. This plots helps in determining if we should keep all the feature columns while training the model.\n",
"\n",
"Next we use the SelectKBest class and chi2 statistical test available from sklearn.feature_selection library to find the scores of feature columns as they relate to the prediction column. This is another mechanism to determine which feature columns are relevant to keep in the model\n",
"\n",
"Following that, we create and visualize the correlation matrix. This is another mechanism towards feature selection prior to training the model."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [],
"source": [
"# After selecting the random sample of 400 data rows for the scatter_matrix analysis, this step\n",
"# uses the scale function from sklearn.preprocessing library to scale the values. A new pandas data frame is created\n",
"# that holds the sampled 400 data rows. We want to keep the original data set intact so we can use the original\n",
"# data set for the subsequent training of the model\n",
"\n",
"table1_sample = table1.sample(n=400, random_state=2)\n",
"standardised_table1_sample = scale(table1_sample)\n",
"\n",
"standardised_table1_sample = pd.DataFrame(\n",
" standardised_table1_sample, index=table1_sample.index, columns=table1_sample.columns\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"# The scatter_matrix is plotted using a tight layout for ease of visualization within the notebook. Hence,\n",
"# in the scaled randomized sample of 400 data rows, the column names are renamed to shorter column names.\n",
"\n",
"standardised_table1_sample.rename(columns={\"Avg_Hosp\": \"A_Ho\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Facility ID\": \"F_Id\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"State\": \"ST\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Period\": \"Per\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Claim Type\": \"Clm\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Avg_State\": \"A_ST\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Avg_Nation\": \"A_Na\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Percent_Hosp\": \"P_Ho\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Percent_State\": \"P_ST\"}, inplace=True)\n",
"standardised_table1_sample.rename(columns={\"Percent_Nation\": \"P_Na\"}, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# In this step, the scatter matrix is plotted between the prediction column \"Avg_Hosp\" whose shortened name is \"A_Ho\"\n",
"# and each of the other remaining feature columns. For clarity of visualization, we create two scatter_matrix\n",
"# plots. The first one showing Avg_Hosp with the columns \"Facility Id\", \"State\", \"Period\" and \"Claim Type\".\n",
"# The second plot shows Avg_Hosp relation with the columns \"Avg_State\", \"Avg_Nation\", \"Percent_Hosp\",\n",
"# \"Percent_State\" and \"Percent_Nation\"\n",
"\n",
"%matplotlib inline\n",
"\n",
"plt.figure(figsize=(11, 8))\n",
"pd.plotting.scatter_matrix(\n",
" standardised_table1_sample.loc[:, \"A_Ho\":\"Clm\"], diagonal=\"kde\"\n",
")\n",
"plt.tight_layout()\n",
"plt.show()\n",
"\n",
"pd.plotting.scatter_matrix(\n",
" standardised_table1_sample.loc[:, [\"A_Ho\", \"A_ST\", \"A_Na\", \"P_Ho\", \"P_ST\", \"P_Na\"]],\n",
" diagonal=\"kde\",\n",
")\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Features Score\n",
"0 Facility ID 9.760495e+08\n",
"4 Avg_State 7.996024e+08\n",
"5 Avg_Nation 7.950498e+08\n",
"1 State 8.320571e+04\n",
"6 Percent_Hosp 3.732397e+04\n",
"7 Percent_State 3.685061e+04\n",
"8 Percent_Nation 3.673161e+04\n",
"3 Claim Type 2.672140e+04\n",
"2 Period 1.624859e+04\n"
]
}
],
"source": [
"# In this step, we calculate the statistical scores of the feature columns as it relates to the prediction column\n",
"# Avg_Hosp using the SelectKBest library function based on the chi2 statistical test. The scores are displayed in a\n",
"# tabluar format for visualization. The X data frame has all the feature columns. The y data frame has the\n",
"# prediction column.\n",
"\n",
"X = table1.iloc[:, 1:10]\n",
"y = table1.iloc[:, 0:1]\n",
"\n",
"# We are selecting all the feature columns to see the scores for each feature column\n",
"selected = SelectKBest(score_func=chi2, k=9)\n",
"fit = selected.fit(X, y)\n",
"datascores = pd.DataFrame(fit.scores_)\n",
"datacolumns = pd.DataFrame(X.columns)\n",
"\n",
"# concat two dataframes for better visualization\n",
"featureScores = pd.concat([datacolumns, datascores], axis=1)\n",
"featureScores.columns = [\"Features\", \"Score\"] # naming the dataframe columns\n",
"print(featureScores.nlargest(9, \"Score\")) # print 9 best features"
]
},
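{
"cell_type": "markdown",
"metadata": {},
"source": [
"scikit-learn documents `chi2` as a test for classification tasks with non-negative features, so the scores above are best read as a demonstration. For a continuous target such as `Avg_Hosp`, a univariate F-test is one possible alternative. The sketch below swaps in `f_regression` from `sklearn.feature_selection`; it is not the method used in this notebook, and the synthetic `X_toy` and `y_toy` arrays are assumptions made up for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.feature_selection import f_regression\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Synthetic data: the target depends strongly on the first feature only.\n",
"X_toy = rng.normal(size=(200, 3))\n",
"y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)\n",
"\n",
"# f_regression returns one F-statistic and one p-value per feature.\n",
"f_scores, p_values = f_regression(X_toy, y_toy)\n",
"\n",
"# Index of the feature with the largest F-statistic.\n",
"best = int(np.argmax(f_scores))\n",
"print(best, f_scores)\n",
"```\n",
"\n",
"The same function can be passed as `score_func` to `SelectKBest` in place of `chi2`."
]
},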
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Avg_Hosp Facility ID State Period \\\n",
"Avg_Hosp 1.000000 -8.863661e-06 -2.051140e-04 4.912669e-01 \n",
"Facility ID -0.000009 1.000000e+00 9.821420e-01 -7.250570e-16 \n",
"State -0.000205 9.821420e-01 1.000000e+00 -7.941509e-16 \n",
"Period 0.491267 -7.250570e-16 -7.941509e-16 1.000000e+00 \n",
"Claim Type 0.329614 1.675355e-14 1.650797e-14 1.815683e-01 \n",
"Avg_State 0.978819 8.851031e-04 6.733522e-04 4.994380e-01 \n",
"Avg_Nation 0.976734 -1.315365e-15 -1.331209e-15 5.009157e-01 \n",
"Percent_Hosp 0.982630 1.653416e-07 1.680870e-07 5.003226e-01 \n",
"Percent_State 0.977278 1.628740e-06 1.705365e-06 4.995657e-01 \n",
"Percent_Nation 0.976734 1.502742e-15 1.561542e-15 5.009206e-01 \n",
"\n",
" Claim Type Avg_State Avg_Nation Percent_Hosp \\\n",
"Avg_Hosp 3.296143e-01 0.978819 9.767342e-01 9.826304e-01 \n",
"Facility ID 1.675355e-14 0.000885 -1.315365e-15 1.653416e-07 \n",
"State 1.650797e-14 0.000673 -1.331209e-15 1.680870e-07 \n",
"Period 1.815683e-01 0.499438 5.009157e-01 5.003226e-01 \n",
"Claim Type 1.000000e+00 0.337399 3.388598e-01 3.351493e-01 \n",
"Avg_State 3.373986e-01 1.000000 9.984595e-01 9.939270e-01 \n",
"Avg_Nation 3.388598e-01 0.998460 1.000000e+00 9.939876e-01 \n",
"Percent_Hosp 3.351493e-01 0.993927 9.939876e-01 1.000000e+00 \n",
"Percent_State 3.375192e-01 0.999169 9.991468e-01 9.946930e-01 \n",
"Percent_Nation 3.388660e-01 0.998460 1.000000e+00 9.939871e-01 \n",
"\n",
" Percent_State Percent_Nation \n",
"Avg_Hosp 0.977278 9.767338e-01 \n",
"Facility ID 0.000002 1.502742e-15 \n",
"State 0.000002 1.561542e-15 \n",
"Period 0.499566 5.009206e-01 \n",
"Claim Type 0.337519 3.388660e-01 \n",
"Avg_State 0.999169 9.984596e-01 \n",
"Avg_Nation 0.999147 1.000000e+00 \n",
"Percent_Hosp 0.994693 9.939871e-01 \n",
"Percent_State 1.000000 9.991469e-01 \n",
"Percent_Nation 0.999147 1.000000e+00 \n"
]
}
],
"source": [
"# Calculate Correlation Matrix to see how the data is related\n",
"\n",
"corrMatrix = table1.corr()\n",
"print(corrMatrix)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Visualize Correlation matrix with Searborn and Matplotlib\n",
"plt.subplots(figsize=(15, 10))\n",
"plt.tick_params(labelsize=14)\n",
"sn.heatmap(corrMatrix, annot=True, annot_kws={\"size\": 12}, fmt=\".2f\", robust=True)\n",
"plt.xlabel(\"Parameters\", fontsize=25)\n",
"plt.ylabel(\"Parameters\", fontsize=25)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Avg_Hosp 1.000000\n",
"Percent_Hosp 0.982630\n",
"Avg_State 0.978819\n",
"Percent_State 0.977278\n",
"Avg_Nation 0.976734\n",
"Percent_Nation 0.976734\n",
"Period 0.491267\n",
"Claim Type 0.329614\n",
"Facility ID -0.000009\n",
"State -0.000205\n",
"Name: Avg_Hosp, dtype: float64"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Our target prediction is the column Avg_Hosp, we want to see\n",
"# how this value is correlated with the other feature columns\n",
"# From the below matrix, we see that the prediction column Avg_Hosp has the highest correlation\n",
"# with the Avg_State feature column\n",
"\n",
"corrMatrix[\"Avg_Hosp\"].sort_values(ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating Train, Validation and Test Datasets:\n",
"In the last step of the previous section, we see that the prediction column `Avg_Hosp` cost has second highest correlation with the `Avg_State` feature column. In this section we examine the spread of values for the `Avg_State`. Then we use a technique called stratification to categorize each data row based on the category of the `Avg_State` cost. We did not use the column `Percent_Hosp` (highest correlation) for stratification because the `Percent_Hosp` column is derived from the `Avg_Hosp` value and the line which contains the Total Cost per Episode for a specific hospital. There is an obvious expected high degree of correlation between `Avg_Hosp` and `Percent_Hosp`. Using `Percent_Hosp` to stratify values will not give us a good representative sample for train, validation and test datasets.\n",
"\n",
"From the defined categories of the `Avg_State` values, we use the StratifiedShuffleSplit function from the Scikit-Learn library to split and randomly select data subsets for the training, validation and test data sets. We use this method two times. First we split the original data into train and test. Then we split the test set again into validation and test set. This stratification technique allows us to have good representation of data in each of the train, validation and test sets that are well spread across the `Avg_State` value categories. Then we move the prediction column `Avg_Hosp` to the first column in each of the train, validation and test data sets. The data sets are uploaded into S3 bucket location during the machine learning model creation process."
]
},
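{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two-stage stratified split described above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's exact code: the `toy` frame, its `category` column with four equally sized classes, and the variable names are made up for the example, while the split fractions (20% held out, then 10% of the held-out rows for the final test set) follow the cells below.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import StratifiedShuffleSplit\n",
"\n",
"# Toy frame: 1000 rows with a hypothetical stratification column of 4 categories.\n",
"rng = np.random.default_rng(0)\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        'value': rng.normal(size=1000),\n",
"        'category': np.repeat([0, 1, 2, 3], 250),\n",
"    }\n",
")\n",
"\n",
"# Stage 1: 80% train, 20% held out, preserving the category proportions.\n",
"stage1 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)\n",
"train_idx, held_idx = next(stage1.split(toy, toy['category']))\n",
"train, held_out = toy.iloc[train_idx], toy.iloc[held_idx]\n",
"\n",
"# Stage 2: split the held-out rows again, 90% validation and 10% test.\n",
"stage2 = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)\n",
"val_idx, test_idx = next(stage2.split(held_out, held_out['category']))\n",
"validation, test = held_out.iloc[val_idx], held_out.iloc[test_idx]\n",
"\n",
"print(len(train), len(validation), len(test))\n",
"```\n",
"\n",
"`split` yields positional indices, so the sketch selects rows with `iloc`. Selecting with `loc` returns the same rows only when the frame has a default 0..n-1 index, which no longer holds for the held-out frame in the second stage."
]
},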
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ec2-user/anaconda3/envs/python3/lib/python3.8/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).\n",
" warnings.warn(msg, FutureWarning)\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# We see the distribution of the data around the feature column \"Avg_State\" that has the highest\n",
"# correlation to the target prediction \"Avg_Hosp\"\n",
"# From the histogram below we see that most of the values are between 0 and 5000 for Avg_State\n",
"\n",
"sn.distplot(table1.Avg_State)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"# Before we divide this dataset into train, validation and test, we need to stratify the values of Avg_State\n",
"# to ensure we effectively select random sets of data into the train, validation and test data sets\n",
"# which is good representative sample based on the Avg_State values\n",
"# To do this, we introduce a new Feature column column called \"Avg_State_Category\", we make this column\n",
"# equal to the value of the respective Avg_State value divided by 1000 and then using the ceil (ceiling) function\n",
"\n",
"table1[\"Avg_State_Category\"] = np.ceil(table1[\"Avg_State\"] / 1000)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0 14226\n",
"1.0 35069\n",
"2.0 6325\n",
"3.0 3561\n",
"4.0 2007\n",
"5.0 472\n",
"10.0 1060\n",
"11.0 1749\n",
"12.0 274\n",
"20.0 9\n",
"21.0 750\n",
"22.0 1482\n",
"23.0 544\n",
"24.0 298\n",
"Name: Avg_State_Category, dtype: int64"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now see how these categories are distributed by aggregating across all the data rows\n",
"\n",
"table1.Avg_State_Category.value_counts().sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"# From the above distribution, we see approximately 61k out of the total 67k values have category 0.0 to 5.0\n",
"# Only about 6k values of the total (approx 9%) of the values have category greater than 5.0\n",
"# Hence we update the category to 6.0 for all data rows that have category greater than 5.0\n",
"\n",
"table1[\"Avg_State_Category\"].where(table1[\"Avg_State_Category\"] < 5, 6.0, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0 14226\n",
"1.0 35069\n",
"2.0 6325\n",
"3.0 3561\n",
"4.0 2007\n",
"6.0 6638\n",
"Name: Avg_State_Category, dtype: int64"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We see the distribution of the values for the Avg_State_Category across the category values 0.0 to 6.0\n",
"\n",
"table1.Avg_State_Category.value_counts().sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [],
"source": [
"# We use the stratified shuffle split function available within sklearn library to create\n",
"# train, validation and test datasets based on our defined perentages and splitting the data appropriately\n",
"# and randomly across all the established categories with respect to the Avg_State values\n",
"\n",
"# The function splits the data into two parts. First we split into training set which will be 80% of the data\n",
"# and a test set which will be 20% of the data. Then we split this test set into validation set\n",
"# where the validation set will be 90% of the previous test set and the final test set is\n",
"# the remaining 10% of the previous test set. This is a two fold splitting\n",
"\n",
"# First lets create the training set and the temporray set\n",
"split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)\n",
"\n",
"for train_index, test_index in split.split(table1, table1[\"Avg_State_Category\"]):\n",
" strat_train_set = table1.loc[train_index]\n",
" strat_test_set = table1.loc[test_index]"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"# Here we take the test set and split it into the validation set and the test set\n",
"# as mentioned in the prior step\n",
"split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)\n",
"\n",
"for train_index, test_index in split.split(\n",
" strat_test_set, strat_test_set[\"Avg_State_Category\"]\n",
"):\n",
" strat_validation_set = table1.loc[train_index]\n",
" strat_test_set = table1.loc[test_index]"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(67826, 54260, 12209, 1357)"
]
},
"execution_count": 83,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Print total rows of data from the original dataset, training set, validation set and test set\n",
"# To see the numbers match up, add the training, validation and test data set record counts to get the total in the\n",
"# original data set\n",
"\n",
"Total = table1.shape[0]\n",
"Train = strat_train_set.shape[0]\n",
"Validation = strat_validation_set.shape[0]\n",
"Test = strat_test_set.shape[0]\n",
"Total, Train, Validation, Test"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [],
"source": [
"# Drop the column \"Avg_State_category\" from train, validation and test datasets. This column\n",
"# was introduced to do stratification of the data for meaningful sampling and is not relevant\n",
"# for the model anymore\n",
"\n",
"strat_train_set.drop([\"Avg_State_Category\"], axis=1, inplace=True)\n",
"strat_validation_set.drop([\"Avg_State_Category\"], axis=1, inplace=True)\n",
"strat_test_set.drop([\"Avg_State_Category\"], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"# Replace the bucket name with the your bucket name obtained form the CloudFormation output tab\n",
"# From the SageMaker library, we use the get_execution_role function to get the execution\n",
"# role for SageMaker to access AWS reqources while creating the machine learning model\n",
"\n",
"# Define IAM role\n",
"role = get_execution_role()\n",
"bucket = sagemaker_session.default_bucket()\n",
"prefix = \"linear_learner\""
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"# Split each of the Train, Validation and Test datasets from pandas data frame into two subsets.\n",
"# The x subset is the feature columns and the y subset is the label column\n",
"\n",
"x_train = strat_train_set.iloc[:, 1:10]\n",
"y_train = strat_train_set.iloc[:, 0:1]\n",
"\n",
"x_validation = strat_validation_set.iloc[:, 1:10]\n",
"y_validation = strat_validation_set.iloc[:, 0:1]\n",
"\n",
"x_test = strat_test_set.iloc[:, 1:10]\n",
"y_test = strat_test_set.iloc[:, 0:1]\n",
"\n",
"train_df = pd.concat([y_train, x_train], axis=1)\n",
"validation_df = pd.concat([y_validation, x_validation], axis=1)\n",
"test_df = pd.concat([y_test, x_test], axis=1)\n",
"\n",
"# copy the training dataframe to s3\n",
"train_df.to_csv(\"data/train_data.csv\", index=False, header=False)\n",
"validation_df.to_csv(\"data/validation_data.csv\", index=False, header=False)\n",
"test_df.to_csv(\"data/test_data.csv\", index=False, header=False)\n",
"\n",
"train_data_location = f\"s3://{bucket}/{prefix}/data/train\"\n",
"validation_data_location = f\"s3://{bucket}/{prefix}/data/validation\"\n",
"test_data_location = f\"s3://{bucket}/{prefix}/data/test\"\n",
"\n",
"# Upload the training data to S3\n",
"S3Uploader.upload(\n",
" local_path=\"data/train_data.csv\",\n",
" desired_s3_uri=train_data_location,\n",
" sagemaker_session=sagemaker_session,\n",
")\n",
"\n",
"# Upload the validation data to S3\n",
"S3Uploader.upload(\n",
" local_path=\"data/validation_data.csv\",\n",
" desired_s3_uri=validation_data_location,\n",
" sagemaker_session=sagemaker_session,\n",
")\n",
"\n",
"# Upload the testing data to S3\n",
"S3Uploader.upload(\n",
" local_path=\"data/test_data.csv\",\n",
" desired_s3_uri=test_data_location,\n",
" sagemaker_session=sagemaker_session,\n",
")\n",
"\n",
"\n",
"output_location = f\"s3://{bucket}/{prefix}/output\"\n",
"\n",
"train_data_location_input = sagemaker.inputs.TrainingInput(\n",
" train_data_location,\n",
" distribution=\"FullyReplicated\",\n",
" content_type=\"text/csv\",\n",
" s3_data_type=\"S3Prefix\",\n",
" record_wrapping=None,\n",
" compression=None,\n",
")\n",
"\n",
"validation_data_location_input = sagemaker.inputs.TrainingInput(\n",
" validation_data_location,\n",
" distribution=\"FullyReplicated\",\n",
" content_type=\"text/csv\",\n",
" s3_data_type=\"S3Prefix\",\n",
" record_wrapping=None,\n",
" compression=None,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train Machine Learning Model using Amazon SageMaker:\n",
"In this section, the SageMaker built-in Linear Learner algorithm is used to train the model using the training and validation data sets as input channels. The algorithm is used in the “Regressor” mode to train the model. The boto3 Python library for AWS and the SageMaker library for Python is used. In the step below, replace with your own **bucket name** from CloudFormation Outputs tab"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"# In this step, we initialize the Linear Learner Estimator\n",
"# We define the parameters for the estimator which will be used by SageMaker to train the model.\n",
"# In this example, we use an instance of type \"ml.c4.xlarge\". Note that the SageMaker built-in Linear Learner\n",
"# algorithm do not need GPU type instances mandatorily. GPU instances can be selected if the dataset is large\n",
"# and using GPU instances will help boost the performance of the model creation process.\n",
"\n",
"# from sagemaker import LinearLearner\n",
"# from sagemaker.sklearn.estimator import LinearLearner\n",
"\n",
"\n",
"container = image_uris.retrieve(\n",
" region=boto3.Session().region_name, framework=\"linear-learner\"\n",
")\n",
"\n",
"sess = sagemaker.Session()\n",
"\n",
"linear = sagemaker.estimator.Estimator(\n",
" container,\n",
" role,\n",
" instance_count=1,\n",
" instance_type=\"ml.c4.xlarge\",\n",
" output_path=output_location,\n",
" sagemaker_session=sagemaker_session,\n",
" disable_profiler=True,\n",
")\n",
"\n",
"linear.set_hyperparameters(predictor_type=\"regressor\", mini_batch_size=200)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',\n",
" 'S3Uri': 's3://sagemaker-ap-southeast-2-431579215499/linear_learner/data/validation',\n",
" 'S3DataDistributionType': 'FullyReplicated'}},\n",
" 'ContentType': 'text/csv'}"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"validation_data_location_input.config"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',\n",
" 'S3Uri': 's3://sagemaker-ap-southeast-2-431579215499/linear_learner/data/train',\n",
" 'S3DataDistributionType': 'FullyReplicated'}},\n",
" 'ContentType': 'text/csv'}"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data_location_input.config"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"2022-06-06 01:44:54 Starting - Starting the training job.....\n",
"2022-06-06 01:45:20 Starting - Preparing the instances for training.............\n",
"2022-06-06 01:46:33 Downloading - Downloading input data..\n",
"2022-06-06 01:46:49 Training - Downloading the training image.........\n",
"2022-06-06 01:47:39 Training - Training image download completed. Training in progress..............................\n",
"2022-06-06 01:50:10 Uploading - Uploading generated training model.\n",
"2022-06-06 01:50:16 Completed - Training job completed\n"
]
}
],
"source": [
"# In this step, we call the fit function to train the model using the training dataset and the validation dataset\n",
"# llearner.fit([train_data_location,validation_data_location])\n",
"linear.fit(\n",
" inputs={\n",
" \"train\": train_data_location_input,\n",
" \"validation\": validation_data_location_input,\n",
" },\n",
" logs=\"None\",\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------!"
]
}
],
"source": [
"# In this step, we deploy the model created in the previous step as an endpoint. In this example,\n",
"# we use an instance type of \"ml.m4.xlarge\" to deploy the model. Once deployed, the endpoint\n",
"# can be invoked to make inference and predict the value of the \"Avg_Hosp\" cost. Please note deploying the\n",
"# model to an endpoint takes a few minutes\n",
"\n",
"llearner_predictor = linear.deploy(\n",
" initial_instance_count=1, instance_type=\"ml.m4.xlarge\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([5.0107e+04, 5.0000e+00, 2.0000e+00, 4.0000e+00, 0.0000e+00,\n",
" 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00])"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x_test.values[0]"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'linear-learner-2022-06-06-00-56-35-125'"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llearner_predictor.endpoint_name"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Facility ID
\n",
"
State
\n",
"
Period
\n",
"
Claim Type
\n",
"
Avg_State
\n",
"
Avg_Nation
\n",
"
Percent_Hosp
\n",
"
Percent_State
\n",
"
Percent_Nation
\n",
"
\n",
" \n",
" \n",
"
\n",
"
5380
\n",
"
50107
\n",
"
5
\n",
"
2
\n",
"
4
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0000
\n",
"
0.0000
\n",
"
0.0000
\n",
"
\n",
"
\n",
"
2823
\n",
"
30103
\n",
"
4
\n",
"
1
\n",
"
6
\n",
"
8.0
\n",
"
8.0
\n",
"
0.0004
\n",
"
0.0003
\n",
"
0.0004
\n",
"
\n",
"
\n",
"
12310
\n",
"
90001
\n",
"
8
\n",
"
2
\n",
"
2
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0000
\n",
"
0.0000
\n",
"
0.0000
\n",
"
\n",
"
\n",
"
9535
\n",
"
50701
\n",
"
5
\n",
"
1
\n",
"
7
\n",
"
585.0
\n",
"
585.0
\n",
"
0.0312
\n",
"
0.0262
\n",
"
0.0270
\n",
"
\n",
"
\n",
"
10433
\n",
"
50779
\n",
"
5
\n",
"
1
\n",
"
2
\n",
"
0.0
\n",
"
1.0
\n",
"
0.0000
\n",
"
0.0000
\n",
"
0.0000
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
1923
\n",
"
20018
\n",
"
1
\n",
"
2
\n",
"
1
\n",
"
0.0
\n",
"
0.0
\n",
"
0.0000
\n",
"
0.0000
\n",
"
0.0000
\n",
"
\n",
"
\n",
"
12701
\n",
"
100025
\n",
"
10
\n",
"
1
\n",
"
4
\n",
"
72.0
\n",
"
152.0
\n",
"
0.0067
\n",
"
0.0033
\n",
"
0.0070
\n",
"
\n",
"
\n",
"
1579
\n",
"
10139
\n",
"
2
\n",
"
3
\n",
"
2
\n",
"
198.0
\n",
"
149.0
\n",
"
0.0079
\n",
"
0.0095
\n",
"
0.0069
\n",
"
\n",
"
\n",
"
5863
\n",
"
50138
\n",
"
5
\n",
"
2
\n",
"
3
\n",
"
10415.0
\n",
"
10139.0
\n",
"
0.5356
\n",
"
0.4669
\n",
"
0.4684
\n",
"
\n",
"
\n",
"
7136
\n",
"
50296
\n",
"
5
\n",
"
1
\n",
"
6
\n",
"
8.0
\n",
"
8.0
\n",
"
0.0004
\n",
"
0.0003
\n",
"
0.0004
\n",
"
\n",
" \n",
"
\n",
"
1357 rows × 9 columns
\n",
"
"
],
"text/plain": [
" Facility ID State Period Claim Type Avg_State Avg_Nation \\\n",
"5380 50107 5 2 4 0.0 0.0 \n",
"2823 30103 4 1 6 8.0 8.0 \n",
"12310 90001 8 2 2 0.0 0.0 \n",
"9535 50701 5 1 7 585.0 585.0 \n",
"10433 50779 5 1 2 0.0 1.0 \n",
"... ... ... ... ... ... ... \n",
"1923 20018 1 2 1 0.0 0.0 \n",
"12701 100025 10 1 4 72.0 152.0 \n",
"1579 10139 2 3 2 198.0 149.0 \n",
"5863 50138 5 2 3 10415.0 10139.0 \n",
"7136 50296 5 1 6 8.0 8.0 \n",
"\n",
" Percent_Hosp Percent_State Percent_Nation \n",
"5380 0.0000 0.0000 0.0000 \n",
"2823 0.0004 0.0003 0.0004 \n",
"12310 0.0000 0.0000 0.0000 \n",
"9535 0.0312 0.0262 0.0270 \n",
"10433 0.0000 0.0000 0.0000 \n",
"... ... ... ... \n",
"1923 0.0000 0.0000 0.0000 \n",
"12701 0.0067 0.0033 0.0070 \n",
"1579 0.0079 0.0095 0.0069 \n",
"5863 0.5356 0.4669 0.4684 \n",
"7136 0.0004 0.0003 0.0004 \n",
"\n",
"[1357 rows x 9 columns]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x_test"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"# In this step, we invoke the endpoint for a single inference. The first data row from the Test dataset is passed\n",
"# to the endpoint for prediction. The predicted value is returned in the key value pair.\n",
"\n",
"# result = llearner_predictor.predict(x_test.values[0].astype('float32'))\n",
"# print(result)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"llearner_predictor.serializer = CSVSerializer()\n",
"llearner_predictor.deserializer = JSONDeserializer()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'predictions': [{'score': 55.096778869628906}]}"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"llearner_predictor.predict(\n",
" x_test.iloc()[0].tolist(), initial_args={\"ContentType\": \"text/csv\"}\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"all_predictions = [\n",
" llearner_predictor.predict(\n",
" x_test.iloc()[i].tolist(), initial_args={\"ContentType\": \"text/csv\"}\n",
" )\n",
" for i in range(0, x_test.shape[0])\n",
"]\n",
"all_predictions_scores = [\n",
" all_predictions[i][\"predictions\"][0][\"score\"]\n",
" for i in range(0, len(all_predictions))\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"matplotlib.pyplot.scatter(y_test[\"Avg_Hosp\"], all_predictions_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you are finished testing, clean up the endpoint"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llearner_predictor.delete_endpoint()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Summary\n",
"We see how the SageMaker built-in Linear Learner algorithm is used to train machine learning model and use this model for inference. The steps show how Jupyter notebooks in SageMaker can be used for build, train and deployment of machine learning models and evaluation of metrics from the model’s performance. This approach can be used in a wide variety of use cases at scale."
]
}
],
"metadata": {
"instance_type": "ml.m5.4xlarge",
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}