{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Amazon SageMaker Autopilot Data Exploration Report\n",
"\n",
"This report contains insights about the dataset you provided as input to the AutoML job.\n",
"This data report was generated by **automl-health-10-15-38-25** AutoML job.\n",
"To check for any issues with your data and possible improvements that can be made to it,\n",
"consult the sections below for guidance.\n",
"You can use information about the predictive power of each feature in the **Data Sample** section and\n",
"from the correlation matrix in the **Cross Column Statistics** section to help select a subset of the data\n",
"that is most significant for making predictions.\n",
"\n",
"**Note**: SageMaker Autopilot data reports are subject to change and updates.\n",
"It is not recommended to parse the report using automated tools, as they may be impacted by such changes.\n",
"\n",
"## Dataset Summary\n",
"\n",
"
⚠️ High severity insight: “Skewed target” \n",
"The distribution of values in the target column is skewed and might contain outliers. Because outliers induce large errors when optimizing MSE (or similar loss functions), ML algorithms tend to focus on them when training the model. This can result in sub-par prediction quality for the non-outlier rows. If it is important to predict the extreme target values well, no further action may be needed. If prediction of extreme values is not important, consider clipping extreme target values. Clipping or removing outliers can be done with Amazon SageMaker Data Wrangler using the “Robust standard deviation numeric outliers” transform under “Handle outliers”.\n",
"
\n",
"\n",
"\n",
"The column **CAR_HCPS_PMT_AMT** is used as the target column.\n",
"See the distribution of values (labels) in the target column below:\n",
"\n",
"
\n",
"\n",
"
\n",
"| Mean | Median | Minimum | Maximum | Skew | Kurtosis | Number of Uniques | Outliers Percentage | Invalid Percentage | Missing Percentage | Missing Count |\n",
"|------|--------|---------|---------|------|----------|-------------------|---------------------|--------------------|--------------------|---------------|\n",
"| 85.26 | 45.00 | 0.0 | 10000.0 | 19.86 | 708.23 | 47 | 2.69% | 0.00% | 0.00% | 0 |\n",
"\n",
"
\n",
" \n",
" Histogram of the target column values. The orange bars contain outliers and the value below them is the outliers average.\n",
"
\n",
"\n",
"### Outlier Target Values\n",
"A randomly chosen sample of rows with outlier target values is shown in the following table.\n",
"
\n",
"| CAR_HCPS_PMT_AMT | BENE_SEX_IDENT_CD | BENE_AGE_CAT_CD | CAR_LINE_ICD9_DGNS_CD | CAR_LINE_HCPCS_CD | CAR_LINE_BETOS_CD | CAR_LINE_SRVC_CNT | CAR_LINE_PRVDR_TYPE_CD | CAR_LINE_CMS_TYPE_SRVC_CD | CAR_LINE_PLACE_OF_SRVC_CD | CAR_LINE_CNT |\n",
"|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 10000 | 2 | 5 | 788 | 64590 | P6C | 1 | 5 | F | 24 | 6 |\n",
"| 5000 | 1 | 2 | 556 | J1745 | O1E | 100 | 1 | 1 | 11 | 7 |\n",
"| 4000 | 2 | 1 | 786 | A0431 | O1A | 1 | 3 | D | 42 | 9 |\n",
"| 4000 | 2 | 3 | 162 | J9305 | O1E | 100 | 1 | 1 | 11 | 96 |\n",
"| 4000 | 1 | 3 | 535 | A0431 | O1A | 1 | 3 | D | 42 | 2 |\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Sample\n",
"\n",
"The following table contains a random sample of **10** rows from the dataset.\n",
"The top two rows provide the type and prediction power of each column.\n",
"Verify the input headers correctly align with the columns of the dataset sample.\n",
"If they are incorrect, update the header names of your input dataset in Amazon Simple Storage Service (Amazon S3).\n",
"\n",
"
\n",
"| | CAR_HCPS_PMT_AMT | CAR_LINE_PLACE_OF_SRVC_CD | CAR_LINE_HCPCS_CD | CAR_LINE_BETOS_CD | CAR_LINE_CMS_TYPE_SRVC_CD | CAR_LINE_SRVC_CNT | CAR_LINE_PRVDR_TYPE_CD | CAR_LINE_ICD9_DGNS_CD | CAR_LINE_CNT | BENE_AGE_CAT_CD | BENE_SEX_IDENT_CD |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| **Prediction Power** | - | 0.359197 | 0.151996 | 0.149623 | 0.0708892 | 0.0657154 | 0.030139 | 0.00207852 | 0 | 0 | 0 |\n",
"| **Column Types** | - | numeric | numeric | text | categorical | numeric | numeric | numeric | numeric | numeric | numeric |\n",
"| **0** | 10 | 81 | 83872 | T1H | 5 | 1 | 5 | 715 | 4 | 5 | 2 |\n",
"| **1** | 5 | 81 | 80076 | T1B | 5 | 1 | 5 | 274 | 11 | 3 | 2 |\n",
"| **2** | 25 | 81 | 82785 | T1H | 5 | 1 | 5 | 477 | 39 | 6 | 2 |\n",
"| **3** | 50 | 11 | 95990 | O1C | 1 | 1 | 1 | 724 | 6 | 6 | 1 |\n",
"| **4** | 35 | 21 | 99231 | M2B | 1 | 1 | 1 | 695 | 1 | 2 | 2 |\n",
"| **5** | 100 | 23 | 99284 | M3 | 1 | 1 | 7 | 620 | 1 | 1 | 2 |\n",
"| **6** | 65 | 11 | 64495 | P6B | 2 | 1 | 1 | 721 | 180 | 1 | 2 |\n",
"| **7** | 35 | 11 | 92082 | M5C | Q | 1 | 0 | 362 | 1 | 2 | 2 |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Summary\n",
"\n",
"\n",
"Prediction power is measured by stratified splitting the data into 80%/20% training and validation folds. We fit a\n",
"model for each feature separately on the training fold after applying minimal feature pre-processing and measure\n",
"prediction performance on the validation data. Higher prediction power scores, toward 1, indicate columns that are\n",
"more useful for predicting the target on their own. Lower scores, toward 0 point to columns that contain little useful\n",
"information for predicting the target on their own.\n",
" For ease of presentation, we are only showing the top 10 prediction power results.\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Duplicate Rows\n",
"\n",
"\n",
"\n",
"No duplicate rows were found when testing a random sample of 8000\n",
"rows from the dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cross Column Statistics\n",
"\n",
"Amazon SageMaker Autopilot calculates Pearson’s correlation between columns in your dataset.\n",
"Removing highly correlated columns can reduce overfitting and training time.\n",
"Pearson’s correlation is in the range [-1, 1] where 0 implies no correlation, 1 implies perfect correlation,\n",
"and -1 implies perfect inverse correlation.\n",
"\n",
"The full correlation matrix between the 8\n",
"most predictive numeric features is presented below.\n",
"
\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Anomalous Rows\n",
"\n",
"Anomalous rows are detected using the Isolation forest algorithm on a sample of\n",
"**8000**\n",
"randomly chosen\n",
"rows after basic preprocessing. The isolation forest algorithm associates an anomaly score to each row of the dataset\n",
"it is trained on. Rows with negative anomaly scores are usually considered anomalous and rows with positive anomaly\n",
"scores are considered non-anomalous. When investigating an anomalous row, look for any unusual values -\n",
"in particular any that might have resulted from errors in the gathering and processing of data.\n",
"Deciphering whether a row is indeed anomalous, contains errors, or is in fact valid requires domain knowledge and\n",
"application of business logic.\n",
"\n",
"There were no anomalous rows found in the dataset.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Missing Values\n",
"Within the data sample, the following columns contained missing values, such as: `nan`, white spaces, or empty fields.\n",
"\n",
"SageMaker Autopilot will attempt to fill in missing values using various techniques. For example,\n",
"missing values can be replaced with a new 'unknown' category for `Categorical` features\n",
"and missing `Numerical` values can be replaced with the **mean** or **median** of the column.\n",
"\n",
"We found **0 of the 11** of the columns contained missing values.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Cardinality\n",
"For `String` features, it is important to count the number of unique values to determine whether to treat a feature as `Categorical` or `Text`\n",
"and then processes the feature according to its type.\n",
"\n",
"For example, SageMaker Autopilot counts the number of unique entries and the number of unique words.\n",
"The following string column would have **3** total entries, **2** unique entries, and **3** unique words.\n",
"\n",
"| | String Column |\n",
"|-------|-------------------|\n",
"| **0** | \"red blue\" |\n",
"| **1** | \"red blue\" |\n",
"| **2** | \"red blue yellow\" |\n",
"\n",
"If the feature is `Categorical`, SageMaker Autopilot can look at the total number of unique entries and transform it using techniques such as one-hot encoding.\n",
"If the field contains a `Text` string, we look at the number of unique words, or the vocabulary size, in the string.\n",
"We can use the unique words to then compute text-based features, such as Term Frequency-Inverse Document Frequency (tf-idf).\n",
"\n",
"**Note:** If the number of unique values is too high, we risk data transformations expanding the dataset to too many features.\n",
"In that case, SageMaker Autopilot will attempt to reduce the dimensionality of the post-processed data,\n",
"such as by capping the number vocabulary words for tf-idf, applying Principle Component Analysis (PCA), or other dimensionality reduction techniques.\n",
"\n",
"The table below shows **11 of the 11** columns ranked by the number of unique entries.\n",
"\n",
"
💡 Suggested Action Items\n",
"\n",
"- Verify the number of unique values of a feature is as expected.\n",
" One explanation for unexpected number of unique values could be multiple encodings of a value.\n",
" For example `US` and `U.S.` will count as two different words.\n",
" You could correct the error at the data source or pre-process your dataset in your S3 bucket.\n",
"- If the number of unique values seems too high for Categorical variables,\n",
" investigate if multiple unique values can be grouped into a smaller set of possible values.\n",
"
\n",
"\n",
"
\n",
"| | Number of Unique Entries | Number of Unique Words (if Text) |\n",
"|---|---|---|\n",
"| **BENE_SEX_IDENT_CD** | 2 | n/a |\n",
"| **BENE_AGE_CAT_CD** | 6 | n/a |\n",
"| **CAR_LINE_PRVDR_TYPE_CD** | 6 | n/a |\n",
"| **CAR_LINE_CMS_TYPE_SRVC_CD** | 17 | n/a |\n",
"| **CAR_LINE_PLACE_OF_SRVC_CD** | 24 | n/a |\n",
"| **CAR_HCPS_PMT_AMT** | 47 | n/a |\n",
"| **CAR_LINE_SRVC_CNT** | 57 | n/a |\n",
"| **CAR_LINE_BETOS_CD** | 91 | 91 |\n",
"| **CAR_LINE_CNT** | 302 | n/a |\n",
"| **CAR_LINE_ICD9_DGNS_CD** | 631 | n/a |\n",
"| **CAR_LINE_HCPCS_CD** | 1455 | n/a |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Descriptive Stats\n",
"For each of the input features that has at least one numeric value, several descriptive statistics are computed from the data sample.\n",
"\n",
"SageMaker Autopilot may treat numerical features as `Categorical` if the number of unique entries is sufficiently low.\n",
"For `Numerical` features, we may apply numerical transformations such as normalization, log and quantile transforms,\n",
"and binning to manage outlier values and difference in feature scales.\n",
"\n",
"We found **10 of the 11** columns contained at least one numerical value.\n",
"The table below shows the **10** columns which have the largest percentage of numerical values.\n",
"Percentage of outliers is calculated only for columns which Autopilot detected to be of numeric type. Percentage of outliers is\n",
"not calculated for the target column.\n",
"\n",
"
💡 Suggested Action Items\n",
"\n",
"- Investigate the origin of the data field. Are some values non-finite (e.g. infinity, nan)?\n",
" Are they missing or is it an error in data input?\n",
"- Missing and extreme values may indicate a bug in the data collection process.\n",
" Verify the numerical descriptions align with expectations.\n",
" For example, use domain knowledge to check that the range of values for a feature meets with expectations.\n",
"
\n",
"\n",
"\n",
"
\n",
"| | % of Numerical Values | Mean | Median | Min | Max | % of Outlier Values |\n",
"|---|---|---|---|---|---|---|\n",
"| **CAR_HCPS_PMT_AMT** | 100.0% | 85.2588 | 45.0 | 0.0 | 10000.0 | nan |\n",
"| **BENE_SEX_IDENT_CD** | 100.0% | 1.54462 | 2.0 | 1.0 | 2.0 | 0.0 |\n",
"| **BENE_AGE_CAT_CD** | 100.0% | 3.37538 | 3.0 | 1.0 | 6.0 | 0.0 |\n",
"| **CAR_LINE_SRVC_CNT** | 100.0% | 2.06637 | 1.0 | 1.0 | 500.0 | 4.8 |\n",
"| **CAR_LINE_PRVDR_TYPE_CD** | 100.0% | 1.59375 | 1.0 | 0.0 | 8.0 | 0.0 |\n",
"| **CAR_LINE_PLACE_OF_SRVC_CD** | 100.0% | 23.3712 | 21.0 | 0.0 | 99.0 | 0.0 |\n",
"| **CAR_LINE_CNT** | 100.0% | 25.1174 | 3.0 | 1.0 | 7840.0 | 4.0 |\n",
"| **CAR_LINE_ICD9_DGNS_CD** | 95.66% | 526.804 | 528.0 | 5.0 | 999.0 | 0.0 |\n",
"| **CAR_LINE_CMS_TYPE_SRVC_CD** | 94.03% | 3.00745 | 2.0 | 1.0 | 9.0 | nan |\n",
"| **CAR_LINE_HCPCS_CD** | 91.54% | 78248.4 | 88189.0 | 104.0 | 99406.0 | 0.0 |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Definitions\n",
"\n",
"### Feature types\n",
"\n",
"**Numeric:** Numeric values, either floats or integers. For example: age, income. When training a machine learning model, it is assumed that numeric values are ordered and a distance is defined between them. For example, 3 is closer to 4 than to 10 and 3 < 4 < 10.\n",
"\n",
"**Categorical:** The column entries belong to a set of unique values that is usually much smaller than number of rows in the dataset. For example, a column from datasets with 100 rows with the unique values \"Dog\", \"Cat\" and \"Mouse\". The values could be numeric, textual, or combination of both. For example, \"Horse\", \"House\", 8, \"Love\" and 3.1 are\n",
"all valid values and can be found in the same categorical column. When manipulating column of categorical values, a machine learning model does not assume that they are ordered or that distance function is defined on them, even if all of the values are numbers.\n",
"\n",
"**Binary:** A special case of categorical column for which the cardinality of the set of unique values is 2.\n",
"\n",
"**Text:** A text column that contains many non-numeric unique values, often a human readable text. In extreme cases, all the elements of the column are unique, so no two entries are the same.\n",
"\n",
"**Datetime:** This column contains date and/or time information.\n",
"\n",
"### Feature statistics\n",
"\n",
"**Prediction power:** Prediction power of a column (feature) is a measure of how useful it is for predicting the target variable. It is measured using a stratified split into 80%/20% training and validation folds. We fit a model for each feature separately on the training fold after applying minimal feature pre-processing and measure prediction performance on the validation data. The scores are normalized to the range [0,1]. A higher prediction power score near 1 indicate that a column is more useful for predicting the target on its own. A lower score near 0 indicate that a column contains little useful information for predicting the target on their own. Although it is possible that a column that is uninformative on its own can be useful in predicting the target when used in tandem with other features, a low score usually indicates the feature is redundant. A score of 1 implies perfect predictive abilities, which often indicates an error called target leakage. The cause is typically a column present in dataset that is hard or impossible to obtain at prediction time, such as a duplicate of the target.\n",
"\n",
"**Outliers:** Outliers are detected using two statistics that are robust to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature values to the range [5 percentile, 95 percentile] and calculating the standard deviation of the clipped vector. All values larger than median + 5 * RSTD or smaller than median - 5 * RSTD are considered to be outliers.\n",
"\n",
"**Skew:** Skew measures the symmetry of the distribution and is defined as the third moment of the distribution divided by the third power of the standard deviation. The skewness of the normal distribution or any other symmetric distribution is zero. Positive values imply that the right tail of the distribution is longer than the left tail. Negative values imply that the left tail of the distribution is longer than the right tail. As a thumb rule, a distribution is considered skewed when the absolute value of the skew is larger than 3.\n",
"\n",
"**Kurtosis:** Pearson's kurtosis measures the heaviness of the tail of the distribution and is defined as the fourth moment of the distribution divided by the fourth power of the standard deviation. The kurtosis of the normal distribution is 3. Thus, kurtosis values lower than 3 imply that the distribution is more concentrated around the mean and the tails are lighter than the tails of the normal distribution. Kurtosis values higher than 3 imply heavier tails than the normal distribution or that the data contains outliers.\n",
"\n",
"**Missing Values:** Empty strings and strings composed of only white spaces are considered missing.\n",
"\n",
"**Valid values:**\n",
"\n",
"* **Numeric features / regression target:** All values that could be casted to finite floats are valid. Missing values are not valid.\n",
"* **Categorical / binary / text features / classification target:** All values that are not missing are valid.\n",
"* **Datetime features:** All values that could be casted to datetime object are valid. Missing values are not valid.\n",
"\n",
"**Invalid values:** values that are either missing or that could not be casted to the desired type. See the definition of valid values for more information"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}