# SageMaker Security Demo Notebook

In this notebook you will demonstrate how to perform common data science tasks in a secure fashion, consistent with the requirements of regulated customers. This notebook will focus on the data science workflow while the following notebook will focus on the DevOps workflow.

This notebook is divided into 6 parts:

1. Compute and Network Isolation

1. Authentication and Authorization

1. Artifact Management

1. Data Encryption

1. Traceability and Auditability

1. Explainability and Interpretability

> **Note**

 To create this notebook environment, we used lifecycle configurations to securely download a dataset from Amazon S3 and install necessary libraries at setup. We also associated the notebook with a "Private" Git Repo for maintaining source and code version control. 

 This private Git repo can be replaced with your own **Enterprise Git** hosted on-prem, or **BitBucket** or any publicly hosted repo of your choosing. 

 We also demonstrate how to pip install required libraries from this SageMaker notebook. A common practice is to use lifecycle configurations to manage all dependencies up front.

## Section A: Environment Setup

### Part 1: Compute and Network Isolation 
---

In this exercise we have launched a Jupyter notebook server **without** Internet access. The server runs within a VPC without Internet connectivity but still maintains access to specific AWS services such as Elastic Container Registry and Amazon S3. Access to a shared services VPC has also been configured to allow connectivity to a centralized repository of Python packages.

#### Test Networking

To demonstrate the lack of Internet connectivity, run the command below. Without a path to the Internet or a proxy server, it will time out.

In [None]:
!curl https://aws.amazon.com

By removing public internet access in this way, we have created a secure environment where all the dependencies are installed, but the notebook now has no way to access the internet, and internet traffic cannot reach the notebook either. 
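
The same check can be scripted in Python, which is convenient when you want to assert isolation as part of a setup test. The sketch below is a minimal example using only the standard library; the helper name `can_reach` and its default port and timeout are illustrative choices, not part of the workshop code.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

In the isolated environment described here, `can_reach("aws.amazon.com")` would be expected to return `False`.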

### Part 2: Authentication and Authorization
---

SageMaker notebooks need to be assigned a role for accessing AWS services. Fine-grained access control over which services a SageMaker notebook is allowed to access can be provided using Identity and Access Management (IAM).

To control access at a user level, data scientists should typically not be allowed to create notebooks or to provision or delete infrastructure. In some cases, even console access can be removed by creating presigned URLs that directly launch a hosted Jupyter environment for data scientists to use from their laptops.
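
As a rough illustration of the presigned URL approach, an administrator (or an internal portal acting on their behalf) could request a URL through the SageMaker `create_presigned_notebook_instance_url` API. The helper below is a sketch: the function name, the notebook name in the usage comment, and the default expiry are assumptions made for this example.

```python
def notebook_presigned_url(sm_client, notebook_name: str, expires_in: int = 1800) -> str:
    """Request a presigned URL that opens the named notebook instance.

    sm_client is expected to be a boto3 SageMaker client.
    """
    response = sm_client.create_presigned_notebook_instance_url(
        NotebookInstanceName=notebook_name,
        SessionExpirationDurationInSeconds=expires_in,
    )
    return response["AuthorizedUrl"]


# Usage (requires AWS credentials; "my-notebook" is a hypothetical name):
#   import boto3
#   url = notebook_presigned_url(boto3.client("sagemaker"), "my-notebook")
```

Passing the client in as an argument keeps the helper easy to test and lets the caller decide which credentials are used.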

Moreover, admins can use resource [tags for attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) to ensure that different teams of data scientists, with the same high-level IAM role, have different access rights to AWS services, such as only allowing read/write access to specific S3 buckets which match tag criteria. 
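
To make the ABAC idea concrete, the sketch below builds one policy statement that allows reading only those S3 objects whose tag matches a tag on the calling principal. The tag key `team` is a hypothetical example, and the statement covers read access only; write access would need its own statement and condition keys.

```python
import json


def abac_s3_read_statement(tag_key: str = "team") -> dict:
    """Build an IAM statement allowing s3:GetObject when the object tag
    equals the principal tag of the same name."""
    return {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                # ${aws:PrincipalTag/...} is an IAM policy variable resolved at request time.
                f"s3:ExistingObjectTag/{tag_key}": f"${{aws:PrincipalTag/{tag_key}}}"
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(abac_s3_read_statement(), indent=2))
```

With a statement like this, two teams can share one role definition while their tags determine which objects each can read.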

For customers with even more stringent data and code segregation requirements, admins can provision different accounts for individual teams and manage the billing from these accounts in a centralized Organizational Unit. 

In [None]:
# Let's inspect the role we have created for our notebook here:
import boto3
import sagemaker
from sagemaker import get_execution_role

sm = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
region = boto3.session.Session().region_name

role = get_execution_role()
print("Notebook is running with assumed role {}".format(role))
print("Working with AWS services in the {} region".format(region))

#### Sample Notebook IAM Role

As part of this workshop, we have assigned an IAM role to this notebook. This role will be used by the notebook instance to access AWS APIs. Look at the IAM policies attached to this role. 

Below is an example policy which provides least privilege access to various services like Amazon S3 and Amazon SageMaker that a data scientist would need to develop and conduct experiments. 

```json
{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Action": [
 "ssm:GetParameters",
 "ssm:GetParameter"
 ],
 "Resource": "arn:aws:ssm:eu-west-2:0123456789012:parameter/ds-*",
 "Effect": "Allow"
 },
 {
 "Condition": {
 "Null": {
 "sagemaker:VpcSubnets": "true"
 }
 },
 "Action": [
 "sagemaker:CreateNotebookInstance",
 "sagemaker:CreateHyperParameterTuningJob",
 "sagemaker:CreateProcessingJob",
 "sagemaker:CreateTrainingJob",
 "sagemaker:CreateModel"
 ],
 "Resource": "*",
 "Effect": "Deny"
 },
 {
 "Condition": {
 "ForAllValues:StringEqualsIfExists": {
 "sagemaker:VpcSubnets": [
 "subnet-012341dabe787cc21",
 "subnet-0123457cd6518f8af",
 "subnet-01234da97259ab887"
 ],
 "sagemaker:VpcSecurityGroupIds": [
 "sg-012347ba900d25251"
 ]
 }
 },
 "Action": [
 "sagemaker:*"
 ],
 "Resource": "*",
 "Effect": "Allow"
 },
 {
 "Action": [
 "application-autoscaling:DeleteScalingPolicy",
 "application-autoscaling:DeleteScheduledAction",
 "application-autoscaling:DeregisterScalableTarget",
 "application-autoscaling:DescribeScalableTargets",
 "application-autoscaling:DescribeScalingActivities",
 "application-autoscaling:DescribeScalingPolicies",
 "application-autoscaling:DescribeScheduledActions",
 "application-autoscaling:PutScalingPolicy",
 "application-autoscaling:PutScheduledAction",
 "application-autoscaling:RegisterScalableTarget",
 "cloudwatch:DeleteAlarms",
 "cloudwatch:DescribeAlarms",
 "cloudwatch:GetMetricData",
 "cloudwatch:GetMetricStatistics",
 "cloudwatch:ListMetrics",
 "cloudwatch:PutMetricAlarm",
 "cloudwatch:PutMetricData",
 "ec2:CreateNetworkInterface",
 "ec2:CreateNetworkInterfacePermission",
 "ec2:DeleteNetworkInterface",
 "ec2:DeleteNetworkInterfacePermission",
 "ec2:DescribeDhcpOptions",
 "ec2:DescribeNetworkInterfaces",
 "ec2:DescribeRouteTables",
 "ec2:DescribeSecurityGroups",
 "ec2:DescribeSubnets",
 "ec2:DescribeVpcEndpoints",
 "ec2:DescribeVpcs",
 "ecr:BatchCheckLayerAvailability",
 "ecr:BatchGetImage",
 "ecr:CreateRepository",
 "ecr:GetAuthorizationToken",
 "ecr:GetDownloadUrlForLayer",
 "ecr:Describe*",
 "elastic-inference:Connect",
 "iam:ListRoles",
 "kms:CreateGrant",
 "kms:Decrypt",
 "kms:DescribeKey",
 "kms:Encrypt",
 "kms:GenerateDataKey",
 "kms:ListAliases",
 "lambda:ListFunctions",
 "logs:CreateLogGroup",
 "logs:CreateLogStream",
 "logs:DescribeLogStreams",
 "logs:GetLogEvents",
 "logs:PutLogEvents",
 "sns:ListTopics",
 "codecommit:BatchGetRepositories",
 "codecommit:GitPull",
 "codecommit:GitPush",
 "codecommit:CreateBranch",
 "codecommit:DeleteBranch",
 "codecommit:GetBranch",
 "codecommit:ListBranches",
 "codecommit:CreatePullRequest",
 "codecommit:GetPullRequest",
 "codecommit:CreateCommit",
 "codecommit:GetCommit",
 "codecommit:GetCommitHistory",
 "codecommit:GetDifferences",
 "codecommit:GetReferences",
 "codecommit:CreateRepository",
 "codecommit:GetRepository",
 "codecommit:ListRepositories"
 ],
 "Resource": "*",
 "Effect": "Allow"
 },
 {
 "Action": [
 "ecr:SetRepositoryPolicy",
 "ecr:CompleteLayerUpload",
 "ecr:BatchDeleteImage",
 "ecr:UploadLayerPart",
 "ecr:DeleteRepositoryPolicy",
 "ecr:InitiateLayerUpload",
 "ecr:DeleteRepository",
 "ecr:PutImage"
 ],
 "Resource": "arn:aws:ecr:*:*:repository/*sagemaker*",
 "Effect": "Allow"
 },
 {
 "Action": [
 "s3:GetObject",
 "s3:PutObject",
 "s3:DeleteObject"
 ],
 "Resource": [
 "arn:aws:s3:::ds-data-bucket-project-dev",
 "arn:aws:s3:::ds-data-bucket-project-dev/*",
 "arn:aws:s3:::ds-model-bucket-project-dev",
 "arn:aws:s3:::ds-model-bucket-project-dev/*"
 ],
 "Effect": "Allow"
 },
 {
 "Action": [
 "s3:GetBucketLocation",
 "s3:ListBucket",
 "s3:ListAllMyBuckets"
 ],
 "Resource": "*",
 "Effect": "Allow"
 },
 {
 "Condition": {
 "StringEqualsIgnoreCase": {
 "s3:ExistingObjectTag/SageMaker": "true"
 }
 },
 "Action": [
 "s3:GetObject"
 ],
 "Resource": "*",
 "Effect": "Allow"
 },
 {
 "Action": [
 "lambda:InvokeFunction"
 ],
 "Resource": [
 "arn:aws:lambda:*:*:function:*SageMaker*",
 "arn:aws:lambda:*:*:function:*sagemaker*",
 "arn:aws:lambda:*:*:function:*Sagemaker*",
 "arn:aws:lambda:*:*:function:*LabelingFunction*"
 ],
 "Effect": "Allow"
 },
 {
 "Condition": {
 "StringLike": {
 "iam:AWSServiceName": "sagemaker.application-autoscaling.amazonaws.com"
 }
 },
 "Action": "iam:CreateServiceLinkedRole",
 "Resource": "arn:aws:iam::*:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint",
 "Effect": "Allow"
 },
 {
 "Action": [
 "sns:Subscribe",
 "sns:CreateTopic"
 ],
 "Resource": [
 "arn:aws:sns:*:*:*SageMaker*",
 "arn:aws:sns:*:*:*Sagemaker*",
 "arn:aws:sns:*:*:*sagemaker*"
 ],
 "Effect": "Allow"
 },
 {
 "Condition": {
 "StringEquals": {
 "iam:PassedToService": [
 "sagemaker.amazonaws.com"
 ]
 }
 },
 "Action": [
 "iam:PassRole"
 ],
 "Resource": "*",
 "Effect": "Allow"
 }
 ]
}
```

**Optional IAM Activity** Visit the AWS IAM console and review the role for this notebook and its associated permissions.
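
The same review can also be done programmatically. The sketch below lists the managed and inline policy names for a role using the IAM `list_attached_role_policies` and `list_role_policies` calls. The helper names are illustrative, pagination is ignored for brevity, and the sample policy above does not grant these IAM read permissions, so the calls may need to be run with administrator credentials.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Extract the role name (last path segment) from an IAM role ARN."""
    return role_arn.split("/")[-1]


def list_role_policy_names(iam_client, role_arn: str) -> dict:
    """Return the managed and inline policy names attached to a role.

    iam_client is expected to be a boto3 IAM client.
    """
    role_name = role_name_from_arn(role_arn)
    attached = iam_client.list_attached_role_policies(RoleName=role_name)
    inline = iam_client.list_role_policies(RoleName=role_name)
    return {
        "managed": [p["PolicyName"] for p in attached["AttachedPolicies"]],
        "inline": inline["PolicyNames"],
    }


# Usage (requires IAM read permissions):
#   import boto3
#   print(list_role_policy_names(boto3.client("iam"), role))
```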

#### Complete Setup: Import libraries and set global definitions.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
from time import sleep, gmtime, strftime
import time

In [None]:
# Import SageMaker Experiments 
! pip install sagemaker-experiments
from sagemaker.analytics import ExperimentAnalytics
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

#### Import Networking definitions: VPC Id, KMS keys and security groups and subnets

This SageMaker notebook is associated with a Lifecycle Configuration script which created a convenience Python module for you. This module is defined in ~/.ipython/sagemaker_environment.py and provides Python constants for values such as the AWS VPC configuration to be used in conjunction with Amazon SageMaker resources or the KMS encryption key ID to be used with Amazon S3. As part of this notebook you will import this module in the following cells. Feel free to inspect the source code as well.

In [None]:
# Create Networking configuration required for all APIs. 
from sagemaker.network import NetworkConfig
import sagemaker_environment as smenv

cmk_id = smenv.SAGEMAKER_KMS_KEY_ID 
sec_groups = smenv.SAGEMAKER_SECURITY_GROUPS
subnets = smenv.SAGEMAKER_SUBNETS
network_config = NetworkConfig(security_group_ids = sec_groups, subnets = subnets)

#### Install Libraries using pip (while still being offline!)

Typically when you use pip to install packages the code is downloaded over the public internet from the PyPI servers. However most customers do not allow public internet access from their notebook environment. To work within those guidelines, your notebook has been configured to work with a private centralised Shared Services PyPI mirror. This mirror will allow you to install and validate packages, as many regulated customers need to validate open source packages through their application security processes before they can be used by teams.
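
For illustration, pointing pip at a private mirror usually comes down to setting `index-url` (and, if needed, `trusted-host`) in the `[global]` section of pip's configuration file. The sketch below renders such a file with the standard library. The mirror URL in the usage comment is hypothetical, and how the workshop's lifecycle configuration actually sets this up is not shown in this notebook.

```python
import configparser
import io


def render_pip_conf(index_url: str, trusted_host: str = "") -> str:
    """Render the text of a pip configuration file that uses a private index."""
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"index-url": index_url}
    if trusted_host:
        config["global"]["trusted-host"] = trusted_host
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Usage (hypothetical mirror address):
#   print(render_pip_conf("https://pypi.shared-services.example.internal/simple"))
```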

By using a shared services PyPI mirror you have created a separation between the private data scientist VPC and a potentially internet-facing VPC. The notebook's Lifecycle Configuration has installed the libraries you need, but for the purposes of this demo, you will also pip install SHAP, to demonstrate communication with the centralized, private PyPI mirror.

In [None]:
# Let's install the shap library from our private PyPI mirror.
! pip install shap

In [None]:
# Import xgboost and a custom utilities package we use in this notebook
import xgboost as xgb
from util import utilsspec 

### Part 3: Artifact Management 
---

During the machine learning lifecycle a number of artifacts will be generated by our data processing jobs, training jobs and experimentation. To store these artifacts we specify the bucket locations where the model and data artifacts will reside below. These inputs are then fed into the SageMaker Estimators during data pre-processing and model training.

SageMaker will automatically look in the specified buckets for accessing any training/validation data, and ensure that model outputs are stored in the output directories specified.

Later on, we will see how to track these artifacts using the SageMaker Experiments API.

The workshop pre-provisioned a set of buckets and their names are included in our `sagemaker_environment.py` file so we will simply import those here directly. 

In [None]:
# We have already created buckets as part of the Secure Data Science Workshop. Here we will simply import those buckets
# for your use.

# raw_bucket: stores raw data and any preprocessing job related code.
# data_bucket: stores train/test data for training/validating ML models.
# output_bucket: where the model artifacts and outputs will be stored.
# For our demo, these buckets are the same, but as best practice, we probably want to keep them separate with different permissions.

raw_bucket = smenv.SAGEMAKER_DATA_BUCKET 
data_bucket = smenv.SAGEMAKER_DATA_BUCKET 
output_bucket = smenv.SAGEMAKER_MODEL_BUCKET

prefix = 'secure-sagemaker-demo' # use this prefix to store all files pertaining to this workshop.

dataprefix = prefix + '/data'
traindataprefix = prefix + '/train_data'
testdataprefix = prefix + '/test_data'

print("Storing training data to s3://{}".format (data_bucket))
print ("Training job output will be stored in s3://{}".format (output_bucket))

## Section B: Pre-processing and Feature Engineering

A key part of the data science lifecycle is data exploration, pre-processing and feature engineering. In this section you will demonstrate how to use SageMaker notebooks for data exploration and SageMaker Processing for feature engineering and pre-processing data.

### Download and Import the data

For this notebook, we use the public [Credit Card default dataset](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) downloaded from UCI and referenced in:

 Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

Since your notebook does not have internet connectivity, the dataset has already been downloaded and made available on your local notebook instance.

The dataset uses some user features (age, education level, marital status, etc.) and some prior history of credit card payments to predict the likelihood of default on next month's payment. Here a value of `1` indicates default and `0` indicates no default.

In [None]:
WORKDIR = os.getcwd()
BASENAME = os.path.dirname(WORKDIR)

In [None]:
data = pd.read_excel('credit_card_default_data.xls', header=1)
data = data.drop(columns = ['ID'])
data.head()

In [None]:
# Note that the categorical columns SEX, Education and Marriage have been Integer Encoded in this case.
# For example:
data.SEX.value_counts()

In [None]:
data.rename(columns={"default payment next month": "Label"}, inplace=True)
lbl = data.Label
data = pd.concat([lbl, data.drop(columns=['Label'])], axis = 1)
data.head()

### Data Exploration

The first step in any ML Lifecycle is to explore the dataset to understand the statistical distributions of the features, engineer new ones as well as transform existing features into ML ready features which can be consumed by machine learning models. 

#### Is the data imbalanced?

One of the first steps in feature engineering is to check for class imbalance, i.e. whether one label occurs much more often than the other. Here the majority class accounts for about 80% of the records. That level of imbalance is acceptable for many machine learning models and does not require special balancing methods. For datasets where the majority class exceeds 90%, using a sampling technique to generate a more balanced dataset is usually a good idea.
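
The imbalance can also be expressed as a single number: the share of records that belong to the most frequent class. The sketch below is a minimal, dependency-free way to compute it; the helper name is illustrative.

```python
from collections import Counter


def majority_class_share(labels) -> float:
    """Return the fraction of records that belong to the most frequent label."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    return max(counts.values()) / sum(counts.values())


# Usage with the DataFrame from this notebook:
#   majority_class_share(data.Label)
```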

In [None]:
import seaborn as sns
sns.countplot(x=data.Label)  # pass the column as a keyword; newer seaborn releases reject a positional vector
plt.title('Counts of Default versus Non Default Labels')
plt.show()

#### Are features correlated?

Simpler models such as linear/logistic regression typically don't perform well if one has correlated features. A simple way to look into this is to plot a correlation matrix as shown below. As you can see, the Payment and Bill features are strongly correlated, which is not surprising. You will also want to see if any features are strongly correlated with the label: if so, you need to ask, "will this feature be available in the incoming data or is there some leakage of the dependent variable into one of the independent variables?".

Here you will include all these features as you have a small dataset; but in general, one may want to explore some kind of dimensionality reduction technique such as Principal Component Analysis (PCA).
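
Besides inspecting the plot, it can help to list the strongly correlated pairs explicitly. The sketch below computes Pearson correlations in plain Python and returns the pairs whose absolute correlation meets a threshold. The helper names and the 0.9 default are assumptions for illustration, and constant columns (zero variance) are not handled. With pandas, `data.corr()` yields the same coefficients in vectorized form.

```python
import math
from itertools import combinations


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns: dict, threshold: float = 0.9) -> list:
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Usage with the DataFrame from this notebook:
#   correlated_pairs({c: data[c].tolist() for c in data.columns})
```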

In [None]:
## Correlation plot
f = plt.figure(figsize=(19, 15))
plt.matshow(data.corr(), fignum=f.number)
plt.xticks(range(data.shape[1]), data.columns, fontsize=14, rotation=45)
plt.yticks(range(data.shape[1]), data.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);

In [None]:
from pandas.plotting import scatter_matrix
SCAT_COLUMNS = ['BILL_AMT1', 'BILL_AMT2', 'PAY_AMT1', 'PAY_AMT2']
scatter_matrix(data[SCAT_COLUMNS],figsize=(10, 10), diagonal ='kde')
plt.show()

### Preprocessing and Feature Engineering in Notebook

First we will run the feature engineering directly in our SageMaker Notebook. While this is okay for small datasets, it is not really recommended at scale. Moreover, it is hard to track Feature Engineering jobs from a versioning and lineage perspective if it is run in an ad hoc manner inside a notebook instance. 

In the cells that follow you will see how to use SageMaker Processing to scale out our feature engineering jobs. 

In [None]:
os.makedirs('rawdata', exist_ok=True)  # no error if the directory already exists
if not os.path.exists('rawdata/rawdata.csv'):
    data.to_csv('rawdata/rawdata.csv', index=False)

In [None]:
#upload the raw data to S3.
rawdataprefix = 'rawdata'
raw_data_location = sess.upload_data(rawdataprefix, bucket=raw_bucket, key_prefix=dataprefix)
print(raw_data_location)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import make_column_transformer

COLS = data.columns
X_train, X_test, y_train, y_test = train_test_split(data.drop('Label', axis=1), data['Label'], 
 test_size=0.2, random_state=0)
newcolorder = ['PAY_AMT1','BILL_AMT1'] + list(COLS[1:])[:11] + list(COLS[1:])[12:17] + list(COLS[1:])[18:]

preprocess = make_column_transformer(
 (StandardScaler(),['PAY_AMT1']),
 (MinMaxScaler(), ['BILL_AMT1']),
 remainder='passthrough')
 
print('Running preprocessing and feature engineering transformations')
train_features = pd.DataFrame(preprocess.fit_transform(X_train), columns = newcolorder)
test_features = pd.DataFrame(preprocess.transform(X_test), columns = newcolorder)
train_full = pd.concat([pd.DataFrame(y_train.values, columns=['Label']), pd.DataFrame(train_features)], axis=1)
test_full = pd.concat([pd.DataFrame(y_test.values, columns=['Label']), pd.DataFrame(test_features)], axis=1)
train_full.to_csv('train_data.csv', index=False, header=False)
test_full.to_csv('test_data.csv', index=False, header=False) 
print("Completed transformation, training set has shape: {}".format (train_features.shape))

In [None]:
# upload the training and validation data in S3. 
print(sess.upload_data('train_data.csv', bucket=data_bucket, key_prefix=traindataprefix))
print(sess.upload_data('test_data.csv', bucket=data_bucket, key_prefix=testdataprefix))

#### Secure and scalable Feature Engineering pipeline using SageMaker Processing

While you can pre-process small amounts of data directly in a notebook as shown above, SageMaker Processing offloads the heavy lifting of pre-processing larger datasets by provisioning the underlying infrastructure, securely downloading the data from an S3 location to the processing container, running the processing scripts, storing the processed data in an output directory in Amazon S3 and deleting the underlying transient resources needed to run the processing job. Once the processing job is complete, the infrastructure used to run the job is wiped, and any temporary data stored on it is deleted.

Importantly, as we see below, we can now track this part of our analysis process so that the lineage of our downstream trained ML models can be versioned and traced back to a feature engineering pipeline.

### Write a preprocessing script (same as above)

In [None]:
%%writefile preprocessing.py

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.exceptions import DataConversionWarning
from sklearn.compose import make_column_transformer

warnings.filterwarnings(action='ignore', category=DataConversionWarning)

if __name__=='__main__':
 parser = argparse.ArgumentParser()
 parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
 parser.add_argument('--random-split', type=int, default=0)
 args, _ = parser.parse_known_args()
 
 print('Received arguments {}'.format(args))

 input_data_path = os.path.join('/opt/ml/processing/input', 'rawdata.csv')
 
 print('Reading input data from {}'.format(input_data_path))
 df = pd.read_csv(input_data_path)
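 # Shuffle the rows. sample() returns a new DataFrame, so the result must be assigned.
 df = df.sample(frac=1, random_state=args.random_split).reset_index(drop=True)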
 
 COLS = df.columns
 newcolorder = ['PAY_AMT1','BILL_AMT1'] + list(COLS[1:])[:11] + list(COLS[1:])[12:17] + list(COLS[1:])[18:]
 
 split_ratio = args.train_test_split_ratio
 random_state=args.random_split
 
 X_train, X_test, y_train, y_test = train_test_split(df.drop('Label', axis=1), df['Label'], 
 test_size=split_ratio, random_state=random_state)
 
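 # scikit-learn 0.20.0 (the container version used for this job) expects (columns, transformer) tuples;
 # newer releases expect (transformer, columns), as in the notebook cell above.
 preprocess = make_column_transformer(
 (['PAY_AMT1'], StandardScaler()),
 (['BILL_AMT1'], MinMaxScaler()),
 remainder='passthrough')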
 
 print('Running preprocessing and feature engineering transformations')
 train_features = pd.DataFrame(preprocess.fit_transform(X_train), columns = newcolorder)
 test_features = pd.DataFrame(preprocess.transform(X_test), columns = newcolorder)
 
 # concat to ensure Label column is the first column in dataframe
 train_full = pd.concat([pd.DataFrame(y_train.values, columns=['Label']), train_features], axis=1)
 test_full = pd.concat([pd.DataFrame(y_test.values, columns=['Label']), test_features], axis=1)
 
 print('Train data shape after preprocessing: {}'.format(train_features.shape))
 print('Test data shape after preprocessing: {}'.format(test_features.shape))
 
 train_features_headers_output_path = os.path.join('/opt/ml/processing/train_headers', 'train_data_headers.csv')
 
 train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_data.csv')
 
 test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_data.csv')
 
 print('Saving training features to {}'.format(train_features_output_path))
 train_full.to_csv(train_features_output_path, header=False, index=False)
 print("Complete")
 
 print("Save training data with headers to {}".format(train_features_headers_output_path))
 train_full.to_csv(train_features_headers_output_path, index=False)
 
 print('Saving test features to {}'.format(test_features_output_path))
 test_full.to_csv(test_features_output_path, header=False, index=False)
 print("Complete")
 

In [None]:
# Copy the preprocessing code over to the s3 bucket
codeprefix = prefix + '/code'
codeupload = sess.upload_data('preprocessing.py', bucket=raw_bucket, key_prefix=codeprefix)
print(codeupload)

In [None]:
train_data_location = 's3://'+ data_bucket + '/' + traindataprefix
train_header_location = 's3://'+ data_bucket +'/'+ prefix +'/train_headers'
test_data_location = 's3://'+ data_bucket+'/'+testdataprefix
print("Training data location = {}".format(train_data_location))
print("Test data location = {}".format(test_data_location))

### Part 4: Data Encryption
---

To ensure that the processed data is encrypted at rest on the processing cluster, we pass a customer managed key to the `volume_kms_key` parameter below. This instructs Amazon SageMaker to encrypt the EBS volumes used during the processing job with the specified key. Since the data stored in our Amazon S3 buckets is already encrypted, data is encrypted at rest at all times.

Amazon SageMaker always uses TLS-encrypted connections when working with Amazon S3, so data is also encrypted in transit when traveling to or from Amazon S3.
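
If you want to confirm that a bucket really has default encryption enabled, the S3 `get_bucket_encryption` API reports the configured rules. The sketch below extracts the algorithm and key for each rule. The helper name is illustrative, and the call needs the `s3:GetEncryptionConfiguration` permission, which the sample notebook policy shown earlier does not include.

```python
def default_encryption(s3_client, bucket: str) -> list:
    """Return (algorithm, kms_key_id) for each default-encryption rule on a bucket.

    s3_client is expected to be a boto3 S3 client. kms_key_id is None when
    the rule does not name a KMS key.
    """
    response = s3_client.get_bucket_encryption(Bucket=bucket)
    rules = response["ServerSideEncryptionConfiguration"]["Rules"]
    return [
        (
            rule["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"],
            rule["ApplyServerSideEncryptionByDefault"].get("KMSMasterKeyID"),
        )
        for rule in rules
    ]


# Usage (requires suitable S3 permissions):
#   import boto3
#   print(default_encryption(boto3.client("s3"), data_bucket))
```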

In [None]:
## Use SageMaker Processing with SKLearn. -- combine data into train and test at this stage if possible.
from sagemaker.sklearn.processing import SKLearnProcessor
sklearn_processor = SKLearnProcessor(
 framework_version='0.20.0',
 role=role,
 instance_type='ml.c4.xlarge',
 instance_count=1,
 network_config=network_config, # attach SageMaker resources to your VPC
 volume_kms_key=cmk_id # encrypt the EBS volume attached to SageMaker Processing instance
)

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(
 code=codeupload,
 inputs=[
 ProcessingInput(
 source=raw_data_location, destination='/opt/ml/processing/input')
 ],
 outputs=[
 ProcessingOutput(
 output_name='train_data',
 source='/opt/ml/processing/train',
 destination=train_data_location),
 ProcessingOutput(
 output_name='test_data',
 source='/opt/ml/processing/test',
 destination=test_data_location),
 ProcessingOutput(
 output_name='train_data_headers',
 source='/opt/ml/processing/train_headers',
 destination=train_header_location)
 ],
 arguments=['--train-test-split-ratio', '0.2'])

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_training_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'test_data':
        preprocessed_test_data = output['S3Output']['S3Uri']

## Section C: Model development and Training

### Part 5. Traceability and Auditability 
---

We use SageMaker Experiments so that data scientists can track the lineage of the model from the raw data source to the preprocessing steps and the model training pipeline. With SageMaker Experiments, data scientists can compare, track and manage multiple different model training jobs, data processing jobs, and hyperparameter tuning jobs, retaining a lineage from the source data to the training job artifacts to the model hyperparameters and any custom metrics that they may want to monitor as part of the model training.

Here we used SageMaker's managed XGBoost container to train an XGBoost model. More details about the managed container can be found here: https://github.com/aws/sagemaker-xgboost-container

Many customers require tracking and lineage down to the source code level: which user made the most recent commit to the training code that produced the deployed production model. We demonstrate how this is done using GitHub APIs integrated with SageMaker Experiments.

In [None]:
# Create a SageMaker Experiment
cc_experiment = Experiment.create(
 experiment_name=f"CreditCardDefault-{int(time.time())}", 
 description="Predict credit card default from payments data", 
 sagemaker_boto_client=sm)
print(cc_experiment)


Now you can track your SageMaker processing job as shown below. Here you will track the train_test_split_ratio, but you can track all kinds of other metadata such as the underlying instance types used to run the processing job or any specific feature engineering steps such as the random seed used to generate the train, test splits.

In [None]:
# Start Tracking parameters used in the Pre-processing pipeline.
with Tracker.create(display_name="Preprocessing", sagemaker_boto_client=sm) as tracker:
 tracker.log_parameters({
 "train_test_split_ratio": 0.2
 })
 # we can log the s3 uri to the dataset we just uploaded
 tracker.log_input(name="ccdefault-raw-dataset", media_type="s3/uri", value=raw_data_location)
 tracker.log_input(name="ccdefault-train-dataset", media_type="s3/uri", value=train_data_location)
 tracker.log_input(name="ccdefault-test-dataset", media_type="s3/uri", value=test_data_location)
 

### Train the Model

The same security practices you applied previously during SM Processing apply to training jobs. You will also have SageMaker experiments track the training job and store metadata such as model artifact location, training and validation data location, and model hyperparameters.

**Managed Spot Training**: To save on cost, you can run the training using managed Spot instances. SageMaker looks for available Spot capacity of the desired instance type within the configured maximum wait time and, if capacity is available, runs your training job on the lower-cost instances. With Managed Spot, customers can save up to 90% in cost.

For bring-your-own containers, customers are responsible for checkpointing models so that Spot instances can resume training if a training job is interrupted. For some SageMaker built-in algorithms, as well as SageMaker managed containers for TensorFlow/PyTorch/MXNet, SageMaker will handle the model checkpointing. For others, such as XGBoost, you will limit the max_wait_time to 3600 seconds.
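
To see what Managed Spot actually saved for a given job, you can compare the `BillableTimeInSeconds` and `TrainingTimeInSeconds` fields returned by `describe_training_job`. The sketch below shows the arithmetic; the helper name is illustrative.

```python
def spot_savings_percent(job_description: dict) -> float:
    """Percentage saved by Managed Spot, from a describe_training_job response."""
    training = job_description["TrainingTimeInSeconds"]
    billable = job_description["BillableTimeInSeconds"]
    if training <= 0:
        return 0.0
    # Savings is the share of training time that was not billed.
    return (1 - billable / training) * 100


# Usage after a training job has finished (job_name is the job's name):
#   description = sm.describe_training_job(TrainingJobName=job_name)
#   print(spot_savings_percent(description))
```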

#### Train Without VPC Configured:

To test the networking controls, run the cell below. Here you will first attempt to train the model without an associated network configuration. You should see that the training job is stopped around the time the "Downloading - Downloading input data" message is emitted.

#### Detective control explained

The training job was terminated by an AWS Lambda function that was executed in response to a CloudWatch Event that was triggered when the training job was created. 

To learn more about how the detective control does this, assume the role of the Data Science Administrator and review the code of the [AWS Lambda function SagemakerTrainingJobVPCEnforcer](https://console.aws.amazon.com/lambda/home?#/functions/SagemakerTrainingJobVPCEnforcer?tab=configuration). 

You can also review the [CloudWatch Event rule SagemakerTrainingJobVPCEnforcementRule](https://console.aws.amazon.com/cloudwatch/home?#rules:name=SagemakerTrainingJobVPCEnforcementRule) and take note of the event which triggers execution of the Lambda function.
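
The core of such a detective control can be sketched in a few lines. The example below is not the code of the workshop's Lambda function, which is not reproduced in this notebook; it is a hypothetical illustration of the idea: describe the training job, and stop it if no VPC subnets are configured. How the job name is extracted from the triggering event depends on the event rule and is left out.

```python
def enforce_vpc(sm_client, job_name: str) -> bool:
    """Stop a training job that has no VPC configuration.

    sm_client is expected to be a boto3 SageMaker client.
    Returns True if the job was stopped, False if it was left running.
    """
    description = sm_client.describe_training_job(TrainingJobName=job_name)
    vpc_config = description.get("VpcConfig") or {}
    if vpc_config.get("Subnets"):
        # The job is attached to a VPC, so it complies with the control.
        return False
    sm_client.stop_training_job(TrainingJobName=job_name)
    return True
```

A production version would also need to handle jobs that have already finished, since stopping a completed job raises an error.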

---

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri
image = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/'.format(data_bucket, traindataprefix), content_type='csv')
s3_input_test = sagemaker.s3_input(s3_data='s3://{}/{}/'.format(data_bucket, testdataprefix), content_type='csv')
print ("Training data at: {}".format (s3_input_train.config['DataSource']['S3DataSource']['S3Uri']))
print ("Test data at: {}".format (s3_input_test.config['DataSource']['S3DataSource']['S3Uri']))

In [None]:
xgb = sagemaker.estimator.Estimator(
 image,
 role,
 train_instance_count=1,
 train_instance_type='ml.m4.xlarge',
 train_max_run=3600,
 output_path='s3://{}/{}/models'.format(output_bucket, prefix),
 sagemaker_session=sess,
 train_use_spot_instances=True,
 train_max_wait=3600,
 encrypt_inter_container_traffic=False
) 

xgb.set_hyperparameters(
 max_depth=5,
 eta=0.2,
 gamma=4,
 min_child_weight=6,
 subsample=0.8,
 verbosity=0,
 objective='binary:logistic',
 num_round=100)

xgb.fit(inputs={'train': s3_input_train})


#### Train with VPC

This time, provide the training job with the network settings that were defined above. You should no longer see the **Client Error** from the previous attempt.

In [None]:
preprocessing_trial_component = tracker.trial_component

trial_name = f"cc-fraud-training-job-{int(time.time())}"
cc_trial = Trial.create(
 trial_name=trial_name,
 experiment_name=cc_experiment.experiment_name,
 sagemaker_boto_client=sm)

cc_trial.add_trial_component(preprocessing_trial_component)
cc_training_job_name = "cc-training-job-{}".format(int(time.time()))
xgb = sagemaker.estimator.Estimator(
 image,
 role,
 train_instance_count=1,
 train_instance_type='ml.m4.xlarge',
 train_max_run=3600,
 output_path='s3://{}/{}/models'.format(output_bucket, prefix),
 sagemaker_session=sess,
 train_use_spot_instances=True,
 train_max_wait=3600,
 subnets=subnets, 
 security_group_ids=sec_groups,
 train_volume_kms_key=cmk_id,
 encrypt_inter_container_traffic=False
) 

xgb.set_hyperparameters(
 max_depth=5,
 eta=0.2,
 gamma=4,
 min_child_weight=6,
 subsample=0.8,
 verbosity=0,
 objective='binary:logistic',
 num_round=100)

xgb.fit(
 inputs={'train': s3_input_train},
 job_name=cc_training_job_name,
 experiment_config={
 "TrialName": cc_trial.trial_name,  # log training job in Trials for lineage
 "TrialComponentDisplayName": "Training",
 },
 wait=True,
)
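
Because the estimator above requests Spot capacity, it can be useful to check how much billable time was saved once the job finishes. The sketch below assumes the job description returned by `describe_training_job` includes the `TrainingTimeInSeconds` and `BillableTimeInSeconds` fields; the helper name `spot_savings_percent` is hypothetical.

```python
def spot_savings_percent(job_description):
    """Return the Spot savings as a percentage, or None if timings are unavailable."""
    training = job_description.get("TrainingTimeInSeconds")
    billable = job_description.get("BillableTimeInSeconds")
    if not training or billable is None:
        return None
    # Savings are the share of training time that was not billed.
    return round((1 - billable / training) * 100, 1)
```

In this notebook it could be called as `spot_savings_percent(sm.describe_training_job(TrainingJobName=cc_training_job_name))`, assuming `sm` is the SageMaker boto3 client used elsewhere in the notebook.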


### Part 5 cont: Traceability and Auditability from source control to Model artifacts
---

Having used SageMaker Experiments to track the training runs, you can now extract model metadata to get the entire lineage of the model from the source data to the model artifacts and the hyperparameters.

To do this, simply call the **describe_trial_component** API.

In [None]:
# Present the Model Lineage as a dataframe
from sagemaker.session import Session
sess = boto3.Session()
lineage_table = ExperimentAnalytics(
 sagemaker_session=Session(sess, sm), 
 search_expression={
 "Filters":[{
 "Name": "Parents.TrialName",
 "Operator": "Equals",
 "Value": trial_name
 }]
 },
 sort_by="CreationTime",
 sort_order="Ascending",
)
lineagedf= lineage_table.dataframe()

lineagedf

In [None]:
# get detailed information about a particular trial
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint (sm.describe_trial_component(TrialComponentName=lineagedf.TrialComponentName[1]))
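
The raw response is verbose, so a small helper can pull out just the lineage fields. This is a sketch that assumes the response carries the `InputArtifacts`, `OutputArtifacts`, and `Parameters` maps; the helper name `summarize_trial_component` is hypothetical.

```python
def summarize_trial_component(description):
    """Flatten a DescribeTrialComponent response into a compact lineage summary."""

    def artifact_uris(artifacts):
        # Each artifact is a dict holding the location under "Value".
        return {name: art.get("Value") for name, art in (artifacts or {}).items()}

    def parameter_values(parameters):
        values = {}
        for name, param in (parameters or {}).items():
            # Each parameter carries either a NumberValue or a StringValue.
            values[name] = param.get("NumberValue", param.get("StringValue"))
        return values

    return {
        "name": description.get("TrialComponentName"),
        "inputs": artifact_uris(description.get("InputArtifacts")),
        "outputs": artifact_uris(description.get("OutputArtifacts")),
        "parameters": parameter_values(description.get("Parameters")),
    }
```

For example, `summarize_trial_component(sm.describe_trial_component(TrialComponentName=lineagedf.TrialComponentName[1]))` would return the training inputs, the model artifact location, and the hyperparameters in one dictionary.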

### Part 6: Explainability and Interpretability
---

Now you can download the model artifact locally and extract feature importances from the model. In this case, XGBoost provides out-of-the-box APIs to do so. Some utility functions to extract this information have also been provided.

You can also use SHAP values to understand which features contribute most to the model's predictions.

In [None]:
from util import utilsspec  # assumes utilsspec is provided by the local util package
trial_component_name = lineagedf.TrialComponentName[1]
LOCAL_FILENAME = '{}-model.tar.gz'.format(trial_component_name) # training local file
utilsspec.download_artifacts(trial_component_name, LOCAL_FILENAME) # download training file to local SageMaker volume

In [None]:
model = utilsspec.unpack_model_file(LOCAL_FILENAME) # extract the XGBoost model

**WARNING**: If the cell above fails with an error such as "no Module named xgboost.core", run `!pip install xgboost` in a new cell, restart the kernel, and run the cell again.

In [None]:
utilsspec.plot_features(model, data.columns[1:]) # use XGBoost native functionality to plot feature importance

In [None]:
import shap

In [None]:
traindata = pd.read_csv('train_data.csv', names = ['Label']+newcolorder)
traindata.head()

In [None]:
shap_values = shap.TreeExplainer(model).shap_values(traindata.drop(columns =['Label'])) # or use SHAP values.
shap.summary_plot(shap_values, traindata.drop(columns =['Label']), plot_type="bar")

In [None]:
shap.summary_plot(shap_values, traindata.drop(columns =['Label']))

The SHAP value plot above shows that the most recent payment history is an important feature on average across the training dataset: high values of PAY_0 (that is, customers who have not paid their bill for several months) push the model's output strongly toward predicting a default. Notice that the user features such as marital status and age do not have much importance on average.
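
The bar-style summary plot orders features by their mean absolute SHAP value. The sketch below reproduces that ranking in plain Python; the helper name `rank_by_mean_abs_shap` is hypothetical, and the numbers in the usage lines are made-up toy values rather than results from this notebook.

```python
def rank_by_mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value, largest first.

    shap_rows holds one list of per-feature SHAP values per sample.
    """
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for i, value in enumerate(row):
            totals[i] += abs(value)
    means = [total / len(shap_rows) for total in totals]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Toy values for two samples and two features (illustrative only).
toy_ranking = rank_by_mean_abs_shap([[0.5, -0.1], [-0.7, 0.1]], ["PAY_0", "AGE"])
print(toy_ranking[0][0])  # name of the feature with the largest mean |SHAP|
```

With the real data, `shap_values.tolist()` and the column names of `traindata.drop(columns=['Label'])` could be passed in instead of the toy lists.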

The information included in this notebook is for illustrative purposes only. Nothing in this notebook is intended to provide you legal, compliance, or regulatory guidance. You should review the laws that apply to you.

## Section D: Transition to Deployment

### Git Integration

At this stage you have engineered a feature set, trained a model on the data, and have explored how the model is making decisions. You are now ready to deploy the model and transition from experimentation into operational deployment. To start this transition use the Git repository associated with this project to share your work with other team members. With your code under version control other team members can work to push the model into production after conducting internal review of the code as well as any QA/integration or other testing to make it production ready.

In the next notebook, you will assume the model you trained here is ready to deploy to production. You will deploy the model and monitor its operation for anomalous behavior.

To push this notebook to your project's CodeCommit repository, follow the steps below using either a Terminal window in Jupyter or the Git extension in JupyterLab.

**Via Jupyter Terminal**

In the Jupyter UI click `File` --> `New` and click `Terminal` from the drop down menu.

In the Terminal window, navigate to the local directory containing this project and run the following commands:

```bash
cd ~/SageMaker/
git add 01_SageMaker-DataScientist-Workflow.ipynb 
git commit -m "Completed experimentation and trained initial model" 
git push -u origin master
git log --pretty=oneline
```

**Via JupyterLab Extension**

On the left of the JupyterLab UI you will notice an icon for `Git`. Click this icon and you will see a list of *Changed* files. Hover over `01_SageMaker-DataScientist-Workflow.ipynb` in the list of *Changed* files and click the `+` associated with the file. Towards the bottom of the screen in the `Summary (required)` text field enter "Completed experimentation and trained initial model" and click `Commit`. This commits the changes to the local copy of the Git repository. To push those changes to the team repository click the `Push committed changes` button towards the top which looks like a cloud with an arrow pointing up.

## Conclusions of this notebook

To conclude this portion, you have seen key steps in the data scientist workflow:

1. **Security:** Data exploration and storage of raw data using encryption keys.

1. **Pre-processing:** Data preprocessing both in the notebook and, in a secure manner, using SageMaker Processing with encryption and networking guardrails for data motion.

1. **Built-in algorithm training:** Model training using a SageMaker built-in algorithm.

1. **Cost Optimization:** Training using Spot Instances to save cost.

1. **Lineage and Tracking:** Tracking of model lineage as well as pre-processing job parameters using SageMaker Experiments.

1. **Explainability and Interpretability:** Model feature importance using SHAP.

In [None]:
# Store the values used in this notebook for use in the second demo notebook:
experiment_name = cc_experiment.experiment_name
training_job_name = cc_training_job_name
%store trial_name 
%store experiment_name 
%store training_job_name