{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SageMaker Security Demo Notebook\n", "\n", "In this notebook you will demonstrate how to perform common data science tasks in a secure fashion, consistent with the requirements of regulated customers. This notebook will focus on the data science workflow while the following notebook will focus on the DevOps workflow.\n", "\n", "This notebook is divided into 6 parts:\n", "\n", "1. Compute and Network Isolation\n", "\n", "1. Authentication and Authorization\n", "\n", "1. Artifact Management\n", "\n", "1. Data Encryption\n", "\n", "1. Traceability and Auditability\n", "\n", "1. Explainability and Interpretability" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **Note**\n", "\n", "To create this notebook environment, we packaged the notebooks and example data set using a custom SageMaker image. The Python packages used in the notebook have been installed in shared CodeArtifact repository. Studio can only connect to a local git repository. You can associate a Studio notebook with a \"Private\" Git Repo for maintaining source and code version control.\n", "\n", "However in the SageMaker Studio `System Terminal` or from Studio Git menu you can configure git remote repository for local git repository pointing to your own **Enterprise Git** hosted on-prem, or **BitBucket** or any publicly hosted repo of your choosing. Configure VPC Interface Endpoints powered by AWS PrivateLinks to setup a network path to these repositories.\n", "\n", "We also demonstrate some of the capabilities of pip installing required libraries via this SageMaker Studio notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section A: Environment Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1: Compute and Network Isolation \n", "---\n", "\n", "In this exercise we have launched a Studio Jupyter KernelGateway app. Note that we have configured SageMaker Studio with\n", "network access type of VPC Only, the shared service VPC used by Studio to send all network traffic has **no** Internet access.\n", "The VPC has no Internet connectivity but still maintains access to specific AWS services such as KMS, CloudWatch, CodeArtifact,\n", "CodeCommit, Amazon S3 and so on.\n", "\n", "#### Test Networking\n", "\n", "To demonstrate a lack of Internet connectivity try to execute the below command, it will timeout without a path to the Internet or a proxy server." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!curl https://aws.amazon.com" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By removing public internet access in this way, we have created a secure environment where all the dependencies are installed, but the notebook now has no way to access the internet, and internet traffic cannot reach the notebook either. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2: Authentication and Authorization\n", "---\n", "\n", "SageMaker Studio [UserProfile](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-entity-status.html) needs to be assigned a role for accessing AWS services. 
Fine grained access control over which services a SageMaker notebook is allowed to access can be provided using Identity and Access Management (IAM).\n", "\n", "To control access at a user level, data scientists should typically not be allowed to provision or delete infrastructure,\n", "create or modify IAM roles or change Studio domain configuration. In some cases, even console access can be removed by\n", "creating PreSigned Studio domain URLs, that directly launch SageMaker Studio IDE for data scientists to use from their laptops.\n", "Moreover, admins can use resource [tags for attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) to ensure that different teams of data scientists, with the same high-level IAM role, have different access rights to AWS services, such as only allowing read/write access to specific S3 buckets which match tag criteria. \n", "\n", "For customers with even more stringent data and code segregation requirements, admins can provision different accounts for individual teams and manage the billing from these accounts in a centralized Organizational Unit. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's inspect the role we have created for our notebook here:\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sm = boto3.Session().client('sagemaker')\n", "sess = sagemaker.Session()\n", "region = boto3.session.Session().region_name\n", "\n", "role = get_execution_role()\n", "print (\"Notebook is running with assumed role {}\".format (role))\n", "print(\"Working with AWS services in the {} region\".format(region))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Studio UserProfile IAM Role\n", "\n", "As part of this workshop, we have assigned an IAM role to the Studio UserProfile. This role will be used by the user's\n", "KernelGateway app (notebook is executed by a Studio KernelGateway app hosted by Studio) to access AWS APIs.\n", "Look at the IAM policies attached to this role.\n", "\n", "Below is an example policy which provides least privilege access to various services like Amazon S3 and Amazon SageMaker that a data scientist would need to develop and conduct experiments. 
\n", "\n", "```json\n", "{\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Action\": [\n", " \"ssm:GetParameters\",\n", " \"ssm:GetParameter\"\n", " ],\n", " \"Resource\": \"arn:aws:ssm:us-west-2:0123456789012:parameter/ds-*\",\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Condition\": {\n", " \"Null\": {\n", " \"sagemaker:OutputKmsKey\": \"false\",\n", " \"sagemaker:VolumeKmsKey\": \"false\"\n", " },\n", " \"BoolIfExists\": {\n", " \"sagemaker:InterContainerTrafficEncryption\": \"true\"\n", " }\n", " },\n", " \"Action\": [\n", " \"sagemaker:CreateHyperParameterTuningJob\",\n", " \"sagemaker:CreateProcessingJob\",\n", " \"sagemaker:CreateTrainingJob\",\n", " \"sagemaker:CreateAutoMLJob\",\n", " \"sagemaker:CreateTransformJob\"\n", " ],\n", " \"Resource\": \"*\",\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Condition\": {\n", " \"Null\": {\n", " \"sagemaker:VolumeKmsKey\": \"false\"\n", " },\n", " \"ForAllValues:StringLike\": {\n", " \"sagemaker:InstanceTypes\": \"ml.c5.large\"\n", " }\n", " },\n", " \"Action\": [\n", " \"sagemaker:CreateEndpointConfig\"\n", " ],\n", " \"Resource\": \"*\",\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Condition\": {\n", " \"ForAllValues:StringEqualsIfExists\": {\n", " \"sagemaker:VpcSubnets\": [\n", " \"subnet-002aef0488b7dd0d1\",\n", " \"subnet-05631f54272af4b12\",\n", " \"subnet-06c82d9070b95e2e3\"\n", " ],\n", " \"sagemaker:VpcSecurityGroupIds\": [\n", " \"sg-01d5ee03442c2fa4e\"\n", " ]\n", " }\n", " },\n", " \"Action\": [\n", " \"sagemaker:CreateHyperParameterTuningJob\",\n", " \"sagemaker:CreateProcessingJob\",\n", " \"sagemaker:CreateTrainingJob\",\n", " \"sagemaker:CreateAutoMLJob\",\n", " \"sagemaker:CreateModel\",\n", " \"sagemaker:CreateExperiment\",\n", " \"sagemaker:CreateModelPackage\",\n", " \"sagemaker:CreateModelPackageGroup\",\n", " \"sagemaker:CreateTrial\",\n", " \"sagemaker:CreateTrialComponent\",\n", " \"sagemaker:CreateApp\",\n", " \"sagemaker:DeleteApp\",\n", " \"sagemaker:DescribeApp\"\n", " \"sagemaker:AssociateTrialComponent\",\n", " \"sagemaker:List*\",\n", " \"sagemaker:Describe*\",\n", " \"sagemaker:DeleteExperiment\",\n", " \"sagemaker:DeleteEndpointConfig\",\n", " \"sagemaker:DeleteEndpoint\",\n", " \"sagemaker:DeleteModel\",\n", " \"sagemaker:DeleteModelPackage\",\n", " \"sagemaker:DeleteModelPackageGroup\",\n", " \"sagemaker:DeleteTrial\",\n", " \"sagemaker:DeleteTrialComponent\",\n", " \"sagemaker:StopAutoMLJob\",\n", " \"sagemaker:StopHyperParameterTuningJob\",\n", " \"sagemaker:StopTransformJob\",\n", " \"sagemaker:UpdateEndpoint\",\n", " \"sagemaker:UpdateEndpointWeightsAndCapacities\",\n", " \"sagemaker:UpdateExperiment\",\n", " \"sagemaker:UpdateTrial\",\n", " \"sagemaker:UpdateTrialComponent\",\n", " \"sagemaker:Search\",\n", " ],\n", " \"Resource\": \"*\",\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"application-autoscaling:DeleteScalingPolicy\",\n", " \"application-autoscaling:DeleteScheduledAction\",\n", " \"application-autoscaling:DeregisterScalableTarget\",\n", " \"application-autoscaling:DescribeScalableTargets\",\n", " \"application-autoscaling:DescribeScalingActivities\",\n", " \"application-autoscaling:DescribeScalingPolicies\",\n", " \"application-autoscaling:DescribeScheduledActions\",\n", " \"application-autoscaling:PutScalingPolicy\",\n", " \"application-autoscaling:PutScheduledAction\",\n", " \"application-autoscaling:RegisterScalableTarget\",\n", " \"cloudwatch:DeleteAlarms\",\n", " \"cloudwatch:DescribeAlarms\",\n", " 
\"cloudwatch:GetMetricData\",\n", " \"cloudwatch:GetMetricStatistics\",\n", " \"cloudwatch:ListMetrics\",\n", " \"cloudwatch:PutMetricAlarm\",\n", " \"cloudwatch:PutMetricData\",\n", " \"ec2:CreateNetworkInterface\",\n", " \"ec2:CreateNetworkInterfacePermission\",\n", " \"ec2:DeleteNetworkInterface\",\n", " \"ec2:DeleteNetworkInterfacePermission\",\n", " \"ec2:DescribeDhcpOptions\",\n", " \"ec2:DescribeNetworkInterfaces\",\n", " \"ec2:DescribeRouteTables\",\n", " \"ec2:DescribeSecurityGroups\",\n", " \"ec2:DescribeSubnets\",\n", " \"ec2:DescribeVpcEndpoints\",\n", " \"ec2:DescribeVpcs\",\n", " \"elastic-inference:Connect\",\n", " \"iam:ListRoles\",\n", " \"lambda:ListFunctions\",\n", " \"logs:CreateLogGroup\",\n", " \"logs:CreateLogStream\",\n", " \"logs:DescribeLogStreams\",\n", " \"logs:GetLogEvents\",\n", " \"logs:PutLogEvents\",\n", " \"sns:ListTopics\",\n", " \"codecommit:BatchGetRepositories\",\n", " \"codecommit:ListRepositories\"\n", " ],\n", " \"Resource\": \"*\",\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"kms:CreateGrant\",\n", " \"kms:Decrypt\",\n", " \"kms:DescribeKey\",\n", " \"kms:Encrypt\",\n", " \"kms:ReEncrypt\",\n", " \"kms:GenerateDataKey\",\n", " \"kms:ListAliases\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:kms:us-west-2:0123456789012:key/1ab27534-12a8-4b2a-9876-fd9209dc1234\",\n", " \"arn:aws:kms:us-west-2:0123456789012:key/2ab27534-23a8-4b3a-9876-fd9209dc1234\"\n", " ],\n", " \"Effect\": \"Allow\",\n", " \"Sid\": \"KMSKeyAccess\"\n", " },\n", " {\n", " \"Action\": [\n", " \"codecommit:GitPull\",\n", " \"codecommit:GitPush\",\n", " \"codecommit:*Branch*\",\n", " \"codecommit:*PullRequest*\",\n", " \"codecommit:*Commit*\",\n", " \"codecommit:GetDifferences\",\n", " \"codecommit:GetReferences\",\n", " \"codecommit:GetRepository\",\n", " \"codecommit:GetMerge*\",\n", " \"codecommit:Merge*\",\n", " \"codecommit:DescribeMergeConflicts\",\n", " \"codecommit:*Comment*\",\n", " \"codecommit:*File\",\n", " \"codecommit:GetFolder\",\n", " \"codecommit:GetBlob\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:codecommit:us-west-2:0123456789012:ds-source-fsi-smteam-dev\"\n", " ],\n", " \"Effect\": \"Allow\",\n", " \"Sid\": \"CodeCommitAccess\"\n", " },\n", " {\n", " \"Action\": [\n", " \"ecr:BatchCheckLayerAvailability\",\n", " \"ecr:GetDownloadUrlForLayer\",\n", " \"ecr:GetRepositoryPolicy\",\n", " \"ecr:DescribeRepositories\",\n", " \"ecr:DescribeImages\",\n", " \"ecr:ListImages\",\n", " \"ecr:BatchGetImage\",\n", " \"ecr:GetLifecyclePolicy\",\n", " \"ecr:GetLifecyclePolicyPreview\",\n", " \"ecr:ListTagsForResource\",\n", " \"ecr:DescribeImageScanFindings\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:ecr:*:*:repository/*sagemaker*\",\n", " \"arn:aws:ecr:*:*:repository/ds-shared-container-images\"\n", " ],\n", " \"Effect\": \"Allow\",\n", " \"Sid\": \"ECRAccess\"\n", " },\n", " {\n", " \"Action\": [\n", " \"ecr:GetAuthorizationToken\"\n", " ],\n", " \"Resource\": \"*\",\n", " \"Effect\": \"Allow\",\n", " \"Sid\": \"ECRAuthTokenAccess\"\n", " },\n", " {\n", " \"Action\": [\n", " \"s3:GetObject\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::sagemaker-*/*\"\n", " ],\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"s3:GetObject\",\n", " \"s3:PutObject\",\n", " \"s3:DeleteObject\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::ds-data-bucket-fsi-smteam-dev-*\",\n", " \"arn:aws:s3:::ds-data-bucket-fsi-smteam-dev-*/*\",\n", " \"arn:aws:s3:::ds-model-bucket-fsi-smteam-dev-*\",\n", " 
\"arn:aws:s3:::ds-model-bucket-fsi-smteam-dev-*/*\"\n", " ],\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"s3:GetBucketLocation\",\n", " \"s3:ListBucket\",\n", " \"s3:ListAllMyBuckets\"\n", " ],\n", " \"Resource\": \"*\",\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"lambda:InvokeFunction\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:lambda:*:*:function:*SageMaker*\",\n", " \"arn:aws:lambda:*:*:function:*sagemaker*\",\n", " \"arn:aws:lambda:*:*:function:*Sagemaker*\",\n", " \"arn:aws:lambda:*:*:function:*LabelingFunction*\"\n", " ],\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Condition\": {\n", " \"StringLike\": {\n", " \"iam:AWSServiceName\": \"sagemaker.application-autoscaling.amazonaws.com\"\n", " }\n", " },\n", " \"Action\": \"iam:CreateServiceLinkedRole\",\n", " \"Resource\": \"arn:aws:iam::*:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint\",\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"sns:Subscribe\",\n", " \"sns:CreateTopic\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:sns:*:*:*SageMaker*\",\n", " \"arn:aws:sns:*:*:*Sagemaker*\",\n", " \"arn:aws:sns:*:*:*sagemaker*\"\n", " ],\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Condition\": {\n", " \"StringEquals\": {\n", " \"iam:PassedToService\": [\n", " \"sagemaker.amazonaws.com\"\n", " ]\n", " }\n", " },\n", " \"Action\": [\n", " \"iam:PassRole\"\n", " ],\n", " \"Resource\": \"*\",\n", " \"Effect\": \"Allow\"\n", " }\n", " ]\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": { "deletable": false }, "source": [ "**Optional IAM Activity** Visit the AWS IAM console and review the role for `UserProfile` and its associated permissions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Complete Setup: Import libraries and set global definitions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import os\n", "from time import sleep, gmtime, strftime\n", "import time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import SageMaker Experiments \n", "! pip install sagemaker-experiments\n", "from sagemaker.analytics import ExperimentAnalytics\n", "from smexperiments.experiment import Experiment\n", "from smexperiments.trial import Trial\n", "from smexperiments.trial_component import TrialComponent\n", "from smexperiments.tracker import Tracker" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Import Networking definitions: VPC Id, KMS keys and security groups and subnets\n", "\n", "In this SageMaker Studio KernelGateway app image you used a bash script to create a convenience Python module.\n", "This module is defined in ~/.ipython/sagemaker_environment.py directory of the Studio image and provides Python constants\n", "for values such as the AWS VPC configuration to be used in conjunction with Amazon SageMaker resources or the KMS encryption\n", "key ID to be used with Amazon S3. As part of this notebook you will import this module in the following cells. Feel free\n", "to inspect the source code as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create Networking configuration required for all APIs. 
\n", "from sagemaker.network import NetworkConfig\n", "import sagemaker_environment as smenv\n", "\n", "cmk_id = smenv.SAGEMAKER_KMS_KEY_ID \n", "sec_groups = smenv.SAGEMAKER_SECURITY_GROUPS\n", "subnets = smenv.SAGEMAKER_SUBNETS\n", "network_config = NetworkConfig(security_group_ids = sec_groups, subnets = subnets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Install Libraries using pip (while still being offline!)\n", "\n", "Typically when you use pip to install packages the code is downloaded over the public internet from the PyPI servers.\n", "However most customers do not allow public internet access from their notebook environment. To work within those guidelines,\n", "your Studio environment has been configured to work with a Shared Services AWS CodeArtifact repository populated with approved Python\n", "packages from public PyPI repository. This CodeArtifact repository will allow you to install and validate packages,\n", "as many regulated customers need to validate open source packages through their application security processes before they\n", "can be used by teams. Once packages are installed in CodeArtifact repository external connection to public PyPI is\n", "disassociated to prevent download of unapproved packages. Packages installation is a Data Science Administrator task and it\n", "requires Data Science Administrator role.\n", "\n", "By using a shared services CodeArtifact repository with no external connection to Internet you prevent downloading of unauthorized\n", "Python packages. For the purposes of this demo, you will pip install Shap, to demonstrate communication with the centralized, Shared Services\n", "CodeArtifact repository." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's install the shap library from shared CodeArtifact repository.\n", "! pip install shap\n", "! pip install xgboost" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import xgboost and a custom utilities package we use in this notebook\n", "import xgboost as xgb\n", "from util import utilsspec " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3: Artifact Management \n", "---\n", "\n", "During the machine learning lifecycle a number of artifacts will be generated by our data processing jobs, training jobs and experimentation. To store these artifacts we specify the bucket locations where the model and data artifacts will reside below. These inputs are then fed into the SageMaker Estimators during data pre-processing and model training.\n", "\n", "SageMaker will automatically look in the specified buckets for accessing any training/validation data, and ensure that model outputs are stored in the output directories specified.\n", "\n", "Later on, we will see how to track these artifacts using SageMaker Experiments API.\n", "\n", "The workshop pre-provisioned a set of buckets and their names are included in our `sagemaker_environment.py` file so we will simply import those here directly. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We have already created buckets as part of the Secure Data Science Workshop. 
Here we will simply import those buckets\n", "# for your use.\n", "\n", "# raw_bucket: stores raw data and any preprocessing job related code.\n", "# data_bucket: stores train/test data for training/validating ML models.\n", "# output_bucket: where the model artifacts and outputs will be stored.\n", "# For our demo, these buckets are the same, but as a best practice, you would typically keep them separate with different permissions.\n", "\n", "raw_bucket = smenv.SAGEMAKER_DATA_BUCKET\n", "data_bucket = smenv.SAGEMAKER_DATA_BUCKET\n", "output_bucket = smenv.SAGEMAKER_MODEL_BUCKET\n", "\n", "prefix = 'secure-sagemaker-demo' # use this prefix to store all files pertaining to this workshop.\n", "\n", "dataprefix = prefix + '/data'\n", "traindataprefix = prefix + '/train_data'\n", "testdataprefix = prefix + '/test_data'\n", "\n", "print(\"Storing training data to s3://{}\".format(data_bucket))\n", "print(\"Training job output will be stored in s3://{}\".format(output_bucket))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section B: Pre-processing and Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A key part of the data science lifecycle is data exploration, pre-processing and feature engineering. In this section you will demonstrate how to use SageMaker Studio notebooks for data exploration and SageMaker Processing for feature engineering and pre-processing data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download and Import the data\n", "\n", "For this notebook, we use the public [Credit Card default dataset](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) downloaded from UCI and referenced in:\n", "\n", "    Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.\n", "\n", "Since your notebook does not have Internet connectivity, you cannot download the dataset using the notebook. Instead, you should\n", "have downloaded the credit card default dataset from the UCI repository in the Cloud9 environment, using the instructions provided\n", "in the workshop, prior to starting this notebook. The Cloud9 environment has Internet access for the purpose of this workshop.\n", "You should have also uploaded the dataset to the Shared Service data lake S3 bucket. Next you will download the credit card\n", "default dataset from the data lake so that it is available locally on your SageMaker Studio image.\n", "\n", "The dataset uses user features (age, education level, marital status, etc.) and prior credit card\n", "payment history to predict the likelihood of default on next month's payment. Here a value of `1` indicates default and `0` indicates\n", "no default."
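,
"\n",
"\n",
"For reference, the upload from the Cloud9 environment to the data lake could have been done with a short boto3 snippet like the sketch below. This is illustrative only; the bucket name shown is a placeholder for the data lake bucket created in your workshop account, and the key matches the prefix used by the download cell that follows.\n",
"\n",
"```python\n",
"# Illustrative sketch (run from Cloud9, not from this notebook):\n",
"# stage the UCI dataset into the shared data lake bucket.\n",
"import boto3\n",
"\n",
"s3 = boto3.client('s3')\n",
"s3.upload_file(\n",
"    Filename='credit_card_default_data.xls',  # local file downloaded from UCI\n",
"    Bucket='ds-data-lake-EXAMPLE',            # placeholder: your workshop data lake bucket\n",
"    Key='secure-sagemaker-demo/credit-card-default/credit_card_default_data.xls'\n",
")\n",
"```"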
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "WORKDIR = os.getcwd()\n", "BASENAME = os.path.dirname(WORKDIR)\n", "\n", "# Download credit card default dataset from shared service data lake S3 bucket\n", "ds_data_lake_bucket = smenv.SAGEMAKER_DATA_LAKE_BUCKET\n", "print('Data Lake Bucket: ', ds_data_lake_bucket)\n", "cc_data_lake_prefix = prefix + '/credit-card-default'\n", "cc_uci_dataset_prefix = cc_data_lake_prefix + '/credit_card_default_data.xls'\n", "print('Downloading UCI credit card dataset from: ', cc_uci_dataset_prefix)\n", "# Download data set from s3 data lake\n", "sess.download_data(WORKDIR, bucket=ds_data_lake_bucket, key_prefix=cc_uci_dataset_prefix)\n", "print('Downloaded UCI dataset to: ', WORKDIR)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the dataset to a data frame\n", "data = pd.read_excel('credit_card_default_data.xls', header=1)\n", "data = data.drop(columns = ['ID'])\n", "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note that the categorical columns SEX, Education and Marriage have been Integer Encoded in this case.\n", "# For example:\n", "data.SEX.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "data.rename(columns={\"default payment next month\": \"Label\"}, inplace=True)\n", "lbl = data.Label\n", "data = pd.concat([lbl, data.drop(columns=['Label'])], axis = 1)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Exploration\n", "\n", "The first step in any ML Lifecycle is to explore the dataset to understand the statistical distributions of the features, engineer new ones as well as transform existing features into ML ready features which can be consumed by machine learning models. \n", "\n", "#### Is the data imbalanced?\n", "\n", "One of the first steps in feature engineering is to investigate imbalanced data, i.e. whether there is much more of one label over another. Here we see that we have about 80% class imbalance -- that can be okay for many Machine learning models and does not require special balancing methods. If we have datasets with over 90% imbalance, using some sampling technique to generate a more balanced dataset is usually a good idea." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "sns.countplot(data.Label)\n", "plt.title('Counts of Default versus Non Default Labels')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Are features correlated?\n", "\n", "Simpler models such as linear/logistic regression typically don't perform well if one has correlated features. A simple way to look into this is to plot a correlation matrix as shown below. As you can see, the Payment and Bill features are strongly correlated, which is not surprising. You will also want to see if any features are strongly correlated with the label: if so, you need to ask, \"will this feature be available in the incoming data or is there some leakage of the dependent variable into one of the independent variables?\".\n", "\n", "Here you will include all these features as you have a small dataset; but in general, one may want to explore some kind of dimensionality reduction technique such as Principal Component Analysis (PCA)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Correlation plot\n", "f = plt.figure(figsize=(19, 15))\n", "plt.matshow(data.corr(), fignum=f.number)\n", "plt.xticks(range(data.shape[1]), data.columns, fontsize=14, rotation=45)\n", "plt.yticks(range(data.shape[1]), data.columns, fontsize=14)\n", "cb = plt.colorbar()\n", "cb.ax.tick_params(labelsize=14)\n", "plt.title('Correlation Matrix', fontsize=16);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pandas.plotting import scatter_matrix\n", "SCAT_COLUMNS = ['BILL_AMT1', 'BILL_AMT2', 'PAY_AMT1', 'PAY_AMT2']\n", "scatter_matrix(data[SCAT_COLUMNS],figsize=(10, 10), diagonal ='kde')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preprocessing and Feature Engineering in Notebook\n", "\n", "First we will run the feature engineering directly in our SageMaker Studio notebook. While this is okay for small datasets, it is not really recommended at scale. Moreover, it is hard to track Feature Engineering jobs from a versioning and lineage perspective if it is run in an ad hoc manner inside a notebook instance.\n", "\n", "In the cells that follow you will see how to use SageMaker Processing to scale out our feature engineering jobs. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists('rawdata/rawdata.csv'):\n", " !mkdir rawdata\n", " data.to_csv('rawdata/rawdata.csv', index=None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#upload the raw data to S3.\n", "rawdataprefix = 'rawdata'\n", "raw_data_location = sess.upload_data(rawdataprefix, bucket=raw_bucket, key_prefix=dataprefix)\n", "print(raw_data_location)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "from sklearn.compose import make_column_transformer\n", "\n", "COLS = data.columns\n", "X_train, X_test, y_train, y_test = train_test_split(data.drop('Label', axis=1), data['Label'], \n", " test_size=0.2, random_state=0)\n", "newcolorder = ['PAY_AMT1','BILL_AMT1'] + list(COLS[1:])[:11] + list(COLS[1:])[12:17] + list(COLS[1:])[18:]\n", "\n", "preprocess = make_column_transformer(\n", " (StandardScaler(),['PAY_AMT1']),\n", " (MinMaxScaler(), ['BILL_AMT1']),\n", " remainder='passthrough')\n", " \n", "print('Running preprocessing and feature engineering transformations')\n", "train_features = pd.DataFrame(preprocess.fit_transform(X_train), columns = newcolorder)\n", "test_features = pd.DataFrame(preprocess.transform(X_test), columns = newcolorder)\n", "train_full = pd.concat([pd.DataFrame(y_train.values, columns=['Label']), pd.DataFrame(train_features)], axis=1)\n", "test_full = pd.concat([pd.DataFrame(y_test.values, columns=['Label']), pd.DataFrame(test_features)], axis=1)\n", "train_full.to_csv('train_data.csv', index=False, header=False)\n", "test_full.to_csv('test_data.csv', index=False, header=False) \n", "print(\"Completed transformation, training set has shape: {}\".format (train_features.shape))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# upload the training and validation data in S3. 
\n", "print(sess.upload_data('train_data.csv', bucket=data_bucket, key_prefix=traindataprefix))\n", "print(sess.upload_data('test_data.csv', bucket=data_bucket, key_prefix=testdataprefix))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Secure and scalable Feature Engineering pipeline using SageMaker Processing\n", "\n", "While you can pre-process small amounts of data directly in a notebook as shown above, SageMaker Processing offloads the heavy lifting of pre-processing larger datasets by provisioning the underlying infrastructure, securely downloading the data from an S3 location to the processing container, running the processing scripts, storing the processed data in an output directory in Amazon S3 and deleting the underlying transient resources needed to run the processing job. Once the processing job is complete, the infrastructure used to run the job is wiped, and any temporary data stored on it is deleted.\n", "\n", "Importantly as we see below, we can now track this part of our analysis process to ensure that the lineage of our downstream trained ML models can be versioned and tracked to a feature engineering pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Write a preprocessing script (same as above)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%writefile preprocessing.py\n", "\n", "import argparse\n", "import os\n", "import warnings\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "from sklearn.exceptions import DataConversionWarning\n", "from sklearn.compose import make_column_transformer\n", "\n", "warnings.filterwarnings(action='ignore', category=DataConversionWarning)\n", "\n", "if __name__=='__main__':\n", " parser = argparse.ArgumentParser()\n", " parser.add_argument('--train-test-split-ratio', type=float, default=0.3)\n", " parser.add_argument('--random-split', type=int, default=0)\n", " args, _ = parser.parse_known_args()\n", " \n", " print('Received arguments {}'.format(args))\n", "\n", " input_data_path = os.path.join('/opt/ml/processing/input', 'rawdata.csv')\n", " \n", " print('Reading input data from {}'.format(input_data_path))\n", " df = pd.read_csv(input_data_path)\n", " df.sample(frac=1)\n", " \n", " COLS = df.columns\n", " newcolorder = ['PAY_AMT1','BILL_AMT1'] + list(COLS[1:])[:11] + list(COLS[1:])[12:17] + list(COLS[1:])[18:]\n", " \n", " split_ratio = args.train_test_split_ratio\n", " random_state=args.random_split\n", " \n", " X_train, X_test, y_train, y_test = train_test_split(df.drop('Label', axis=1), df['Label'], \n", " test_size=split_ratio, random_state=random_state)\n", " \n", " preprocess = make_column_transformer(\n", " (['PAY_AMT1'], StandardScaler()),\n", " (['BILL_AMT1'], MinMaxScaler()),\n", " remainder='passthrough')\n", " \n", " print('Running preprocessing and feature engineering transformations')\n", " train_features = pd.DataFrame(preprocess.fit_transform(X_train), columns = newcolorder)\n", " test_features = pd.DataFrame(preprocess.transform(X_test), columns = newcolorder)\n", " \n", " # concat to ensure Label column is the first column in dataframe\n", " train_full = pd.concat([pd.DataFrame(y_train.values, columns=['Label']), train_features], axis=1)\n", " test_full = pd.concat([pd.DataFrame(y_test.values, columns=['Label']), test_features], axis=1)\n", " \n", " print('Train data shape after 
preprocessing: {}'.format(train_features.shape))\n", " print('Test data shape after preprocessing: {}'.format(test_features.shape))\n", " \n", " train_features_headers_output_path = os.path.join('/opt/ml/processing/train_headers', 'train_data_headers.csv')\n", " \n", " train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_data.csv')\n", " \n", " test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_data.csv')\n", " \n", " print('Saving training features to {}'.format(train_features_output_path))\n", " train_full.to_csv(train_features_output_path, header=False, index=False)\n", " print(\"Complete\")\n", " \n", " print(\"Save training data with headers to {}\".format(train_features_headers_output_path))\n", " train_full.to_csv(train_features_headers_output_path, index=False)\n", " \n", " print('Saving test features to {}'.format(test_features_output_path))\n", " test_full.to_csv(test_features_output_path, header=False, index=False)\n", " print(\"Complete\")\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copy the preprocessing code over to the s3 bucket\n", "codeprefix = prefix + '/code'\n", "codeupload = sess.upload_data('preprocessing.py', bucket=raw_bucket, key_prefix=codeprefix)\n", "print(codeupload)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_data_location = 's3://'+ data_bucket + '/' + traindataprefix\n", "train_header_location = 's3://'+ data_bucket +'/'+ prefix +'/train_headers'\n", "test_data_location = 's3://'+ data_bucket+'/'+testdataprefix\n", "print(\"Training data location = {}\".format(train_data_location))\n", "print(\"Test data location = {}\".format(test_data_location))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 4: Data Encryption\n", "---\n", "\n", "To ensure that the processed data is encrypted at rest on the processing cluster, we provide a customer managed key to the volume_kms_key command below. This instructs Amazon SageMaker to encrypt the EBS volumes used during the processing job with the specified key. Since our data stored in Amazon S3 buckets are already encrypted, data is encrypted at rest at all times.\n", "\n", "Amazon SageMaker always uses TLS encrypted tunnels when working with Amazon SageMaker so data is also encrypted in transit when traveling from or to Amazon S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Use SageMaker Processing with SKLearn. 
-- combine data into train and test at this stage if possible.\n", "from sagemaker.sklearn.processing import SKLearnProcessor\n", "sklearn_processor = SKLearnProcessor(\n", " framework_version='0.20.0',\n", " role=role,\n", " instance_type='ml.c4.xlarge',\n", " instance_count=1,\n", " network_config=network_config, # attach SageMaker resources to your VPC\n", " volume_kms_key=cmk_id # encrypt the EBS volume attached to SageMaker Processing instance\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ProcessingInput, ProcessingOutput\n", "\n", "sklearn_processor.run(\n", " code=codeupload,\n", " inputs=[\n", " ProcessingInput(\n", " source=raw_data_location, destination='/opt/ml/processing/input')\n", " ],\n", " outputs=[\n", " ProcessingOutput(\n", " output_name='train_data',\n", " source='/opt/ml/processing/train',\n", " destination=train_data_location),\n", " ProcessingOutput(\n", " output_name='test_data',\n", " source='/opt/ml/processing/test',\n", " destination=test_data_location),\n", " ProcessingOutput(\n", " output_name='train_data_headers',\n", " source='/opt/ml/processing/train_headers',\n", " destination=train_header_location)\n", " ],\n", " arguments=['--train-test-split-ratio', '0.2'])\n", "\n", "preprocessing_job_description = sklearn_processor.jobs[-1].describe()\n", "\n", "output_config = preprocessing_job_description['ProcessingOutputConfig']\n", "for output in output_config['Outputs']:\n", " if output['OutputName'] == 'train_data':\n", " preprocessed_training_data = output['S3Output']['S3Uri']\n", " if output['OutputName'] == 'test_data':\n", " preprocessed_test_data = output['S3Output']['S3Uri']" ] }, { "cell_type": "markdown", "source": [ "#### Upload training and test datasets to Shared Service Data Lake for detective control demonstration\n", "\n", "Now let's upload training and test datasets to the Shared Service data lake bucket. This is so that training job can read these datasets\n", "for detective control demonstration later in this notebook when training without specifying a VPC. The reason is that team's\n", "S3 buckets have a very restrictive bucket policy and the training job will fail to launch without configuring VPC since it can't access\n", "the datasets in team's bucket. For demonstration purposes, the data lake doesn't have a bucket policy. However, the S3\n", "gateway endpoint does have endpoint policy that is restrictive. It allows traffic only to the Shared Service data lake bucket\n", "and to data science team's data and model buckets. 
If you examine the S3 Gateway endpoint policy you will notice the following:\n", "\n", "```yaml\n", "{\n", " \"Effect\":\"Allow\",\n", " \"Principal\": \"*\",\n", " \"Action\":[\n", " \"s3:GetObject\",\n", " \"s3:PutObject\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\":[\n", " \"arn:aws:s3:::ds-model-bucket-*\",\n", " \"arn:aws:s3:::ds-data-bucket-*\",\n", " \"arn:aws:s3:::ds-model-bucket-*/*\",\n", " \"arn:aws:s3:::ds-data-bucket-*/*\",\n", " \"arn:aws:s3:::*ds-data-lake*\",\n", " \"arn:aws:s3:::*ds-data-lake*/*\"\n", " ]\n", "}\n", "```\n", "\n", "This policy allows S3 Gateway endpoint to allow access to data science team's data and model buckets as well to the Shared\n", "Service data lake bucket but nothing else.\n", "\n", "Further if you examine the SageMaker execution role for the SageMaker Studio UserProfile, you will notice that it allows access to\n", "data science team's data and model buckets and to the Shared Service data lake which limits notebook's\n", "S3 bucket access to minimum required but nothing more:\n", "\n", "```yaml\n", "{\n", " \"Action\": [\n", " \"s3:GetObject\",\n", " \"s3:PutObject\",\n", " \"s3:DeleteObject\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::ds-data-bucket-fsi-smteam-dev-*\",\n", " \"arn:aws:s3:::ds-data-bucket-fsi-smteam-dev-*/*\",\n", " \"arn:aws:s3:::ds-model-bucket-fsi-smteam-dev-*\",\n", " \"arn:aws:s3:::ds-model-bucket-fsi-smteam-dev-*/*\",\n", " \"arn:aws:s3:::ds-data-lake*\",\n", " \"arn:aws:s3:::ds-data-lake*/*\"\n", " ],\n", " \"Effect\": \"Allow\"\n", " }\n", "```\n", "\n", "Note that this policy can be further tightened where write to the data lake bucket can be disallowed, making it read only.\n", "\n", "Finally, for the data science team's data and model buckets, the attached bucket policy only allows access from SageMaker\n", "Studio's VPC, further restricting access to these buckets:\n", "\n", "```yaml\n", "{\n", " \"Version\": \"2008-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Deny\",\n", " \"Principal\": \"*\",\n", " \"Action\": [\n", " \"s3:GetObject\",\n", " \"s3:PutObject\",\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::ds-data-bucket-fsi-smteam-dev-060560a8e64a/*\",\n", " \"arn:aws:s3:::ds-data-bucket-fsi-smteam-dev-060560a8e64a\"\n", " ],\n", " \"Condition\": {\n", " \"StringNotEquals\": {\n", " \"aws:SourceVpce\": \"vpce-055dd71a983f8724e\"\n", " }\n", " }\n", " }\n", " ]\n", "}\n", "```\n", "\n", "Together these policies provide fairly fine-grained access control to S3 buckets and are quite restrictive.\n", "\n", "Let's upload the training and test data sets generated from data pre-processing in the previous step to the data lake.\n", "Note that this is only so that we can demonstrate detective control in-action." 
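,
"\n",
"\n",
"Before moving on, note that the bucket policy shown above could be inspected directly with boto3 if your role permits it, as in the sketch below. The data scientist execution role used in this notebook may not allow `s3:GetBucketPolicy`, in which case the call is denied.\n",
"\n",
"```python\n",
"# Illustrative sketch: read back the restrictive bucket policy on the team data bucket.\n",
"# The notebook execution role may not have s3:GetBucketPolicy, so this may raise AccessDenied.\n",
"import json\n",
"import boto3\n",
"\n",
"s3 = boto3.client('s3')\n",
"try:\n",
"    policy = s3.get_bucket_policy(Bucket=data_bucket)\n",
"    print(json.dumps(json.loads(policy['Policy']), indent=2))\n",
"except Exception as e:\n",
"    print('Could not read bucket policy:', e)\n",
"```"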
], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "# Download the preprocessed datasets\n", "training_dataset_prefix = traindataprefix + '/train_data.csv'\n", "test_dataset_prefix = testdataprefix + '/test_data.csv'\n", "sess.download_data(WORKDIR, bucket=data_bucket, key_prefix=training_dataset_prefix)\n", "sess.download_data(WORKDIR, bucket=data_bucket, key_prefix=test_dataset_prefix)\n", "\n", "# Upload training and test data to data lake for detective control demonstration later in this notebook\n", "cc_s3_train_datalake_location = sess.upload_data('train_data.csv', bucket=ds_data_lake_bucket, key_prefix=cc_data_lake_prefix)\n", "cc_s3_test_datalake_location = sess.upload_data('test_data.csv', bucket=ds_data_lake_bucket, key_prefix=cc_data_lake_prefix)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section C: Model development and Training" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 5. Traceability and Auditability \n", "---\n", "\n", "We use SageMaker Experiments for data scientists to track the lineage of the model from the raw data source to the preprocessing steps and the model training pipeline. With SageMaker Experiments, data scientists can compare, track and manage multiple diferent model training jobs, data processing jobs, and hyperparameter tuning jobs, retaining a lineage from the source data to the training job artifacts to the model hyperparameters and any custom metrics that they may want to monitor as part of the model training.\n", "\n", "Here we used SageMaker's managed XGBoost container to train an XGBoost model. More details about the managed container can be found here: https://github.com/aws/sagemaker-xgboost-container\n", "\n", "Many customers require tracking and lineage to the source code level, which keeps track of which user made the most recent commit that produced the training code, which generated the deployed production model. We demonstrate how this is done using Github APIs and integrated into SageMaker Experiments" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a SageMaker Experiment\n", "cc_experiment = Experiment.create(\n", " experiment_name=f\"CreditCardDefault-{int(time.time())}\", \n", " description=\"Predict credit card default from payments data\", \n", " sagemaker_boto_client=sm)\n", "print(cc_experiment)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can track your SageMaker processing job as shown below. Here you will track the train_test_split_ratio, but you can track all kinds of other metadata such as the underlying instance types used to run the processing job or any specific feature engineering steps such as the random seed used to generate the train, test splits." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start Tracking parameters used in the Pre-processing pipeline.\n", "with Tracker.create(display_name=\"Preprocessing\", sagemaker_boto_client=sm) as tracker:\n", " tracker.log_parameters({\n", " \"train_test_split_ratio\": 0.2\n", " })\n", " # we can log the s3 uri to the dataset we just uploaded\n", " tracker.log_input(name=\"ccdefault-raw-dataset\", media_type=\"s3/uri\", value=raw_data_location)\n", " tracker.log_input(name=\"ccdefault-train-dataset\", media_type=\"s3/uri\", value=train_data_location)\n", " tracker.log_input(name=\"ccdefault-test-dataset\", media_type=\"s3/uri\", value=test_data_location)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train the Model\n", "\n", "The same security practices you applied previously during SM Processing apply to training jobs. You will also have SageMaker experiments track the training job and store metadata such as model artifact location, training and validation data location, and model hyperparameters.\n", "\n", "**Managed Spot Training**: To save on cost, you can run the training using managed Spot instances. SageMaker will automatically look to see if any spot instances of the desired type are available for a max time less than the max wait time, and if one is available, run your training job on the lower cost instance. With Managed Spot, customers can benefit from up-to 90% savings in cost.\n", "\n", "For bring your own containers, customers are responsible for checkpointing models for the spot instances to resume training in the event that a training job is interrupted. For some SageMaker built-in algorithms, as well as SageMaker managed containers for Tensorflow/PyTorch/MxNet, SageMaker will handle the model checkpointing. For others, such as XgBoost, you will limit the max_wait_time to 3600 seconds. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Train Without VPC Configured:\n", "\n", "To test the networking controls, run the following cell below. Here you will first attempt to train the model without an associated network configuration. You should see that the training job is stopped around the same time as the \"Downloading - Downloading input data\" message is emitted. \n", "\n", "#### Detective control explained\n", "\n", "The training job was terminated by an AWS Lambda function that was executed in response to a CloudWatch Event that was triggered when the training job was created. \n", "\n", "To learn more about how the detective control does this, assume the role of the Data Science Administrator and review the code of the [AWS Lambda function SagemakerTrainingJobVPCEnforcer](https://console.aws.amazon.com/lambda/home?#/functions/SagemakerTrainingJobVPCEnforcer?tab=configuration). 
\n", "\n", "You can also review the [CloudWatch Event rule SagemakerTrainingJobVPCEnforcementRule](https://console.aws.amazon.com/cloudwatch/home?#rules:name=SagemakerTrainingJobVPCEnforcementRule) and take note of the event which triggers execution of the Lambda function.\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "image=sagemaker.image_uris.retrieve(framework='xgboost', region=boto3.Session().region_name, version='1.2-2')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_input_train = sagemaker.inputs.TrainingInput(s3_data=cc_s3_train_datalake_location, content_type='csv')\n", "s3_input_test = sagemaker.inputs.TrainingInput(s3_data=cc_s3_test_datalake_location, content_type='csv')\n", "\n", "print (\"Training data at: {}\".format (s3_input_train.config['DataSource']['S3DataSource']['S3Uri']))\n", "print (\"Test data at: {}\".format (s3_input_test.config['DataSource']['S3DataSource']['S3Uri']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "xgb = sagemaker.estimator.Estimator(\n", " image,\n", " role,\n", " instance_count=1,\n", " instance_type='ml.m4.xlarge',\n", " max_run=3600,\n", " output_path='s3://{}/{}/models'.format(output_bucket, prefix),\n", " sagemaker_session=sess,\n", " use_spot_instances=True,\n", " max_wait=3600,\n", " disable_profiler=True,\n", " encrypt_inter_container_traffic=False\n", ") \n", "\n", "xgb.set_hyperparameters(\n", " max_depth=5,\n", " eta=0.2,\n", " gamma=4,\n", " min_child_weight=6,\n", " subsample=0.8,\n", " verbosity=0,\n", " objective='binary:logistic',\n", " num_round=100)\n", "\n", "xgb.fit(inputs={'train': s3_input_train})\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** We disable SageMaker profiling to enable starting of SageMaker Training job above. This is because the profiler runs\n", "as a sidecar container, the startup of sidecar container requires access to the S3 model bucket but without VPC related\n", "parameters access to the S3 model bucket is denied, and start of training jobs fails. To demonstrate detective\n", "control in action, we turn off profiling so that training job can start without profiling but is stopped by\n", "the detective control.\n", "\n", "#### Train with VPC\n", "_NOTE_: You may have to interrupt the kernel before executing the next step.\n", "\n", "This time provide the training job with the network settings that were defined above. This time we shouldn't see the **Client Error** as before!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "preprocessing_trial_component = tracker.trial_component\n", "\n", "trial_name = f\"cc-fraud-training-job-{int(time.time())}\"\n", "cc_trial = Trial.create(\n", " trial_name=trial_name,\n", " experiment_name=cc_experiment.experiment_name,\n", " sagemaker_boto_client=sm)\n", "\n", "cc_trial.add_trial_component(preprocessing_trial_component)\n", "cc_training_job_name = \"cc-training-job-{}\".format(int(time.time()))\n", "xgb = sagemaker.estimator.Estimator(\n", " image,\n", " role,\n", " instance_count=1,\n", " instance_type='ml.m4.xlarge',\n", " max_run=3600,\n", " output_path='s3://{}/{}/models'.format(output_bucket, prefix),\n", " sagemaker_session=sess,\n", " use_spot_instances=True,\n", " max_wait=3600,\n", " subnets=subnets, \n", " security_group_ids=sec_groups,\n", " volume_kms_key=cmk_id,\n", " encrypt_inter_container_traffic=False\n", ") \n", "\n", "xgb.set_hyperparameters(\n", " max_depth=5,\n", " eta=0.2,\n", " gamma=4,\n", " min_child_weight=6,\n", " subsample=0.8,\n", " verbosity=0,\n", " objective='binary:logistic',\n", " num_round=100)\n", "\n", "xgb.fit(\n", " inputs={'train': s3_input_train},\n", " job_name=cc_training_job_name,\n", " experiment_config={\n", " \"TrialName\":\n", " cc_trial.trial_name, #log training job in Trials for lineage\n", " \"TrialComponentDisplayName\": \"Training\",\n", " },\n", " wait=True,\n", ")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 5 cont: Traceability and Auditability from source control to Model artifacts\n", "---\n", "\n", "Having used SageMaker Experiments to track the training runs, you can now extract model metadata to get the entire lineage of the model from the source data to the model artifacts and the hyperparameters.\n", "\n", "To do this, simply call the **describe_trial_component** API." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Present the Model Lineage as a dataframe\n", "from sagemaker.session import Session\n", "sess = boto3.Session()\n", "lineage_table = ExperimentAnalytics(\n", " sagemaker_session=Session(sess, sm), \n", " search_expression={\n", " \"Filters\":[{\n", " \"Name\": \"Parents.TrialName\",\n", " \"Operator\": \"Equals\",\n", " \"Value\": trial_name\n", " }]\n", " },\n", " sort_by=\"CreationTime\",\n", " sort_order=\"Ascending\",\n", ")\n", "lineagedf= lineage_table.dataframe()\n", "\n", "lineagedf" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get detailed information about a particular trial\n", "import pprint\n", "pp = pprint.PrettyPrinter(indent=4)\n", "pp.pprint (sm.describe_trial_component(TrialComponentName=lineagedf.TrialComponentName[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 6: Explainability and Interpretability\n", "---\n", "\n", "Now you can download the model artifact locally and extract feature importances from the model. In this case, XGBoost provides out of box APIs to do so. Some utility functions to extract this information have also been provided.\n", "\n", "You can use SHAP values to understand which features contribute most to the model performance." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import util\n", "trial_component_name = lineagedf.TrialComponentName[1]\n", "LOCAL_FILENAME = '{}-model.tar.gz'.format(trial_component_name) # training local file\n", "utilsspec.download_artifacts(trial_component_name, LOCAL_FILENAME) # download training file to local SageMaker volume" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = utilsspec.unpack_model_file(LOCAL_FILENAME) # extract the XGBoost model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**WARNING**: If you get an error here above such as \"no Module named xgboost.core\", simply run !pip install xgboost in a new cell --> Restart the kernel and run again. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "utilsspec.plot_features(model, data.columns[1:]) # use XGBoost native functionality to plot feature importance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shap" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "traindata = pd.read_csv('train_data.csv', names = ['Label']+newcolorder)\n", "traindata.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shap_values = shap.TreeExplainer(model).shap_values(traindata.drop(columns =['Label'])) # or use SHAP values.\n", "shap.summary_plot(shap_values, traindata.drop(columns =['Label']), plot_type=\"bar\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "shap.summary_plot(shap_values, traindata.drop(columns =['Label']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SHAP value plot above shows that the most recent payment history is an important feature on average across the training dataset -- users that have high values of PAY_0 (i.e. have not paid their bill for several months) have a strong impact on the model's output of predicting a default. Notice that the user features (marital status, age) etc do not have much importance on average.\n", "\n", "The information included in this notebook is for illustrative purposes only. Nothing in this notebook is intended to provide you legal, compliance, or regulatory guidance. You should review the laws that apply to you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section D: Transition to Deployment\n", "\n", "### Git Integration\n", "\n", "At this stage you have engineered a feature set, trained a model on the data, and have explored how the model is making decisions.\n", "You are now ready to deploy the model and transition from experimentation into operational deployment. To start this transition\n", "use the Git repository associated with this project to share your work with other team members. With your code under version\n", "control other team members can work to push the model into production after conducting internal review of the code as well\n", "as any QA/integration or other testing to make it production ready.\n", "\n", "In the next notebook, you will assume the model you trained here is ready to deploy to production. 
You will deploy the\n", "model and monitor its operation for anomalous behavior.\n", "\n", "To push this notebook to your project's CodeCommit repository follow the following steps using either a Studio System Terminal\n", "window or using the Git extension in Studio.\n", "\n", "**Via Studio System Terminal**\n", "\n", "In the Studio UI click `File` --> `New Launcher` and in `Launcher` tab under `Utilities and files` click on `System Terminal`.\n", "\n", "In the Terminal window, navigate to the local directory containing this project and run the following cells:\n", "\n", "```bash\n", "cd ~/\n", "git add 00_SageMaker-SysOps-Workflow.ipynb\n", "git commit -m \"Completed experimentation and trained initial model\"\n", "git push -u origin main\n", "git log --pretty=oneline\n", "```\n", "\n", "**Via SageMaker Studio Git Extension**\n", "\n", "On the left of the Studio UI you will notice an icon for `Git`. Click this icon and you will see git repository a list of\n", "*Changed* files under `Changes` tab. Hover over `01_SageMaker-DataScientist-Workflow.ipynb` in the list of *Changed* files\n", "and click the `+` associated with the file to stage changes. Towards the bottom of the screen in the `Summary (required)`\n", "text field enter \"Completed experimentation and trained initial model\" and click `Commit`. This commits the changes to the\n", "local copy of the Git repository. To push those changes to the team repository click the `Push committed changes` button\n", "towards the top which looks like a cloud with an arrow pointing up." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions of this notebook\n", "\n", "To conclude this portion, you have seen key steps in the data scientist workflow:\n", "\n", "1. **Security**: Data exploration and storage of raw data using encryption keys\n", "\n", "1. **Pre-processing:** Data preprocessing both in notebook, and in a secure manner using SageMaker Processing with encryption and networking guardrails for data motion.\n", "\n", "1. **Built-in algorithm training:** Use SageMaker built in algorithm for model training\n", "\n", "1. **Cost Optimization:** Training using Spot Instances to save cost. \n", "\n", "1. **Lineage and Tracking:** Tracking of model lineage as well as pre-processing job parameters using SageMaker Experiments.\n", "\n", "1. **Explainability and Interpretability**: Model Feature importance using SHAP." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Store the values used in this notebook for use in the second demo notebook:\n", "trial_name = trial_name \n", "experiment_name = cc_experiment.experiment_name\n", "training_job_name = cc_training_job_name\n", "%store trial_name \n", "%store experiment_name \n", "%store training_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 4 }