{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# R Workflow in Amazon SageMaker\n", "\n", "**Summary:**\n", "\n", "This sample Notebook demonstrates an end-to-end workflow for R in SageMaker, including Hyperparameter tuning and how to generate predictions using two methods: (1) offline [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html); and (2) [deploying the model](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html) as a SageMaker endpoint and making online inferences. Additionally, we'll see how to move a workflow from notebook prototyping to production with automated pipelines, in this case the SageMaker Pipelines feature. \n", "\n", "Since the focus is on demonstrating SageMaker mechanics, the use case is a simple one: we'll predict abalone age, which is measured by the number of rings in the shell. We'll use the public [abalone dataset](https://archive.ics.uci.edu/ml/datasets/abalone) hosted by [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). \n", "\n", "We will use two different libraries to interact with SageMaker:\n", "- [`Reticulate` library](https://rstudio.github.io/reticulate/): provides an R interface to make SageMaker API calls [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/latest/index.html). The `reticulate` package translates between R and Python objects, and Amazon SageMaker provides a serverless data science environment to train and deploy ML models at scale.\n", "- [`paws` library](https://cran.r-project.org/web/packages/paws/index.html): provides an interface to make API calls to AWS services, similar to how [`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) works. `boto3` is the Amazon Web Services (AWS) SDK for Python. It enables Python developers to create, configure, and manage AWS services, such as EC2 and S3. Boto provides an easy to use, object-oriented API, as well as low-level access to AWS services. 
The [`paws` library](https://cran.r-project.org/web/packages/paws/index.html) provides similar capabilities natively in R, but is not used in this notebook.\n", "\n", "Table of Contents:\n", "- [Reticulating the Amazon SageMaker Python SDK](#Reticulating-the-Amazon-SageMaker-Python-SDK)\n", "- [Creating and Accessing the Data Storage](#Creating-and-Accessing-the-Data-Storage)\n", "- [Downloading and Processing the Dataset](#Downloading-and-Processing-the-Dataset)\n", "- [Preparing the Dataset for Model Training](#Preparing-the-Dataset-for-Model-Training)\n", "- [Hyperparameter Tuning for the XGBoost Model](#Hyperparameter-Tuning-for-the-XGBoost-Model)\n", " - [Using the boto3 SDK to Interact with AWS Services and Get the Status of the Tuning Job](#Using-the-boto3-SDK-to-Interact-with-AWS-Services-and-Get-the-Status-of-the-Tuning-Job)\n", "- [Option 1: Batch Transform](#Option-1:-Batch-Transform)\n", " - [Create a Model using the Best Training Job](#Create-a-Model-using-the-Best-Training-Job)\n", " - [Batch Transform using the Tuned Estimator](#Batch-Transform-using-the-Tuned-Estimator)\n", " - [Download the batch job output](#Download-the-batch-job-output)\n", "- [Option 2: Generate Predictions from an Endpoint](#Option-2:-Generate-Predictions-from-an-Endpoint)\n", " - [Deploying the Tuner](#Deploying-the-Tuner)\n", " - [Generating Predictions with the Deployed Model](#Generating-Predictions-with-the-Deployed-Model)\n", " - [Deleting the Endpoint](#Deleting-the-Endpoint)\n", "- [Automating the workflow with SageMaker Pipelines](#Automating-the-workflow-with-SageMaker-Pipelines)\n", " - [Pipeline parameters](#Pipeline-parameters)\n", " - [Creating pipeline steps](#Creating-pipeline-steps)\n", " - [Setting up and starting the pipeline](#Setting-up-and-starting-the-pipeline)\n", " - [Viewing the pipeline in the Studio UI](#Viewing-the-pipeline-in-the-Studio-UI)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reticulating the Amazon SageMaker Python SDK\n", "\n", "First, load the `reticulate` library and import the high-level `sagemaker` Python SDK. Once it is loaded, you can use the `$` notation in R instead of the `.` notation in Python to use available classes. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Turn warnings off globally\n", "options(warn=-1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the reticulate library and import the sagemaker Python SDK\n", "library(reticulate)\n", "sagemaker <- import('sagemaker')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating and Accessing the Data Storage\n", "\n", "The `Session` class provides operations for working with the following [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) resources with Amazon SageMaker:\n", "\n", "* [S3](https://boto3.readthedocs.io/en/latest/reference/services/s3.html): AWS object storage service.\n", "* [SageMaker](https://boto3.readthedocs.io/en/latest/reference/services/sagemaker.html): APIs covering most SageMaker functionality.\n", "* [SageMakerRuntime](https://boto3.readthedocs.io/en/latest/reference/services/sagemaker-runtime.html): APIs for SageMaker real time and asynchronous endpoints. (There are also separate APIs for SageMaker Feature Store and Edge Manager.)\n", "\n", "Let's create an [Amazon Simple Storage Service](https://aws.amazon.com/s3/) bucket for your data. 
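\n", "\n", "The default bucket is convenient for experimentation. If your account requires a specific bucket instead, you could simply assign its name and use it the same way throughout the notebook (a minimal sketch; `my-existing-bucket` is a hypothetical placeholder):\n", "\n", "```r\n", "# Hypothetical alternative to the default bucket created below\n", "# bucket <- 'my-existing-bucket'\n", "```\n", "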
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session <- sagemaker$Session()\n", "bucket <- session$default_bucket()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note** - The `default_bucket` function creates a unique Amazon S3 bucket with the following name: \n", "\n", "`sagemaker--`\n", "\n", "An AWS IAM role provides granular permissions to use AWS services. Here we specify the IAM role's [ARN](https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html) to allow Amazon SageMaker to access the Amazon S3 bucket. You can use the same IAM role used to create this Notebook:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "role_arn <- sagemaker$get_execution_role()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading and Processing the Dataset\n", "\n", "The model uses the [abalone dataset](https://archive.ics.uci.edu/ml/datasets/abalone) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). First, download the data and start the [exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis). Use tidyverse packages to read, plot, and transform the data into ML format for Amazon SageMaker:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(readr)\n", "data_file <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'\n", "abalone <- read_csv(file = data_file, col_names = FALSE)\n", "names(abalone) <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')\n", "head(abalone)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output above shows that `sex` is a factor data type but is currently a character data type (F is Female, M is male, and I is infant). Change `sex` to a factor and view the statistical summary of the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abalone$sex <- as.factor(abalone$sex)\n", "summary(abalone)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The summary above shows that the minimum value for `height` is 0.\n", "\n", "Visually explore which abalones have height equal to 0 by plotting the relationship between `rings` and `height` for each value of `sex`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(ggplot2)\n", "options(repr.plot.width = 5, repr.plot.height = 4) \n", "ggplot(abalone, aes(x = height, y = rings, color = sex)) + geom_point() + geom_jitter()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The plot shows multiple outliers: two infant abalones with a height of 0 and a few female and male abalones with greater heights than the rest. Let's filter out the two infant abalones with a height of 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(dplyr)\n", "abalone <- abalone %>%\n", " filter(height != 0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the Dataset for Model Training\n", "\n", "We'll create three datasets: one each for training, testing, and validation. First, convert `sex` into a [dummy variable](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) and move the target, `rings`, to the first column. 
The SageMaker XGBoost algorithm, which we'll use to train a model, requires the target to be in the first column of the dataset. (In general SageMaker can work with any data format required by the specific algorithm or framework.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abalone <- abalone %>%\n", " mutate(female = as.integer(ifelse(sex == 'F', 1, 0)),\n", " male = as.integer(ifelse(sex == 'M', 1, 0)),\n", " infant = as.integer(ifelse(sex == 'I', 1, 0))) %>%\n", " select(-sex)\n", "abalone <- abalone %>%\n", " select(rings:infant, length:shell_weight)\n", "head(abalone)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, sample 70% of the data for training the ML algorithm. Split the remaining 30% into two halves, one for testing and one for validation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abalone_train <- abalone %>%\n", " sample_frac(size = 0.7)\n", "abalone <- anti_join(abalone, abalone_train)\n", "abalone_test <- abalone %>%\n", " sample_frac(size = 0.5)\n", "abalone_valid <- anti_join(abalone, abalone_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Later in the notebook, we'll use Batch Transform and Endpoints to generate predictions in two different ways, and compare the results. The maximum number of rows that we can send to an endpoint for inference in one batch is 500 rows (Batch Transform is better suited to batch scoring at scale than Endpoints). We are going to reduce the number of rows for the test dataset to 500 and use this for batch and online inference for comparison. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "num_predict_rows <- 500\n", "abalone_test <- abalone_test[1:num_predict_rows, ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we'll upload the training and validation data to Amazon S3 so we can train the model. 
First, write the training, validation, and test datasets to the local filesystem in .csv format:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "write_csv(abalone_train, 'abalone_train.csv', col_names = FALSE)\n", "write_csv(abalone_valid, 'abalone_valid.csv', col_names = FALSE)\n", "\n", "# Remove target from test\n", "write_csv(abalone_test[-1], 'abalone_test.csv', col_names = FALSE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, upload the three datasets to the Amazon S3 bucket:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_train <- session$upload_data(path = 'abalone_train.csv', \n", " bucket = bucket, \n", " key_prefix = 'data')\n", "s3_valid <- session$upload_data(path = 'abalone_valid.csv', \n", " bucket = bucket, \n", " key_prefix = 'data')\n", "\n", "s3_test <- session$upload_data(path = 'abalone_test.csv', \n", " bucket = bucket, \n", " key_prefix = 'data')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, define the Amazon S3 input types for the Amazon SageMaker algorithm:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_train_input <- sagemaker$inputs$TrainingInput(s3_data = s3_train,\n", " content_type = 'csv')\n", "s3_valid_input <- sagemaker$inputs$TrainingInput(s3_data = s3_valid,\n", " content_type = 'csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hyperparameter Tuning for the XGBoost Model\n", "\n", "Amazon SageMaker algorithms are available via [Docker](https://www.docker.com/) containers. To train an [XGBoost](https://en.wikipedia.org/wiki/Xgboost) model, specify the URL of the SageMaker XGBoost container, which is in [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) (Amazon ECR) for each AWS Region. We will use the `latest` version of the algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container <- sagemaker$image_uris$retrieve(framework='xgboost', region= session$boto_region_name, version='latest')\n", "cat('XGBoost Container Image URL: ', container)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define an Amazon SageMaker [Estimator](http://sagemaker.readthedocs.io/en/latest/estimators.html), which can train any algorithm that has been containerized with Docker. There are many open source prebuilt containers for SageMaker for frameworks such as XGBoost, Scikit-learn, PyTorch, TensorFlow, etc. 
You can also bring your own (BYO) container.\n", "\n", "When creating the Estimator, use the following arguments:\n", "* **image_uri** - The container image to use for training\n", "* **role** - The Amazon SageMaker service role\n", "* **train_instance_count** - The number of Amazon EC2 instances to use for training\n", "* **train_instance_type** - The type of Amazon EC2 instance to use for training\n", "* **train_volume_size** - The size in GB of the [Amazon Elastic Block Store](https://aws.amazon.com/ebs/) (Amazon EBS) volume to use for storing input data during training\n", "* **train_max_run** - The timeout in seconds for training\n", "* **input_mode** - The input mode that the algorithm supports\n", "* **output_path** - The Amazon S3 location for saving the training results (model artifacts and output files)\n", "* **output_kms_key** - The [AWS Key Management Service](https://aws.amazon.com/kms/) (AWS KMS) key for encrypting the training output\n", "* **base_job_name** - The prefix for the name of the training job\n", "* **sagemaker_session** - The Session object that manages interactions with the Amazon SageMaker API" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s3_output <- paste0('s3://', bucket, '/output')\n", "estimator <- sagemaker$estimator$Estimator(image_uri = container,\n", " role = role_arn,\n", " train_instance_count = 1L,\n", " train_instance_type = 'ml.m5.4xlarge',\n", " train_volume_size = 30L,\n", " train_max_run = 3600L,\n", " input_mode = 'File',\n", " output_path = s3_output,\n", " output_kms_key = NULL,\n", " base_job_name = NULL,\n", " sagemaker_session = NULL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note** - The equivalent to `None` in Python is `NULL` in R.\n", "\n", "Next, we specify the [XGBoost hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) for the estimator, and also define the ranges of hyperparameters that we want to use for [SageMaker Automatic Model Tuning](https://sagemaker.readthedocs.io/en/stable/tuner.html). You can find the list of [Tunable Hyperparameters for the XGBoost algorithm here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html). \n", "\n", "In addition, you need to specify the tuning evaluation metric. XGBoost supports the following evaluation metrics (for a description of each, see the [\"Tune an XGBoost Model\"](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html) page):\n", "\n", "- validation:accuracy\n", "- validation:auc\n", "- validation:error\n", "- validation:f1\n", "- validation:logloss\n", "- validation:mae\n", "- validation:map\n", "- validation:merror\n", "- validation:mlogloss\n", "- validation:mse\n", "- validation:ndcg\n", "- validation:rmse\n", "\n", "In this case, since this is a regression problem, we select `validation:rmse` as the tuning objective.\n", "\n", "For tuning the hyperparameters you also need to specify the type and range of each hyperparameter to be tuned. You can specify either a `ContinuousParameter` or an `IntegerParameter`, as outlined in the documentation. In addition, the algorithm documentation provides suggestions for the hyperparameter ranges.\n", "\n", "Once the Estimator and its hyperparameters and tunable hyperparameter ranges are specified, you can create a `HyperparameterTuner` (tuner) object. You can train (or fit) that tuner, which will conduct the tuning and identify the best model in relation to the tuning objective. 
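\n", "\n", "Although only continuous and integer ranges are used in this notebook, categorical hyperparameters can also be expressed with a `CategoricalParameter` range. A minimal sketch (purely illustrative; the values shown are placeholders and this range is not part of the tuning job defined below):\n", "\n", "```r\n", "# Illustrative only -- not used in the tuning job below\n", "# example_range <- sagemaker$parameter$CategoricalParameter(c('value_a', 'value_b'))\n", "```\n", "\n", "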
You can then generate predictions using the model with Batch Transform, or by deploying the model as an endpoint and using it for online inference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set Hyperparameters\n", "estimator$set_hyperparameters(eval_metric='rmse',\n", " objective='reg:linear',\n", " num_round=100L,\n", " rate_drop=0.3,\n", " tweedie_variance_power=1.4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set Hyperparameter Ranges\n", "hyperparameter_ranges <- list('eta' = sagemaker$parameter$ContinuousParameter(0,1),\n", " 'min_child_weight'= sagemaker$parameter$ContinuousParameter(0,10),\n", " 'alpha'= sagemaker$parameter$ContinuousParameter(0,2),\n", " 'max_depth'= sagemaker$parameter$IntegerParameter(0L,10L))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set the tuning objective to RMSE\n", "objective_metric_name <- 'validation:rmse'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `HyperparameterTuner` accepts multiple parameters. A short list of these parameters is described below. For the complete list and more details you can visit the [`HyperparameterTuner` documentation](https://sagemaker.readthedocs.io/en/stable/tuner.html#hyperparametertuner):\n", "\n", "- **estimator** (sagemaker.estimator.EstimatorBase) – An estimator object that has been initialized with the desired configuration. There does not need to be a training job associated with this instance.\n", "- **objective_metric_name** (str) – Name of the metric for evaluating training jobs.\n", "- **hyperparameter_ranges** (dict[str, sagemaker.parameter.ParameterRange]) – Dictionary of parameter ranges. These parameter ranges can be one of three types: Continuous, Integer, or Categorical. \n", "- **objective_type** (str) – The type of the objective metric for evaluating training jobs. This value can be either ‘Minimize’ or ‘Maximize’ (default: ‘Maximize’).\n", "- **max_jobs** (int) – Maximum total number of training jobs to start for the hyperparameter tuning job (default: 1).\n", "- **max_parallel_jobs** (int) – Maximum number of parallel training jobs to start (default: 1).\n", "\n", "Here we've set the maximum number of jobs to 12, with a maximum parallelization of 4 training jobs at a time. Accordingly, the tuning job will run a series of 12/4 = 3 rounds of 4 concurrent training jobs. The default tuning strategy, Bayesian Optimization, performs best if parallel jobs are iteratively run in several rounds so the overall 'meta-model' can make progress focusing in on the optimal hyperparameter ranges. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a hyperparameter tuner\n", "tuner <- sagemaker$tuner$HyperparameterTuner(estimator,\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " objective_type='Minimize',\n", " max_jobs=12L,\n", " max_parallel_jobs=4L)\n", "\n", "# Create a tuning job name\n", "job_name <- paste('sagemaker-tune-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')\n", "\n", "# Define the data channels for train and validation datasets\n", "input_data <- list('train' = s3_train_input,\n", " 'validation' = s3_valid_input)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start the tuning job, simply call the `fit` method of the `HyperparameterTuner` object. 
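\n", "\n", "In the next cell we pass `wait=FALSE`, so the call returns immediately and we can poll the job status from the notebook. If you would rather block until tuning finishes, a minimal alternative (assuming the same `tuner`, `input_data`, and `job_name` objects) is:\n", "\n", "```r\n", "# Alternative: start the tuning job and block until it completes\n", "# tuner$fit(inputs = input_data, job_name = job_name, wait = TRUE)\n", "```\n", "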
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner$fit(inputs = input_data, job_name = job_name, wait=FALSE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the `boto3` SDK to Interact with AWS Services and Get the Status of the Tuning Job\n", "\n", "With [`boto3` Python SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) you can create, configure, and manage AWS services, such as Amazon S3, Amazon SageMaker and other AWS services. The SDK provides an object-oriented API as well as low-level access to AWS services. Using `reticulate` library, you can leverage this SDK in R. \n", "\n", "Since running a tuning job may take a while, we are going to use `boto3` to get the status of the tuning job using `sagemaker$describe_hyper_parameter_tuning_job`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3_r <- import('boto3')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a paws SageMaker session\n", "sm <- boto3_r$client('sagemaker')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Periodically check the tuning job status by re-running the cell below. Eventually the status should become `Completed` with 12 'Succeeded' models after about 10 minutes of total job time. You also can view the job status in the SageMaker console under **Hyperparameter tuning jobs**, which appears underneath the **Training** drop down." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the status of the tuning job\n", "status <- sm$describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuner$latest_tuning_job$job_name)\n", "\n", "cat('Hyperparameter Tuning Job Name: ', job_name,'\\n')\n", "cat('Hyperparameter Tuning Job Status: ', status$HyperParameterTuningJobStatus,'\\n')\n", "cat('Succeeded Models:', status$ObjectiveStatusCounters$Succeeded,'\\n')\n", "cat('InProgress Models:', status$ObjectiveStatusCounters$Pending,'\\n')\n", "cat('Failed Models:', status$ObjectiveStatusCounters$Failed,'\\n')\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print best training hyperparamters\n", "status$BestTrainingJob$TunedHyperParameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print Evaluation Metric\n", "status$BestTrainingJob$FinalHyperParameterTuningJobObjectiveMetric" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Name of the best training job model\n", "status$BestTrainingJob$TrainingJobName" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Option 1: Batch Transform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a Model using the Best Training Job\n", "This section demonstrates how to create a model with the best training job results from the tuning job, using the model artifacts saved in S3.\n", "\n", "First, we create a model container, which requires the following parameters:\n", "- **Image:** URL of the algorithm container \n", "- **ModelDataUrl:** Location of the model tar ball (model.tar.gz) on S3 that is saved by the tuning job\n", "\n", "We can extract the **ModelDataUrl** by describing the best training job using `boto3` SDK and `describe_training_job()` method. [More details can be found here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_training_job).\n", " \n", "Then we will create a model using this model container. We will use `paws` library and the `create_model` method. [Documentation of this method can be found here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Describe best model from tuning to get the location of the model artifact in S3\n", "model_artifact <- sm$describe_training_job(\n", " TrainingJobName = status$BestTrainingJob$TrainingJobName\n", ")$ModelArtifacts$S3ModelArtifacts\n", "\n", "model_artifact" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a model container wrapper\n", "model_container <- list(\n", " \"Image\"= container,\n", " \"ModelDataUrl\" = model_artifact\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a SageMaker Model object\n", "\n", "model_name <- paste('sagemaker-model-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')\n", "\n", "best_model <- sm$create_model(\n", " ModelName = model_name,\n", " PrimaryContainer = model_container,\n", " ExecutionRoleArn = role_arn\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Batch Transform using the Tuned Estimator\n", "\n", "In many situations, using an endpoint for model deployment is not the best option, especially when the goal is not to generate online real-time predictions, but rather to generate predictions offline on a large stored dataset. In these situations, using Batch Transform may be more efficient and appropriate.\n", "\n", "This section of the notebook explains how to set up the Batch Transform Job and generate predictions.\n", "\n", "To do this, first we need to define the batch input data path on S3, and also where to save the generated predictions in S3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define S3 path for Test data and output path\n", "\n", "s3_test_url <- paste('s3:/',bucket,'data','abalone_test.csv', sep = '/')\n", "output_path <- paste('s3:/',bucket,'output/batch_transform_output',job_name, sep = '/')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we instantiate a `Transformer` object. [Transformers](https://sagemaker.readthedocs.io/en/stable/transformer.html#transformer) take multiple parameters, including the following. 
For more details and the complete list, visit the [documentation page](https://sagemaker.readthedocs.io/en/stable/transformer.html#transformer).\n", "\n", "- **model_name** (str) – Name of the SageMaker Model being used for the transform job.\n", "- **instance_count** (int) – Number of EC2 instances to use.\n", "- **instance_type** (str) – Type of EC2 instance to use, for example, ‘ml.c5.xlarge’.\n", "- **output_path** (str) – S3 location for saving the transform job results. If not specified, results are stored to a default bucket.\n", "- **base_transform_job_name** (str) – Prefix for the transform job when the transform() method launches. If not specified, a default prefix will be generated based on the training image name that was used to train the model associated with the transform job.\n", "- **sagemaker_session** (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, one is created using the default AWS configuration chain.\n", "\n", "Once we instantiate a `Transformer` object, we can transform the batch input data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Instantiate a SageMaker Transformer\n", "transformer <- sagemaker$transformer$Transformer(\n", " model_name = model_name,\n", " instance_count=1L,\n", " instance_type='ml.m5.4xlarge',\n", " output_path=output_path,\n", " base_transform_job_name='R-Transformer',\n", " sagemaker_session=session)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Transform the test data and wait until the task completes\n", "transformer$transform(s3_test_url)\n", "transformer$wait()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the status of Batch Transform\n", "sm$describe_transform_job(TransformJobName = transformer$latest_transform_job$job_name)$TransformJobStatus\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the batch job output\n", "\n", "The next step is to download the Batch Transform job results from where they are stored in S3."
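, "\n", "If you want to confirm what the transform job wrote before downloading, you can first list the objects under the output prefix (a small sketch using the S3 utilities in the SageMaker Python SDK):\n", "\n", "```r\n", "# List the objects written by the Batch Transform job\n", "sagemaker$s3$S3Downloader$list(output_path)\n", "```"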
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sagemaker$s3$S3Downloader$download(paste(output_path,\"abalone_test.csv.out\",sep = '/'),\n", " \"batch_output\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Read the batch csv from sagemaker local files\n", "library(readr)\n", "predictions <- read_csv(file = 'batch_output/abalone_test.csv.out', col_names = 'predicted_rings')\n", "head(predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Column-bind the predicted rings to the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Concatenate predictions and test for comparison\n", "abalone_predictions <- cbind(predicted_rings = predictions, \n", " abalone_test)\n", "# Convert predictions to Integer\n", "abalone_predictions$predicted_rings = as.integer(abalone_predictions$predicted_rings);\n", "head(abalone_predictions)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define a function to calculate RMSE\n", "rmse <- function(m, o){\n", " sqrt(mean((m - o)^2))\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calucalte RMSE\n", "abalone_rmse <- rmse(abalone_predictions$rings, abalone_predictions$predicted_rings)\n", "cat('RMSE for Batch Transform: ', round(abalone_rmse, digits = 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Option 2: Generate Predictions from an Endpoint\n", "### Deploying the Tuner\n", "\n", "This section walks you through the endpoint deployment process for the tuned/trained model. We will then use the deployed model (as an endpoint) to make predictions using the test data. Deploying the model as as endpoint is suitable for cases where you need to make online predictions in response to data streaming in, for example from a consumer-facing app. For making predictions using offline batch data, the preferred method is using Batch Transform, which was demonstrated in the previous section.\n", "\n", "Amazon SageMaker lets you [deploy your model](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html) by providing an endpoint that consumers can invoke by a secure and simple API call using an HTTPS request. Let's deploy our trained model to a `ml.t2.medium` instance. This will take a couple of minutes or less." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "model_endpoint <- tuner$deploy(initial_instance_count = 1L,\n", " instance_type = 'ml.t2.medium')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating Predictions with the Deployed Model\n", "\n", "Use the test data to generate predictions. Pass comma-separated text to be serialized into JSON format by specifying `text/csv` and `csv_serializer` for the endpoint:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_endpoint$serializer <- sagemaker$serializers$CSVSerializer(content_type='text/csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove the target column and convert the dataframe to a matrix with no column names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_sample <- as.matrix(abalone_test[-1])\n", "dimnames(test_sample)[[2]] <- NULL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate predictions from the endpoint and convert the returned comma-separated string:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(stringr)\n", "predictions_ep <- model_endpoint$predict(test_sample)\n", "predictions_ep <- str_split(predictions_ep, pattern = ',', simplify = TRUE)\n", "predictions_ep <- as.integer(unlist(predictions_ep))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Column-bind the predicted ring numbers to those in the test data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Convert predictions to Integer\n", "abalone_predictions_ep <- cbind(predicted_rings = predictions_ep, \n", " abalone_test)\n", "# abalone_predictions = as.integer(abalone_predictions)\n", "head(abalone_predictions_ep)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calucalte RMSE\n", "abalone_rmse_ep <- rmse(abalone_predictions_ep$rings, abalone_predictions_ep$predicted_rings)\n", "cat('RMSE for Endpoint 500-Row Prediction: ', round(abalone_rmse_ep, digits = 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting the Endpoint\n", "\n", "When you're done with the model, delete the endpoint to avoid incurring deployment costs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "session$delete_endpoint(model_endpoint$endpoint)" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Automating the workflow with SageMaker Pipelines\n", "\n", "In the previous sections of this notebook, we prototyped various steps of a project within the notebook itself, with some steps being run on external SageMaker resources (model tuning, batch transform, hosted endpoints). Notebooks are great for prototyping, but generally are not used in production-ready machine learning pipelines.\n", "\n", "A very simple pipeline in SageMaker includes processing the dataset to get it reading for training, performing the actual training, and then using the model to perform some form of inference such as batch prediction. We'll use SageMaker Pipelines to automate some of these steps, keeping the pipeline simple for now: it easily can be extended into a far more complex pipeline with conditional steps and more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pipeline parameters\n", "\n", "Before we begin to create the pipeline itself, we should think about how to parameterize it. For example, we may use different instance types for different purposes, such as CPU-based types for batch scoring and GPU-based or more powerful types for model training. These are all \"knobs\" of the pipeline that we can parameterize. Parameterizing enables custom pipeline executions and schedules without having to modify the pipeline definition.\n", "\n", "In the next cell several different pipeline parameters are specified. These include input data location (enabling future pipeline runs on different data slices without changing any code), and instance types and counts. Many others are possible." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# input data\n", "train_input_data_slice = sagemaker$workflow$parameters$ParameterString(name=\"TrainInputData\", default_value=s3_train)\n", "validation_input_data_slice = sagemaker$workflow$parameters$ParameterString(name=\"ValidationInputData\", default_value=s3_valid)\n", "\n", "# training step parameters (default instance type is a compute-optimized, powerful c5)\n", "training_instance_type = sagemaker$workflow$parameters$ParameterString(name=\"TrainingInstanceType\", default_value=\"ml.c5.2xlarge\")\n", "training_instance_count = sagemaker$workflow$parameters$ParameterInteger(name=\"TrainingInstanceCount\", default_value=1L)\n", "\n", "# batch inference step parameters (default instance type is a general purpose m5)\n", "batch_instance_type = sagemaker$workflow$parameters$ParameterString(name=\"BatchInstanceType\", default_value=\"ml.m5.4xlarge\")\n", "batch_instance_count = sagemaker$workflow$parameters$ParameterInteger(name=\"BatchInstanceCount\", default_value=1L)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating pipeline steps\n", "\n", "The following code sets up a pipeline step for a training job. The training step simply wraps the estimator created above, making it easy to move from prototyping to pipelines with a minimal learning curve. Note that instead of a `TrainingStep`, we could have used a `TuningStep` incorporating Automatic Model Tuning similar to the code above (we're doing one training job here to shorten the time required to run this notebook)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_estimator <- sagemaker$estimator$Estimator(image_uri = container,\n", " role = role_arn,\n", " instance_count = training_instance_count,\n", " instance_type = training_instance_type,\n", " volume_size = 30L,\n", " train_max_run = 3600L,\n", " input_mode = 'File',\n", " output_path = s3_output,\n", " output_kms_key = NULL,\n", " base_job_name = NULL,\n", " sagemaker_session = NULL)\n", "\n", "pipeline_estimator$set_hyperparameters(eval_metric='rmse',\n", " objective='reg:linear',\n", " num_round=100L,\n", " rate_drop=0.3,\n", " tweedie_variance_power=1.4)\n", "\n", "input_data <- list('train' = sagemaker$inputs$TrainingInput(s3_data = train_input_data_slice, content_type = 'csv'),\n", " 'validation' = sagemaker$inputs$TrainingInput(s3_data = validation_input_data_slice, content_type = 'csv'))\n", "\n", "step_train <- sagemaker$workflow$steps$TrainingStep(\n", " name = \"R-XGBoost-Train\",\n", " estimator = pipeline_estimator,\n", " inputs = input_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll also create a SageMaker Model object to wrap the model artifact, and associate it with a separate SageMaker prebuilt XGBoost container to potentially use later for inference. For example, the Model could be used with SageMaker Batch Transform." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_model <- sagemaker$model$Model(\n", " image_uri = container,\n", " model_data = step_train$properties$ModelArtifacts$S3ModelArtifacts,\n", " sagemaker_session = session,\n", " role = role_arn\n", " )\n", "\n", "model_hw_inputs <- sagemaker$inputs$CreateModelInput(instance_type = \"ml.m5.large\")\n", "\n", "step_create_model <- sagemaker$workflow$steps$CreateModelStep(\n", " name = \"R-XGBoost-Create-Model\",\n", " model = pipeline_model,\n", " inputs = model_hw_inputs\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a further step, we also can register the model in the SageMaker Model Registry for improved model governance. For example, with Model Registry you can catalog models for production, manage model versions, and associate metadata with models. Model Registry also can be used as part of a CI/CD workflow for model deployment to SageMaker Endpoints. By default, each time this pipeline is executed, a new model version will be registered." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "step_register <- sagemaker$workflow$step_collections$RegisterModel(\n", " name = \"R-XGBoost-Register-Model\",\n", " estimator = estimator,\n", " model_data = step_train$properties$ModelArtifacts$S3ModelArtifacts,\n", " content_types = c(\"text/csv\", \"text/csv\"),\n", " response_types = c(\"text/csv\", \"text/csv\"),\n", " inference_instances = c(\"ml.c5.xlarge\", \"ml.m5.xlarge\"),\n", " transform_instances = c(\"ml.c5.xlarge\", \"ml.m5.xlarge\"),\n", " model_package_group_name = \"R-XGBoost-ModelPackageGroup\",\n", " image_uri = container\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The final step in this pipeline is offline, batch scoring (inference) using the SageMaker Batch Transform feature. The inputs to this step will be the model trained in the pipeline training step, and the test data. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_transformer <- sagemaker$transformer$Transformer(\n", " model_name = step_create_model$properties$ModelName,\n", " instance_count = batch_instance_count,\n", " instance_type = batch_instance_type,\n", " output_path = output_path,\n", " sagemaker_session = session\n", " )\n", "\n", "step_transform <- sagemaker$workflow$steps$TransformStep(\n", " name = \"R-XGBoost-Transform\", \n", " transformer = pipeline_transformer, \n", " inputs = sagemaker$inputs$TransformInput(data = s3_test_url)\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up and starting the pipeline\n", "\n", "With all of the pipeline steps now defined, we can define the pipeline itself as a Pipeline object comprising a series of those steps. Parallel and conditional steps also are possible. The pipeline's directed acyclic graph (DAG) structure is inferred from the specified inputs and outputs, there is no need to explicitly specify the DAG or whether steps are parallel or in series." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline <- sagemaker$workflow$pipeline$Pipeline(\n", " name = 'R-XGBoost',\n", " parameters = c(train_input_data_slice,\n", " validation_input_data_slice,\n", " training_instance_type,\n", " training_instance_count,\n", " batch_instance_type,\n", " batch_instance_count),\n", " steps = c(step_train, step_create_model, step_register, step_transform)\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can quickly do a sanity check on the pipeline definition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline$definition()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we upsert the pipeline definition. Using the upsert method allows us to modify the pipeline definition as needed for each execution. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline$upsert(role_arn = role_arn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A pipeline execution can be started with the `start` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution <- pipeline$start()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now confirm that the pipeline is executing. In the log output below, confirm that PipelineExecutionStatus is `Executing`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution$describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically this pipeline should take about 10 minutes to complete. We can wait for completion by invoking wait(). After execution is complete, we can list the status of the pipeline steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution$wait()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "execution$list_steps()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Viewing the pipeline in the Studio UI\n", "\n", "#### Pipeline details\n", "\n", "After the pipeline has started, we can examine the progress of the pipeline in the Studio UI. 
To examine the pipeline in the Studio UI:\n", "\n", "- Click the tilted triangle icon in the left panel toolbar, and select **Pipelines** from the drop down menu.\n", "- Double-click the name of your pipeline, which should be `R-XGBoost`.\n", "- In the new pipeline tab that opens, you should see a table of pipeline executions. Double-click the first execution.\n", "- Another tab opens with pipeline execution details. Under the **Graph** sub-tab, you should see a pipeline node graph. Typically, in-progress nodes are blue, succeeded are green, and failed are red.\n", "- Click any of the pipeline graph nodes that have finished to see details such as input, output and logs (which saves the effort of switching to a different UI, CloudWatch Logs, to inspect logs). \n", "- The **Settings** sub-tab will have general metadata about the pipeline execution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Experiments\n", "\n", "By default, a pipeline execution also will generate a SageMaker Experiment for tracking the overall workflow. To examine this:\n", "\n", "- Click the tilted triangle icon in the left panel toolbar, and select **Experiments and trials** from the drop down menu.\n", "- Double-click the name of your Experiment, which should be `r-xgboost`.\n", "- In the Trials table that opens, right-click the first trial and select **Open in trial components list**. \n", "- A Trial Components List tab will open. For this simple pipeline, it will only have a Training job and a Transform job. You can right-click them and select **Open in trial details** to view metadata about each job. More advanced usage is possible as you'll see from all of the sub-tab options that are available, but for the sake of simplicity they were not used in this example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model Registry\n", "\n", "Since we had a RegisterModel step in our pipeline, the model also was registered in the SageMaker Model Registry. To view the registry:\n", "\n", "- Click the tilted triangle icon in the left panel toolbar, and select **Model registry** from the drop down menu.\n", "- Double-click the name of your model group, which should be `R-XGBoost-ModelPackageGroup`.\n", "- A tab will open for the Model Package Group. It will have a list of model versions (by default they are numbered starting from 1 for each pipeline execution), and a sub-tab for Settings for the overall group. \n", "- You can double-click a model version to open a version tab with metadata about that version. Although we didn't make use of it, there is additional functionality for tracking the metrics associated with the model, performing manual approvals for model deployment to an endpoint, etc." ] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "R (custom-r/latest)", "language": "python", "name": "ir__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:894087409521:image/custom-r" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.0.0" } }, "nbformat": 4, "nbformat_minor": 4 }