{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# SageMaker Pipelines workshop\n", "\n", "This workshop is based on the [amazon sagemaker drift detection project](https://github.com/aws-samples/amazon-sagemaker-drift-detection) on github, available in the aws-samples repository. \n", "\n", "Amazon SageMaker Model Building Pipelines offers machine learning (ML) application developers and operations engineers the ability to orchestrate SageMaker jobs and author reproducible ML pipelines. It also enables them to deploy custom-build models for inference in real-time with low latency, run offline inferences with Batch Transform, and track lineage of artifacts. They can institute sound operational practices in deploying and monitoring production workflows, deploying model artifacts, and tracking artifact lineage through a simple interface, adhering to safety and best practice paradigms for ML application development.\n", "\n", "The SageMaker Pipelines service supports a SageMaker Pipeline domain specific language (DSL), which is a declarative JSON specification. This DSL defines a directed acyclic graph (DAG) of pipeline parameters and SageMaker job steps. The SageMaker Python Software Developer Kit (SDK) streamlines the generation of the pipeline DSL using constructs that engineers and scientists are already familiar with." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## SageMaker Pipelines\n", "\n", "SageMaker Pipelines supports the following activities, which are demonstrated in this notebook:\n", "\n", "* Pipelines - A DAG of steps and conditions to orchestrate SageMaker jobs and resource creation.\n", "* Processing job steps - A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation.\n", "* Training job steps - An iterative process that teaches a model to make predictions by presenting examples from a training dataset.\n", "* Conditional execution steps - A step that provides conditional execution of branches in a pipeline.\n", "* Register model steps - A step that creates a model package resource in the Model Registry that can be used to create deployable models in Amazon SageMaker.\n", "* Create model steps - A step that creates a model for use in transform steps or later publication as an endpoint.\n", "* Transform job steps - A batch transform to preprocess datasets to remove noise or bias that interferes with training or inference from a dataset, get inferences from large datasets, and run inference when a persistent endpoint is not needed.\n", "* Parametrized Pipeline executions - Enables variation in pipeline executions according to specified parameters." 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook Overview\n", "\n", "This notebook shows how to:\n", "\n", "* Define a set of Pipeline parameters that can be used to parametrize a SageMaker Pipeline.\n", "* Define a Processing step that performs cleaning and feature engineering, and splits the input data into train and test datasets.\n", "* Define a Training step that trains a model on the preprocessed train dataset.\n", "* Define a Processing step that evaluates the trained model's performance on the test dataset.\n", "* Define a Create Model step that creates a model from the model artifacts used in training.\n", "* Define a Transform step that performs batch transformation based on the model that was created.\n", "* Define a Register Model step that creates a model package from the estimator and model artifacts used to train the model.\n", "* Define a Conditional step that evaluates a condition based on output from prior steps and conditionally executes other steps.\n", "* Define and create a Pipeline, expressed as a DAG of the defined parameters and steps.\n", "* Start a Pipeline execution and wait for it to complete.\n", "* Download the model evaluation report from the S3 bucket for examination.\n", "* Start a second Pipeline execution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will consider three teams in this scenario:\n", "* Engineering team - ML or data engineering\n", "* Data science team - developing the models\n", "* DevOps team - automating and integrating the pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A SageMaker Pipeline\n", "\n", "The pipeline that you create follows a typical machine learning (ML) application pattern of preprocessing, training, evaluation, model creation, batch transformation, and model registration:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset\n", "We use the publicly available [New York City Taxi and Limousine Commission (TLC) Trip Record Data](https://registry.opendata.aws/nyc-tlc-trip-records-pds/). The data contains yellow and green taxi trip records, including fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data was collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). For more details, refer to the [NYC documentation](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "