{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Amazon SageMaker with XGBoost and Hyperparameter Tuning for Taxi Trip Fare Prediction\n", "#### Supervised Learning with Gradient Boosted Trees\n", "This notebook works well with the **Python 3 (Data Science)** kernel on SageMaker Studio, or conda_python3 on classic SageMaker Notebook Instances\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Objective\n", "\n", "This workshop aims to give you an **example of using and tuning a SageMaker built-in algorithm**: Focussing on the **data interfaces** and SageMaker's automatic **Hyperparameter Optimization** (HPO) capabilities.\n", "\n", "Teaching in-depth data science approaches for tabular data is outside this scope, and we hope you can use this notebook as a starting point to modify for the needs of your future projects." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Prepare our Environment\n", "\n", "We'll need to:\n", "\n", "- **import** some useful libraries (as in any Python notebook)\n", "- **configure** the S3 bucket and folder where data should be stored (to keep our environment tidy)\n", "- **connect** to AWS in general (with [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)) and SageMaker in particular (with the [sagemaker SDK](https://sagemaker.readthedocs.io/en/stable/)), to use the cloud services\n", "\n", "While `boto3` is the general AWS SDK for Python, `sagemaker` provides some powerful, higher-level interfaces designed specifically for ML workflows.\n", "\n", "**Note that, you need to complete Lab 1 as a prerequisite before running this notebook as the training, validation and test datasets are prepared in Lab 1**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import re\n", "import copy\n", "import time\n", "from time import gmtime, strftime\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "from sagemaker.debugger import Rule, rule_configs\n", "\n", "role = get_execution_role()\n", "\n", "region = boto3.Session().region_name\n", "\n", "sagemaker_session = sagemaker.Session()\n", "\n", "bucket=sagemaker.Session().default_bucket()\n", "prefix = 'sagemaker/DEMO-xgboost-tripfare'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store\n", "%store -r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Understand the Algorithm\n", "\n", "We'll be using SageMaker's [built-in **XGBoost Algorithm**](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html): Benefiting from performance-optimized, pre-implemented functionality like multi-instance parallelization, and support for multiple input formats.\n", "\n", "In general to use the pre-built algorithms, we'll need to:\n", "\n", "- Refer to the [Common Parameters docs](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) to see the **high-level configuration** and what features each algorithm has\n", "- Refer to the [algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) to understand the **detail** of the **data formats** and **(hyper)-parameters** it supports\n", "\n", "From these docs, we'll understand what data format we need to upload to S3 (next), and how to get the container image URI of the algorithm... which is listed on the Common Parameters page but can also be extracted through the SDK.\n", "\n", "We know from [the algorithm docs](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) that SageMaker XGBoost expects data in the **libSVM** or **CSV** formats, with:\n", "\n", "- The target variable in the first column, and\n", "- No header row\n", "\n", "The data used in training has already been prepared during the Lab 1." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# training step for generating model artifacts\n", "training_instance_type = \"ml.m5.xlarge\"\n", "model_output = f\"s3://{bucket}/{prefix}/model\"\n", "\n", "# Define the XGBoost training report rules\n", "# see: https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-training-xgboost-report.html\n", "rules = [Rule.sagemaker(rule_configs.create_xgboost_report())]\n", "\n", "image_uri = sagemaker.image_uris.retrieve(\n", " framework=\"xgboost\",\n", " region=region,\n", " version=\"1.2-2\",\n", " py_version=\"py3\",\n", " instance_type=training_instance_type,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.inputs import TrainingInput\n", "content_type = \"csv\"\n", "train_input = TrainingInput(\n", " train_path, content_type=content_type, distribution='ShardedByS3Key'\n", ")\n", "validation_input = TrainingInput(\n", " validation_path, content_type=content_type, distribution='FullyReplicated'\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Train the Model\n", "\n", "Training a model on SageMaker follows the usual steps with other ML libraries (e.g. SciKit-Learn):\n", "1. Initiate a session (we did this up top).\n", "2. Instantiate an estimator object for our algorithm (XGBoost).\n", "3. Define its hyperparameters.\n", "4. Start the training job.\n", "\n", "#### A small competition!\n", "SageMaker's XGBoost includes 38 parameters. You can find more information about them [here](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html).\n", "For simplicity, we choose to experiment only with a few of them.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgb_train = sagemaker.estimator.Estimator(\n", " image_uri=image_uri,\n", " instance_type=training_instance_type,\n", " instance_count=2,\n", " output_path=model_output,\n", " base_job_name=f\"{prefix.split('/')[-1]}-train\",\n", " sagemaker_session=sagemaker_session,\n", " role=role,\n", " disable_profiler=False, # Profile processing job\n", " rules=rules, # Report processing job\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set some hyper parameters\n", "# https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html\n", "xgb_train.set_hyperparameters(\n", " objective=\"reg:squarederror\",\n", " num_round=100,\n", " early_stopping_rounds=10,\n", " max_depth=9,\n", " eta=0.2,\n", " gamma=4,\n", " min_child_weight=300,\n", " subsample=0.8,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "xgb_train.fit({'train': train_input, 'validation': validation_input})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before you run the next cell, make sure you wait the training job finished. You should see the output summarizing the Training seconds and Billable seconds at then end of the training. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "training_job_name = xgb_train.latest_training_job.job_name\n", "model_url = xgb_train.model_data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%store training_job_name\n", "%store model_url" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Deploy and Evaluate the Model\n", "\n", "### Deployment\n", "\n", "Now that we've trained the xgboost algorithm on our data, deploying the model (hosting it behind a real-time endpoint) is just one function call!\n", "\n", "This deployment might take **up to 10 minutes**, and by default the code will wait for the deployment to complete.\n", "\n", "If you like, you can instead:\n", "\n", "- add the `wait=False` parameter to the deploy function\n", "- Use the [Endpoints page of the SageMaker Console](https://console.aws.amazon.com/sagemaker/home?#/endpoints) to check the status of the deployment\n", "- Skip over the *Evaluation* section below (which won't run until the deployment is complete), and start the Hyperparameter Optimization job - which will take a while to run too, so can be started in parallel" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgboost_endpoint_name = \"xgboost-endpoint-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "xgboost_predictor = xgb_train.deploy(\n", " initial_instance_count=1, instance_type=\"ml.m5.xlarge\", endpoint_name=xgboost_endpoint_name\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read a few rows from the test dataset in s3\n", "import awswrangler as wr\n", "test_df = wr.s3.read_csv(\n", " path=test_path, dataset=True, nrows=5, header=None\n", " )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from sagemaker.serializers import CSVSerializer\n", "from sagemaker.deserializers import CSVDeserializer\n", "\n", "xgboost_predictor.serializer = CSVSerializer()\n", "xgboost_predictor.deserializer = CSVDeserializer()\n", "xgboost_predictor.predict(test_df.iloc[:,1:].values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Hyperparameter Optimization (HPO)\n", "*Note, with the default settings below, the hyperparameter tuning job can take up to ~20 minutes to complete.*\n", "\n", "We will use SageMaker HyperParameter Optimization (HPO) to automate the searching process effectively. Specifically, we **specify a range**, or a list of possible values in the case of categorical hyperparameters, for each of the hyperparameter that we plan to tune.\n", "\n", "SageMaker hyperparameter tuning will automatically launch **multiple training jobs** with different hyperparameter settings, evaluate results of those training jobs based on a predefined \"objective metric\", and select the hyperparameter settings for future attempts based on previous results. For each hyperparameter tuning job, we will specify the maximum number of HPO tries (`max_jobs`) and how many of these can happen in parallel (`max_parallel_jobs`).\n", "\n", "Tip: `max_parallel_jobs` creates a **trade-off between performance and speed** (better hyperparameter values vs how long it takes to find these values). If `max_parallel_jobs` is large, then HPO is faster, but the discovered values may not be optimal. Smaller `max_parallel_jobs` will increase the chance of finding optimal values, but HPO will take more time to finish.\n", "\n", "Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using built-in XGBoost algorithm here, it emits two predefined metrics: **validation:auc** and **train:auc**.\n", "\n", "Area Under the ROC Curve (AUC) measures the ability of a binary ML model to predict a higher score for positive examples as compared to negative examples. [See Machine Learning Key Concepts](https://docs.aws.amazon.com/machine-learning/latest/dg/amazon-machine-learning-key-concepts.html)\n", "\n", "We elected to monitor *validation:auc* as you can see below. In this case (because it's pre-built for us), we only need to specify the metric name.\n", "\n", "For more information on the documentation of the Sagemaker HPO please refer [here](https://sagemaker.readthedocs.io/en/stable/tuner.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner\n", "hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),\n", " 'min_child_weight': ContinuousParameter(1, 10),\n", " 'alpha': ContinuousParameter(0, 2),\n", " 'max_depth': IntegerParameter(1, 10)}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "objective_metric_name = 'validation:rmse'\n", "objective_type = 'Minimize'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner = HyperparameterTuner(xgb_train,\n", " objective_metric_name,\n", " hyperparameter_ranges,\n", " objective_type=objective_type,\n", " max_jobs=10,\n", " max_parallel_jobs=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Launch HPO\n", "Now we can launch a hyperparameter tuning job by calling *fit()* function. After the hyperparameter tuning job is created, we can go to SageMaker console to track the progress of the hyperparameter tuning job until it is completed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner.fit({'train': train_input, 'validation': validation_input})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "boto3.client('sagemaker').describe_hyper_parameter_tuning_job(\n", "HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# return the best training job name\n", "tuner.best_training_job()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Deploy the best trained or user specified model to an Amazon SageMaker endpoint\n", "tuner_predictor = tuner.deploy(initial_instance_count=1,\n", " instance_type='ml.m4.xlarge')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a serializer\n", "tuner_predictor.serializer = CSVSerializer()\n", "tuner_predictor.deserializer = CSVDeserializer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Predict\n", "tuner_predictor.predict(test_df.iloc[:,1:].values)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) Delete the Endpoint\n", "If you're done with this exercise, please run the delete_endpoint line in the cell below. This will remove the hosted endpoint and avoid any charges from a stray instance being left on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xgboost_predictor.delete_endpoint(delete_endpoint_config=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tuner_predictor.delete_endpoint(delete_endpoint_config=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## End of Lab 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }