{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Regression with Amazon SageMaker XGBoost (Parquet input)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n", "\n", "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "---\n", "\n", "## Contents\n", "1. [Introduction](#Introduction)\n", "2. [Setup](#Setup)\n", "3. [Training](#Training)\n", " 1. [Training with SageMaker Training](#Training-with-sagemaker-training)\n", " 2. [Training with SageMaker Automatic Model Tuning](#Tuning-with-sagemaker-automatic-model-tuning)\n", "4. [Plotting Objective Metric](#Plotting-objective-metric)\n", "\n", "---" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Introduction\n", "\n", "This notebook exhibits the use of a Parquet dataset for use with the SageMaker XGBoost algorithm. The example here is almost the same as [Regression with Amazon SageMaker XGBoost algorithm](xgboost_abalone.ipynb).\n", "\n", "This notebook tackles the exact same problem with the same solution, but has been modified for a Parquet input. \n", "The original notebook provides details of dataset and the machine learning use-case.\n", "\n", "This notebook was tested in Amazon SageMaker Studio on a ml.t3.medium instance with Python 3 (Data Science) kernel." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip3 install -U sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import re\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "role = get_execution_role()\n", "region = boto3.Session().region_name\n", "\n", "# S3 bucket for saving code and model artifacts.\n", "# Feel free to specify a different bucket here if you wish.\n", "bucket = sagemaker.Session().default_bucket()\n", "prefix = \"sagemaker/DEMO-xgboost-parquet\"\n", "bucket_path = \"https://s3-{}.amazonaws.com/{}\".format(region, bucket)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We will use [PyArrow](https://arrow.apache.org/docs/python/) library to store the Abalone dataset in the Parquet format." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pyarrow" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.datasets import load_svmlight_file\n", "\n", "s3 = boto3.client(\"s3\")\n", "# Download the dataset and load into a pandas dataframe\n", "FILE_NAME = \"abalone.csv\"\n", "s3.download_file(\n", " f\"sagemaker-example-files-prod-{region}\", f\"datasets/tabular/uci_abalone/abalone.csv\", FILE_NAME\n", ")\n", "feature_names = [\n", " \"Sex\",\n", " \"Length\",\n", " \"Diameter\",\n", " \"Height\",\n", " \"Whole weight\",\n", " \"Shucked weight\",\n", " \"Viscera weight\",\n", " \"Shell weight\",\n", " \"Rings\",\n", "]\n", "data = pd.read_csv(FILE_NAME, header=None, names=feature_names)\n", "\n", "# SageMaker XGBoost has the convention of label in the first column\n", "data = data[feature_names[-1:] + feature_names[:-1]]\n", "data[\"Sex\"] = data[\"Sex\"].astype(\"category\").cat.codes\n", "\n", "# Split the downloaded data into train/test dataframes\n", "train, test = np.split(data.sample(frac=1), [int(0.8 * len(data))])\n", "\n", "# requires PyArrow installed\n", "train.to_parquet(\"abalone_train.parquet\")\n", "test.to_parquet(\"abalone_test.parquet\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "sagemaker.Session().upload_data(\n", " \"abalone_train.parquet\", bucket=bucket, key_prefix=prefix + \"/\" + \"training\"\n", ")\n", "\n", "sagemaker.Session().upload_data(\n", " \"abalone_test.parquet\", bucket=bucket, key_prefix=prefix + \"/\" + \"validation\"\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We obtain the new container by specifying the framework version (1.7-1). This version specifies the upstream XGBoost framework version (1.7) and an additional SageMaker version (1). If you have an existing XGBoost workflow based on the previous (1.0-1, 1.2-2, 1.3-1 or 1.5-1) container, this would be the only change necessary to get the same workflow working with the new container." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "container = sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.7-1\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "\n", "After setting training parameters, we kick off training, and poll for status until training is completed.\n", "\n", "Training can be done by either calling SageMaker Training with a set of hyperparameters values to train with, or by leveraging SageMaker Automatic Model Tuning ([AMT](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)). AMT, also known as hyperparameter tuning (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.\n", "\n", "In this notebook, both methods are used for demonstration purposes, but the best training that the HPO job creates is the one that is eventually used for analytics purposes. 
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "\n", "After setting the training parameters, we kick off training and poll for status until the job completes.\n", "\n", "Training can be done either by calling SageMaker Training with a set of hyperparameter values, or by leveraging SageMaker Automatic Model Tuning ([AMT](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)). AMT, also known as hyperparameter optimization (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and the ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in the best-performing model, as measured by a metric that you choose.\n", "\n", "In this notebook, both methods are used for demonstration purposes, but the best training job found by the HPO job is the one eventually used for analysis. You can instead use the standalone training job by setting the variable `use_amt` below to `False`.\n", "\n", "\n", "### Training with SageMaker Training\n", "\n", "Training takes between 5 and 6 minutes in this example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "import time\n", "from time import gmtime, strftime\n", "\n", "client = boto3.client(\"sagemaker\", region_name=region)\n", "use_amt = True\n", "\n", "training_job_name = \"xgboost-parquet-example-training-\" + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n", "print(\"Training job\", training_job_name)\n", "\n", "# Ensure that the training and validation data folders generated above are reflected in the \"InputDataConfig\" parameter below.\n", "\n", "create_training_params = {\n", " \"AlgorithmSpecification\": {\"TrainingImage\": container, \"TrainingInputMode\": \"Pipe\"},\n", " \"RoleArn\": role,\n", " \"OutputDataConfig\": {\"S3OutputPath\": f\"{bucket_path}/{prefix}/single-xgboost\"},\n", " \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m5.2xlarge\", \"VolumeSizeInGB\": 20},\n", " \"TrainingJobName\": training_job_name,\n", " \"HyperParameters\": {\n", " \"max_depth\": \"5\",\n", " \"eta\": \"0.2\",\n", " \"gamma\": \"4\",\n", " \"min_child_weight\": \"6\",\n", " \"subsample\": \"0.7\",\n", " \"objective\": \"reg:squarederror\",  # \"reg:linear\" is deprecated in XGBoost 1.7\n", " \"num_round\": \"10\",\n", " \"verbosity\": \"2\",\n", " },\n", " \"StoppingCondition\": {\"MaxRuntimeInSeconds\": 3600},\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"train\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": f\"{bucket_path}/{prefix}/training\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"application/x-parquet\",\n", " \"CompressionType\": \"None\",\n", " },\n", " {\n", " \"ChannelName\": \"validation\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": f\"{bucket_path}/{prefix}/validation\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"application/x-parquet\",\n", " \"CompressionType\": \"None\",\n", " },\n", " ],\n", "}\n", "\n", "print(\n", " f\"Creating a training job with name: {training_job_name}. It will take between 5 and 6 minutes to complete.\"\n", ")\n", "client.create_training_job(**create_training_params)\n", "\n", "status = client.describe_training_job(TrainingJobName=training_job_name)[\"TrainingJobStatus\"]\n", "print(status)\n", "while status not in (\"Completed\", \"Failed\"):\n", " time.sleep(60)\n", " status = client.describe_training_job(TrainingJobName=training_job_name)[\"TrainingJobStatus\"]\n", " print(status)" ] }
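, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Once the job reaches `Completed`, you can inspect what it produced. The sketch below reads the final metrics and the model artifact location from the `DescribeTrainingJob` response." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Inspect the completed training job (a minimal sketch): print the final\n", "# metrics emitted by the algorithm and the S3 location of the model artifact.\n", "desc = client.describe_training_job(TrainingJobName=training_job_name)\n", "for metric in desc.get(\"FinalMetricDataList\", []):\n", " print(metric[\"MetricName\"], \"=\", metric[\"Value\"])\n", "print(\"Model artifact:\", desc[\"ModelArtifacts\"][\"S3ModelArtifacts\"])" ] }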
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Training with SageMaker Automatic Model Tuning\n", "\n", "To create a tuning job using the SageMaker Automatic Model Tuning API, you need to define three attributes:\n", "\n", "1. the tuning job name (a string)\n", "2. the tuning job config (a JSON object that specifies settings for the hyperparameter tuning job)\n", "3. the training job definition (a JSON object that configures the training jobs that the tuning job launches)\n", "\n", "To learn more, refer to the [Configure and Launch a Hyperparameter Tuning Job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-tuning-job.html) documentation.\n", "\n", "Note that the tuning job will take between 7 and 10 minutes to complete." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from time import gmtime, strftime, sleep\n", "\n", "tuning_job_name = \"DEMO-xgboost-parquet-\" + strftime(\"%d-%H-%M-%S\", gmtime())\n", "\n", "tuning_job_config = {\n", " \"ParameterRanges\": {\n", " \"CategoricalParameterRanges\": [],\n", " \"ContinuousParameterRanges\": [\n", " {\n", " \"MaxValue\": \"0.5\",\n", " \"MinValue\": \"0.1\",\n", " \"Name\": \"eta\",\n", " },\n", " {\n", " \"MaxValue\": \"5\",\n", " \"MinValue\": \"0\",\n", " \"Name\": \"gamma\",\n", " },\n", " {\n", " \"MaxValue\": \"120\",\n", " \"MinValue\": \"0\",\n", " \"Name\": \"min_child_weight\",\n", " },\n", " {\n", " \"MaxValue\": \"1\",\n", " \"MinValue\": \"0.5\",\n", " \"Name\": \"subsample\",\n", " },\n", " {\n", " \"MaxValue\": \"2\",\n", " \"MinValue\": \"0\",\n", " \"Name\": \"alpha\",\n", " },\n", " ],\n", " \"IntegerParameterRanges\": [\n", " {\n", " \"MaxValue\": \"10\",\n", " \"MinValue\": \"0\",\n", " \"Name\": \"max_depth\",\n", " },\n", " {\n", " \"MaxValue\": \"4000\",\n", " \"MinValue\": \"1\",\n", " \"Name\": \"num_round\",\n", " },\n", " ],\n", " },\n", " # SageMaker sets the following default limits for resources used by automatic model tuning:\n", " # https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-limits.html\n", " \"ResourceLimits\": {\n", " # Increase the maximum number of training jobs for better accuracy (at the cost of total training time).\n", " \"MaxNumberOfTrainingJobs\": 6,\n", " # Increase the number of parallel training jobs to reduce total tuning time, subject to your account limits.\n", " # If MaxNumberOfTrainingJobs equals MaxParallelTrainingJobs, Bayesian search degenerates to random search.\n", " \"MaxParallelTrainingJobs\": 2,\n", " },\n", " \"Strategy\": \"Bayesian\",\n", " \"HyperParameterTuningJobObjective\": {\"MetricName\": \"validation:rmse\", \"Type\": \"Minimize\"},\n", "}\n", "\n", "training_job_definition = {\n", " \"AlgorithmSpecification\": {\"TrainingImage\": container, \"TrainingInputMode\": \"File\"},\n", " \"InputDataConfig\": [\n", " {\n", " \"ChannelName\": \"train\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": f\"{bucket_path}/{prefix}/training\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"application/x-parquet\",\n", " \"CompressionType\": \"None\",\n", " },\n", " {\n", " \"ChannelName\": \"validation\",\n", " \"DataSource\": {\n", " \"S3DataSource\": {\n", " \"S3DataType\": \"S3Prefix\",\n", " \"S3Uri\": f\"{bucket_path}/{prefix}/validation\",\n", " \"S3DataDistributionType\": \"FullyReplicated\",\n", " }\n", " },\n", " \"ContentType\": \"application/x-parquet\",\n", " \"CompressionType\": \"None\",\n", " },\n", " ],\n", " \"OutputDataConfig\": {\"S3OutputPath\": f\"{bucket_path}/{prefix}/single-xgboost\"},\n", " \"ResourceConfig\": {\"InstanceCount\": 1, \"InstanceType\": \"ml.m5.2xlarge\", \"VolumeSizeInGB\": 5},\n", " \"RoleArn\": role,\n", " \"StaticHyperParameters\": {\n", " \"objective\": \"reg:squarederror\",\n", " \"verbosity\": \"2\",\n", " },\n", " \"StoppingCondition\": {\"MaxRuntimeInSeconds\": 43200},\n", "}\n", "\n", "print(\n", " f\"Creating a tuning job with name: {tuning_job_name}. It will take between 7 and 10 minutes to complete.\"\n", ")\n", "client.create_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuning_job_name,\n", " HyperParameterTuningJobConfig=tuning_job_config,\n", " TrainingJobDefinition=training_job_definition,\n", ")\n", "\n", "status = client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)[\n", " \"HyperParameterTuningJobStatus\"\n", "]\n", "print(status)\n", "while status not in (\"Completed\", \"Failed\"):\n", " sleep(60)\n", " status = client.describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuning_job_name\n", " )[\"HyperParameterTuningJobStatus\"]\n", " print(status)" ] }
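, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "When the tuning job completes, you can see which hyperparameter combination performed best. The sketch below reads the `BestTrainingJob` section of the `DescribeHyperParameterTuningJob` response." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the best training job found by AMT and its tuned hyperparameters\n", "# (a minimal sketch; run it after the tuning job has completed).\n", "best = client.describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuning_job_name\n", ")[\"BestTrainingJob\"]\n", "print(\"Best training job:\", best[\"TrainingJobName\"])\n", "print(\"Objective value:\", best[\"FinalHyperParameterTuningJobObjectiveMetric\"][\"Value\"])\n", "for name, value in best[\"TunedHyperParameters\"].items():\n", " print(f\"{name} = {value}\")" ] }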
, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting Objective Metric" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the best training job found by AMT if tuning was run; otherwise use the standalone job\n", "if use_amt:\n", " training = client.describe_hyper_parameter_tuning_job(\n", " HyperParameterTuningJobName=tuning_job_name\n", " )[\"BestTrainingJob\"][\"TrainingJobName\"]\n", "else:\n", " training = training_job_name" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "from sagemaker.analytics import TrainingJobAnalytics\n", "\n", "metric_name = \"validation:rmse\"\n", "\n", "# Fetch the metric time series for the selected training job and plot it\n", "metrics_dataframe = TrainingJobAnalytics(\n", " training_job_name=training, metric_names=[metric_name]\n", ").dataframe()\n", "ax = metrics_dataframe.plot(\n", " kind=\"line\", figsize=(12, 5), x=\"timestamp\", y=\"value\", style=\"b.\", legend=False\n", ")\n", "ax.set_ylabel(metric_name);" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Notebook CI Test Results\n", "\n", "This notebook was tested in multiple regions. 
The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n", "\n", "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This ap-southeast-2 badge failed to load. 
Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n", "\n", "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/introduction_to_amazon_algorithms|xgboost_abalone|xgboost_parquet_input_training.ipynb)\n" ] } ], "metadata": { "anaconda-cloud": {}, "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science 3.0)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 4 }