# Regression with XGBoost




This notebook shows how to use the `@remote` decorator and `RemoteExecutor` from the SageMaker Python SDK
to delegate data processing and model training workloads to the SageMaker job platform.

## Install the dependencies

In [None]:
%pip install -r ./requirements.txt

In [None]:
import sagemaker
from sagemaker.remote_function import remote, RemoteExecutor

sm_session = sagemaker.Session()
s3_root_folder = f"s3://{sm_session.default_bucket()}/remote_function_demo/xgb_abalone"

## Data Set
The dataset you use is the [UCI Machine Learning Abalone Dataset](http://archive.ics.uci.edu/ml) [1]. The aim of this task is to predict the age of an abalone snail from its physical measurements, which makes it a regression problem.

The dataset contains several features: length (the longest shell measurement), diameter (the diameter perpendicular to length), height (the height with meat in the shell), whole_weight (the weight of whole abalone), shucked_weight (the weight of meat), viscera_weight (the gut weight after bleeding), shell_weight (the weight after being dried), sex ('M', 'F', 'I' where 'I' is Infant), and rings (integer).

The number of rings is a good approximation for age (age is rings + 1.5). However, obtaining this number requires cutting the shell through the cone, staining the section, and counting the rings under a microscope, which is time-consuming. The other physical measurements are easier to obtain, so you use the dataset to build a model that predicts `rings` from them.
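
As a small illustration of the relationship between rings and age stated above, the following sketch converts a ring count (measured or predicted) into an approximate age in years. The function name `rings_to_age` is hypothetical and not part of the notebook.

```python
def rings_to_age(rings: float) -> float:
    """Approximate abalone age in years from its ring count (age = rings + 1.5)."""
    return rings + 1.5


# A predicted ring count can be converted the same way as a measured one.
print(rings_to_age(10))  # prints 11.5
```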

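The notebook reads the dataset directly from the SageMaker example-files bucket for your Region, so you do not need to upload it to your own S3 bucket. The SageMaker session created above supplies the Region name used to locate that bucket.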

[1] Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

## Set up the configuration file path
Set the directory that contains the `config.yaml` file so that the `@remote` decorator can pick up its settings.

In [None]:
import os

# Set path to config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()
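
Because the remote jobs depend on the settings in `config.yaml`, it can help to confirm that the file is actually present in the configured directory before launching anything. The following is a minimal sketch of such a check; the helper name `find_config` is hypothetical, and the fallback to the current working directory is an assumption made only so the snippet runs on its own.

```python
import os
from pathlib import Path
from typing import Optional


def find_config(directory: str) -> Optional[Path]:
    """Return the path to config.yaml inside `directory`, or None if it is missing."""
    candidate = Path(directory) / "config.yaml"
    return candidate if candidate.is_file() else None


# Fall back to the current directory if the variable has not been set yet (assumption).
config_dir = os.environ.get("SAGEMAKER_USER_CONFIG_OVERRIDE", os.getcwd())
config_path = find_config(config_dir)
print("config.yaml found" if config_path else "config.yaml is missing")
```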

## Process the data set

The preprocessing function uses scikit-learn to do the following:

* Fill in missing values in the `sex` category and one-hot encode it so that it is suitable for training.
* Impute and standardize all numerical features, that is, every column except `sex` and the `rings` label.
* Split the data into training, validation, and test datasets.

In [None]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Since we get a headerless CSV file, we specify the column names here.
feature_columns_names = [
    "sex",
    "length",
    "diameter",
    "height",
    "whole_weight",
    "shucked_weight",
    "viscera_weight",
    "shell_weight",
]
label_column = "rings"

feature_columns_dtype = {
    "sex": str,
    "length": np.float64,
    "diameter": np.float64,
    "height": np.float64,
    "whole_weight": np.float64,
    "shucked_weight": np.float64,
    "viscera_weight": np.float64,
    "shell_weight": np.float64,
}
label_column_dtype = {"rings": np.float64}

input_path = f"s3://sagemaker-example-files-prod-{sm_session.boto_region_name}/datasets/tabular/uci_abalone/abalone.csv"


def merge_two_dicts(x, y):
    z = x.copy()
    z.update(y)
    return z


@remote(keep_alive_period_in_seconds=600)
def process():
    # Load the headerless CSV with explicit column names and dtypes.
    df = pd.read_csv(
        input_path,
        header=None,
        names=feature_columns_names + [label_column],
        dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype),
    )

    # Numeric features: impute missing values with the median, then standardize.
    numeric_features = list(feature_columns_names)
    numeric_features.remove("sex")
    numeric_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
    )

    # Categorical feature: fill missing values with a constant, then one-hot encode.
    categorical_features = ["sex"]
    categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]
    )

    preprocess = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ("cat", categorical_transformer, categorical_features),
        ]
    )

    y = df.pop("rings")
    x_pre = preprocess.fit_transform(df)
    y_pre = y.to_numpy().reshape(len(y), 1)

    # Put the label in column 0, followed by the transformed features.
    x = np.concatenate((y_pre, x_pre), axis=1)

    # Shuffle the rows, then split 70% / 15% / 15% into train, validation, and test.
    np.random.shuffle(x)
    train, validation, test = np.split(x, [int(0.7 * len(x)), int(0.85 * len(x))])

    return pd.DataFrame(train), pd.DataFrame(validation), pd.DataFrame(test)

In [None]:
train_df, validation_df, test_df = process()
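
The `np.split` call inside `process` cuts the shuffled rows at the 70% and 85% marks, which yields a 70/15/15 split with the label kept in column 0. The sketch below reproduces only that step on a toy array so the resulting sizes are easy to check; the array contents and the seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# 100 toy rows: column 0 plays the role of the label, column 1 of a feature.
toy = np.arange(200, dtype=float).reshape(100, 2)
rng.shuffle(toy)  # shuffles the rows in place (axis 0)

# Cut at 70% and 85% of the rows, exactly as in process().
train, validation, test = np.split(toy, [int(0.7 * len(toy)), int(0.85 * len(toy))])
print(len(train), len(validation), len(test))  # prints 70 15 15
```

Every row ends up in exactly one of the three partitions, because `np.split` only slices the already shuffled array.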

## Train the XGBoost model and tune hyperparameters

In [None]:
import xgboost


def train(
    train_df,
    validation_df,
    *,
    num_round=50,
    objective="reg:squarederror",  # "reg:linear" is the deprecated name for this objective
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    use_gpu=False,
):
    # Column 0 holds the label (rings); the remaining columns are features.
    y_train = train_df.iloc[:, 0].to_numpy()
    x_train = train_df.iloc[:, 1:].to_numpy()
    train_dmatrix = xgboost.DMatrix(x_train, label=y_train)

    y_validation = validation_df.iloc[:, 0].to_numpy()
    x_validation = validation_df.iloc[:, 1:].to_numpy()
    validation_dmatrix = xgboost.DMatrix(x_validation, label=y_validation)

    param = {
        "objective": objective,
        "max_depth": max_depth,
        "eta": eta,
        "gamma": gamma,
        "min_child_weight": min_child_weight,
        "subsample": subsample,
        # Use the GPU-accelerated histogram algorithm only when requested
        "tree_method": "gpu_hist" if use_gpu else "hist",
    }

    evaluation_results = {}  # Filled by XGBoost with per-round metrics for each eval set
    booster = xgboost.train(
        param,
        train_dmatrix,
        num_round,
        evals=[(train_dmatrix, "train"), (validation_dmatrix, "validation")],
        early_stopping_rounds=5,
        evals_result=evaluation_results,
    )

    return booster, evaluation_results

In [None]:
with RemoteExecutor(max_parallel_jobs=2, keep_alive_period_in_seconds=60) as e:
    futures = []
    for max_depth in [3, 5, 7, 9]:
        futures.append(e.submit(train, train_df, validation_df, max_depth=max_depth))
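
Each submitted job yields a future whose `result()` is the tuple returned by `train`: the booster and a metrics dictionary. Rather than picking a fixed index, the candidates can be ranked by validation error. The sketch below shows that selection step locally, using the standard library's `ThreadPoolExecutor` as a stand-in for `RemoteExecutor` (only `submit` and `result` are relied on). The function `fake_train` and its RMSE curves are hypothetical, and the metrics layout `{"validation": {"rmse": [...]}}` assumes XGBoost's default RMSE metric for a squared-error objective.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_train(max_depth):
    """Hypothetical stand-in for train(): returns (model, evals_result-style dict)."""
    # Made-up validation RMSE curves, shaped like XGBoost's evals_result output.
    curves = {
        3: [3.1, 2.6, 2.4],
        5: [3.0, 2.4, 2.2],
        7: [2.9, 2.3, 2.25],
        9: [2.9, 2.35, 2.3],
    }
    return f"booster(max_depth={max_depth})", {"validation": {"rmse": curves[max_depth]}}


depths = [3, 5, 7, 9]
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(fake_train, depth) for depth in depths]

# result() blocks until the corresponding job has finished.
results = [future.result() for future in futures]


def best_rmse(result):
    """Lowest validation RMSE reached by one candidate."""
    _, metrics = result
    return min(metrics["validation"]["rmse"])


# Pick the candidate with the lowest validation RMSE.
best_index = min(range(len(results)), key=lambda i: best_rmse(results[i]))
best_model, best_metrics = results[best_index]
print(depths[best_index], best_rmse(results[best_index]))  # prints 5 2.2
```

Taking the minimum of each curve, rather than its last value, accounts for early stopping, where the final round is not necessarily the best one.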

## Evaluate the model on the test set

In [None]:
import numpy as np

from sklearn.metrics import mean_squared_error


@remote(keep_alive_period_in_seconds=600)
def evaluate(booster, test_df):
    y_test = test_df.iloc[:, 0].to_numpy()
    test_df.drop(test_df.columns[0], axis=1, inplace=True)
    x_test = test_df.to_numpy()

    predictions = booster.predict(xgboost.DMatrix(x_test))

    mse = mean_squared_error(y_test, predictions)
    std = np.std(y_test - predictions)

    return mse, std

In [None]:
# Retrieve the training result. Here we just pick the last trained model for demonstration.
trained_booster, evaluation_results = futures[3].result()

# Evaluate the trained model against the test data
evaluate(trained_booster, test_df)
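
`evaluate` returns two numbers: the mean squared error and the standard deviation of the residuals. The sketch below recomputes both by hand on made-up values to show what each one measures. It uses `statistics.pstdev`, which divides by n just as `np.std` does by default.

```python
import statistics

y_true = [10.0, 8.0, 12.0, 9.0]  # made-up ring counts
y_pred = [9.0, 9.0, 11.0, 9.0]   # made-up predictions

# Residuals: [1.0, -1.0, 1.0, 0.0]
residuals = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(r * r for r in residuals) / len(residuals)  # mean of the squared residuals
std = statistics.pstdev(residuals)                    # spread of residuals around their mean

print(mse, round(std, 4))  # prints 0.75 0.8292
```

The two values are related: the MSE equals the residual variance plus the squared mean residual, so an MSE well above `std ** 2` points to a systematic bias in the predictions.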

