# XGBoost Algorithm

**Note**: The contents of this folder were taken from the AWS Blog Post: [Build XGBoost models with Amazon Redshift ML](https://aws.amazon.com/blogs/machine-learning/build-xgboost-models-with-amazon-redshift-ml/).

## Introduction

In this folder, we have a working example of creating models in Amazon Redshift using the [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) algorithm type.

The XGBoost algorithm (eXtreme Gradient Boosting) is an optimized open-source implementation of the gradient boosted trees algorithm. XGBoost is designed from the ground up to handle many data science problems in a highly efficient, flexible, portable, and accurate way. The XGBoost algorithm can be used for regression, binary classification, multi-class classification, and ranking problems.

## Prerequisites

Use the [CloudFormation template](../cloud-formation/cloud-formation-template.yaml) to set up your Amazon Redshift cluster with Redshift ML, or create an IAM role that allows Amazon Redshift to communicate with Amazon SageMaker.

## Use Case

To illustrate the XGBoost algorithm, we use the [banknote authentication dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication), a binary classification problem in which we predict whether a given banknote is genuine or forged. Image recognition and the Wavelet Transform tool were used to extract spectral features of each banknote image, such as variance, skewness, kurtosis, and entropy, from banknote specimens; these features serve as the input parameters for the model. Based on these measurements, you can employ ML methodologies to distinguish an original note from a forged one.

### Using Amazon SageMaker Notebook

You may follow the step-by-step instructions in the [xgboost-model notebook](./xgboost-model.ipynb), which is auto-provisioned by the [CloudFormation template](../cloud-formation/cloud-formation-template.yaml).

## Data Preparation

Please use the [data preparation SQL file](./data_preparation.sql) to populate the input tables for this example; a sketch of what that setup looks like is shown below.
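The exact statements live in the SQL file above; the following is only a minimal sketch of that preparation, assuming the table and column names used by the `CREATE MODEL` and inference queries in this example. The S3 location and IAM role ARN are placeholders you must replace with your own.

```sql
-- Sketch: training and test tables with the columns referenced later in this example.
CREATE TABLE banknoteauthentication_train (
    variance FLOAT,
    skewness FLOAT,
    curtosis FLOAT,
    entrophy FLOAT,
    class    INT
);

CREATE TABLE banknoteauthentication_test (
    variance FLOAT,
    skewness FLOAT,
    curtosis FLOAT,
    entrophy FLOAT,
    class    INT
);

-- Load each split from S3; bucket, prefix, and IAM role are placeholders.
COPY banknoteauthentication_train
FROM 's3://<your-bucket>/banknote_authentication/train_data/'
IAM_ROLE '<your-redshift-iam-role-arn>'
CSV;
```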
## Create Model

We can use the [create model](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_MODEL.html) statement in Amazon Redshift to create our ML model using XGBoost. We create the model in Redshift ML with `AUTO OFF` and `XGBoost` as the model type. We specify the `class` column as the target (label) that we want to predict and `func_model_banknoteauthentication_xgboost_binary` as the function name. As an advanced data scientist, you can also specify options such as the `objective` and `hyperparameters`. Amazon SageMaker creates the function `func_model_banknoteauthentication_xgboost_binary`, which we use for inference in Amazon Redshift. See the following code:

```sql
--Create model
CREATE MODEL model_banknoteauthentication_xgboost_binary
FROM banknoteauthentication_train
TARGET class
FUNCTION func_model_banknoteauthentication_xgboost_binary
IAM_ROLE '<>'
AUTO OFF
MODEL_TYPE xgboost
OBJECTIVE 'binary:logistic'
PREPROCESSORS 'none'
HYPERPARAMETERS DEFAULT EXCEPT(NUM_ROUND '100')
SETTINGS(S3_BUCKET '<>');
```

## Evaluation

In this step, we evaluate the accuracy of our ML model against the test data in the `banknoteauthentication_test` table. While creating the model, Amazon SageMaker Autopilot automatically splits the input data into train and validation sets and selects the model with the best objective metric, which is then deployed in the Amazon Redshift cluster. You can use the `show model` statement in your cluster to view various metrics, including the accuracy score.

```sql
SHOW MODEL model_banknoteauthentication_xgboost_binary;
```

XGBoost with `AUTO OFF` reports the `train:error` metric in the `SHOW MODEL` output. Note that `train:error` is an error rate rather than an accuracy score: for example, a value of `0.01` indicates that roughly 1% of the training examples were misclassified, i.e., the model is close to 99% accurate.

For binary and multi-class classification problems, we compute accuracy as the model metric:

```
accuracy = (sum(actual == predicted) / total) * 100
```

You may run the query below to view the model accuracy (it returns the fraction rather than the percentage):

```sql
-- check accuracy
WITH infer_data AS (
    SELECT class AS label,
        func_model_banknoteauthentication_xgboost_binary(variance, skewness, curtosis, entrophy) AS predicted,
        CASE
            WHEN label IS NULL THEN 0
            ELSE label
        END AS actual,
        CASE
            WHEN actual = predicted THEN 1::INT
            ELSE 0::INT
        END AS correct
    FROM banknoteauthentication_test),
aggr_data AS (
    SELECT SUM(correct) AS num_correct,
        COUNT(*) AS total
    FROM infer_data)
SELECT (num_correct::FLOAT / total::FLOAT) AS accuracy
FROM aggr_data;
```

## ML Predictions

Let's run the prediction query on the `banknoteauthentication_test` table to get the count of original vs. counterfeit banknotes:

```sql
--check prediction
WITH infer_data AS (
    SELECT func_model_banknoteauthentication_xgboost_binary(variance, skewness, curtosis, entrophy) AS predicted
    FROM banknoteauthentication_test
)
SELECT CASE
        WHEN predicted = '0' THEN 'Original banknote'
        WHEN predicted = '1' THEN 'Counterfeit banknote'
        ELSE 'NA'
    END AS banknote_authentication,
    COUNT(1) AS count
FROM infer_data
GROUP BY 1;
```
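The same inference function can also score ad-hoc values outside the test table. The following is a minimal sketch; the four feature values are hypothetical and only illustrate the call signature:

```sql
-- Hypothetical wavelet features for a single banknote; replace with real measurements.
SELECT func_model_banknoteauthentication_xgboost_binary(3.2, 5.8, -1.4, 0.7) AS predicted_class;
```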