# Data Preprocessing using RAPIDS and Training XGBoost for Fraud Detection

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

---



In this notebook we walk through using [RAPIDS](https://rapids.ai/about.html) for GPU-accelerated data preprocessing and for training an XGBoost model for a fraud detection use case. This is the first notebook in a two-notebook series. In the [second notebook](2_triton_xgb_fil_ensemble.ipynb) we show how to deploy the trained XGBoost model in Triton on SageMaker. The RAPIDS suite of open source software libraries and APIs gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.


**Note:** Since the primary goal of this example is to get a trained XGBoost model that illustrates deployment of tree-based ML models on Triton in SageMaker, we don't perform any in-depth feature engineering or hyperparameter optimization (HPO). RAPIDS on SageMaker is, however, well suited to running cost-effective HPO in a minimal amount of time, as shown in the blog post [RAPIDS and Amazon SageMaker: Scale up and scale out to tackle ML challenges](https://aws.amazon.com/blogs/machine-learning/rapids-and-amazon-sagemaker-scale-up-and-scale-out-to-tackle-ml-challenges/).

## To Run This Notebook Please Select RAPIDS 2106 Kernel from the Kernel Dropdown menu

This notebook was tested with the `rapids-2106` kernel on an Amazon SageMaker notebook instance of type `g4dn`.

## Get Data

For this example, we use the Tabformer [synthetic credit card transactions dataset](https://arxiv.org/abs/1910.03033) from IBM, available on [Kaggle](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions). The Kaggle page also documents the origin of this dataset and its licensing terms.


### Download Dataset

First we download the dataset from our Amazon S3 bucket.

In [None]:
!python -m pip install --upgrade pip --quiet
!pip install -U awscli --quiet

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/credit_card_transactions-ibm_v2.csv ./

## Check on our GPU

Next, let's check the GPU resources we have by using the terminal command `nvidia-smi`.

In [None]:
!nvidia-smi
!nvidia-smi -L

Awesome, we have a powerful NVIDIA GPU at our disposal. Let's start using it for data preprocessing.

## Data Preprocessing

In [None]:
import cudf
import cuml
import numpy as np
import pickle
import os

We read in the data and begin our data preprocessing.

In [None]:
data_path = "./"
data_csv = "credit_card_transactions-ibm_v2.csv"
full_data = cudf.read_csv(os.path.join(data_path, data_csv))
full_data.head()

Each row is a credit card transaction. Its attributes include the time and amount of the transaction, merchant attributes such as name, city, state, zip code, and Merchant Category Code (MCC), and a label indicating whether the transaction was fraudulent or legitimate (`Is Fraud?`).

**Note:** `Merchant Name` is hashed, which is why we see integers instead of strings.

The full dataset has about 24 million rows, but in this example we use a random 20% subset of roughly 5 million transactions.

In [None]:
SEED = 42
data = full_data.sample(frac=0.2, random_state=SEED)
data = data.reset_index(drop=True)
print(data.shape)

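We convert some categorical features that were read in as numbers to the `object` dtype so that they are treated as categories rather than numeric values.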

In [None]:
data["Zip"] = data["Zip"].astype("object")
data["MCC"] = data["MCC"].astype("object")
data["Merchant Name"] = data["Merchant Name"].astype("object")

### Encode labels


Next we encode our binary labels in `Is Fraud?`, which indicate whether a transaction is fraudulent or not. After encoding, `1` denotes a fraudulent transaction and `0` denotes a legitimate one.

In [None]:
y = data["Is Fraud?"]
data.drop(columns=["Is Fraud?"], inplace=True)
y = (y == "Yes").astype(int)
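
The compare-then-cast idiom used for `y` can be checked without a GPU. The following is a minimal plain-Python sketch, not the cuDF code itself; the sample values are made up, and the `"No"` string for legitimate transactions is an assumption (the cell above only tests for `"Yes"`).

```python
# Hypothetical label values; the real column holds one string per transaction.
raw_labels = ["No", "No", "Yes", "No", "Yes"]

# Same idea as (y == "Yes").astype(int): compare, then cast the boolean to 0/1.
encoded = [int(label == "Yes") for label in raw_labels]

# Fraction of positive (fraud) labels in this toy sample.
fraud_rate = sum(encoded) / len(encoded)
print(encoded, fraud_rate)
```

Any value other than `"Yes"` maps to `0`, so unexpected label spellings would silently count as legitimate.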

### Save subset for inference

We also save a small subset of the data, which we use later to submit Triton inference requests in the [second notebook](2_triton_xgb_fil_ensemble.ipynb).

In [None]:
data_infer = data.iloc[625:630]
data_infer.to_csv("data_infer.csv", index=False)

### Handle Missing Values

Next let's handle the missing values in our data.

In [None]:
data.isna().sum() / len(data) * 100
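
The expression above reports, for each column, the percentage of rows with a missing value. A small plain-Python sketch of the same calculation is shown here; the rows are invented for illustration and `missing_percent` is a hypothetical helper, not part of cuDF.

```python
# Toy rows standing in for the dataframe; None marks a missing value.
rows = [
    {"Merchant City": "ONLINE", "Merchant State": None, "Zip": None},
    {"Merchant City": "La Verne", "Merchant State": "CA", "Zip": "91750"},
    {"Merchant City": "Rome", "Merchant State": "Italy", "Zip": None},
    {"Merchant City": "Monterey Park", "Merchant State": "CA", "Zip": "91754"},
]


def missing_percent(rows):
    """Return {column: percent of rows whose value is None}."""
    columns = rows[0].keys()
    return {
        col: 100 * sum(row[col] is None for row in rows) / len(rows)
        for col in columns
    }


pct = missing_percent(rows)
print(pct)
```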

We have some missing values in the `Merchant State` and `Zip` columns. It turns out that these correspond to ONLINE transactions, so we set those missing values to `ONLINE`.

In [None]:
data.loc[data["Merchant City"] == "ONLINE", "Merchant State"] = "ONLINE"
data.loc[data["Merchant City"] == "ONLINE", "Zip"] = "ONLINE"

We also have some foreign transactions, where `Merchant City` and `Merchant State` hold a foreign city and country and the zip code is missing. For those transactions we set `Zip` to `FOREIGN`.

In [None]:
us_states_plus_online = [
 "AK",
 "AL",
 "AR",
 "AZ",
 "CA",
 "CO",
 "CT",
 "DC",
 "DE",
 "FL",
 "GA",
 "HI",
 "IA",
 "ID",
 "IL",
 "IN",
 "KS",
 "KY",
 "LA",
 "MA",
 "MD",
 "ME",
 "MI",
 "MN",
 "MO",
 "MS",
 "MT",
 "NC",
 "ND",
 "NE",
 "NH",
 "NJ",
 "NM",
 "NV",
 "NY",
 "OH",
 "OK",
 "OR",
 "PA",
 "RI",
 "SC",
 "SD",
 "TN",
 "TX",
 "UT",
 "VA",
 "VT",
 "WA",
 "WI",
 "WV",
 "WY",
 "ONLINE",
]

# set zip of all transactions that are not in US States or Online to Foreign
data.loc[~data["Merchant State"].isin(us_states_plus_online), "Zip"] = "FOREIGN"
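
The two fill rules (ONLINE first, then FOREIGN) can be expressed per transaction. The sketch below is plain Python for illustration only: `fill_location` is a hypothetical helper, the state set is abbreviated, and the sample values are made up.

```python
# Abbreviated stand-in for the full us_states_plus_online list.
US_STATES_PLUS_ONLINE = {"CA", "NY", "TX", "ONLINE"}


def fill_location(city, state, zip_code):
    """Apply the ONLINE rule, then the FOREIGN rule, to one transaction."""
    if city == "ONLINE":
        state, zip_code = "ONLINE", "ONLINE"
    # Anything that is neither a US state nor ONLINE is treated as foreign.
    if state not in US_STATES_PLUS_ONLINE:
        zip_code = "FOREIGN"
    return state, zip_code


print(fill_location("ONLINE", None, None))
print(fill_location("Rome", "Italy", None))
print(fill_location("La Verne", "CA", "91750"))
```

The order matters: the ONLINE rule has to run first so that online transactions are not relabelled as `FOREIGN` by the second rule.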

The `Errors?` column indicates whether the transaction had any errors associated with it, such as an incorrect PIN. A missing value here means no error was recorded, so we turn the column into a boolean indicator feature.

In [None]:
data["Errors?"] = data["Errors?"].notna()

In [None]:
data.isna().sum() / len(data) * 100

So now we have handled all the missing values in our data.

### Handle Amount and Time

Next, we remove the dollar symbol prefix from the `Amount` column and extract the hour and minute from the `Time` column.

In [None]:
data["Amount"] = data["Amount"].str.slice(1)
data["Hour"] = data["Time"].str.slice(stop=2)
data["Minute"] = data["Time"].str.slice(start=3)
data.drop(columns=["Time"], inplace=True)
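
To make the slicing concrete, here is a plain-Python sketch of the same string operations on a single value. It assumes `Amount` looks like `$134.09` and `Time` looks like `HH:MM`; the sample values and the helper name are illustrative.

```python
def split_amount_and_time(amount, time):
    """Strip the leading '$' and split 'HH:MM' into hour and minute strings."""
    amount_str = amount[1:]  # mirrors .str.slice(1)
    hour = time[:2]          # mirrors .str.slice(stop=2)
    minute = time[3:]        # mirrors .str.slice(start=3), skipping the ':'
    return amount_str, hour, minute


amount_str, hour, minute = split_amount_and_time("$134.09", "06:21")
# The pieces are still strings at this point; a numeric cast happens later.
print(float(amount_str), int(hour), int(minute))
```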

### Train-Test Split

Before doing any further preprocessing, let's perform the train-test split. Here we use a 70-30 train-test split, stratified on the label so that both sets keep the same proportion of fraudulent transactions.

In [None]:
from cuml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
 data, y, test_size=0.3, random_state=SEED, stratify=y
)
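
The effect of `stratify=y` is that each class is split at the same ratio. The following plain-Python sketch illustrates that idea on toy labels; it is not cuML's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 90 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)
test_fraud = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_fraud)
```

With a rare positive class, an unstratified split could leave the test set with very few fraud cases, which would make the evaluation metric noisy.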

In [None]:
# Free up some room on the GPU by explicitly deleting dataframes
import gc

del data
del y
gc.collect()

### Encoding Categorical Columns

Next, we handle the categorical columns in our dataset by performing [label encoding](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=label%20encoder#feature-and-label-encoding-single-gpu) on them, which converts categorical values into numerical values. Some of these columns contain unseen values, which are present in the test data but not in the training data. We set those values to `UNKNOWN` before label encoding, so that at test time we have an encoding for them.

We also serialize the encodings for all categorical columns so that we can later use them for doing data preprocessing at inference time in the [second notebook](2_triton_xgb_fil_ensemble.ipynb).

In [None]:
from cuml.preprocessing import LabelEncoder

categorical_columns = ["Zip", "MCC", "Merchant Name", "Use Chip", "Merchant City", "Merchant State"]
encoders = {}

# handle unknown values present in test data but not in training data
for col in categorical_columns:
    # convert cudf series to numpy array with .values_host
    unique_values = X_train[col].unique().values_host
    X_test.loc[~X_test[col].isin(unique_values), col] = "UNKNOWN"
    unique_values = np.append(unique_values, ["UNKNOWN"])
    # convert numpy array to cudf series
    unique_values = cudf.Series(unique_values)
    le = LabelEncoder().fit(unique_values)
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
    # keep the fitted classes so the same encoding can be reused at inference time
    encoders[col] = le.classes_.values_host

with open("label_encoders.pkl", "wb") as f:
    pickle.dump(encoders, f)
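
The `UNKNOWN` handling can be illustrated on a toy column. The sketch below is plain Python and makes one assumption that should be checked against the library: that a value's integer code is its position in the saved class array. `fit_classes` and `encode` are hypothetical helpers, not cuML functions.

```python
def fit_classes(train_values):
    """Sorted unique training categories plus an 'UNKNOWN' catch-all."""
    return sorted(set(train_values) | {"UNKNOWN"})


def encode(values, classes):
    """Map each value to its class index, routing unseen values to 'UNKNOWN'."""
    index = {cls: i for i, cls in enumerate(classes)}
    return [index.get(v, index["UNKNOWN"]) for v in values]


train_states = ["CA", "NY", "CA", "ONLINE"]
test_states = ["NY", "Italy", "CA"]  # "Italy" never appears in training

classes = fit_classes(train_states)
test_codes = encode(test_states, classes)
print(classes)
print(test_codes)
```

The same lookup could be applied at inference time to the class arrays stored in `label_encoders.pkl`, so that a category never seen during training still receives a valid code.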

In [None]:
# convert all dtypes to fp32 for xgboost training
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")

Let's look at our preprocessed data.

In [None]:
X_train.head()

## Train XGBoost

Now we train the XGBoost fraud detection model on our GPU. This takes about 2-3 minutes on a `g4dn.xlarge` instance.

In [None]:
import xgboost as xgb
import time

dtrain = xgb.DMatrix(X_train, y_train)

dtest = xgb.DMatrix(X_test, y_test)

max_depth = 8
num_trees = 2000
xgb_params = {
 "max_depth": max_depth,
 "tree_method": "gpu_hist",
 "objective": "binary:logistic",
 "eval_metric": "aucpr",
 "predictor": "gpu_predictor",
}
model = xgb.train(params=xgb_params, dtrain=dtrain, num_boost_round=num_trees)

We quickly evaluate our trained model's predictions on the test set using F1-score.

In [None]:
from sklearn.metrics import f1_score

y_score = model.predict(dtest)
threshold = 0.5
# binarize the predicted probabilities using the threshold defined above
y_pred = (y_score >= threshold).astype(int)
y_true = y_test.values_host
f1 = f1_score(y_true, y_pred)
print(f"Test F1-Score: {f1:0.4f}")
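
For reference, the F1-score is the harmonic mean of precision and recall on the positive (fraud) class, which makes it more informative than accuracy when positives are rare. The sketch below computes it by hand on made-up scores; `f1_at_threshold` is a hypothetical helper, not a scikit-learn function.

```python
def f1_at_threshold(y_true, y_score, threshold=0.5):
    """Binarize scores at the threshold and compute F1 for the positive class."""
    y_pred = [int(score >= threshold) for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predicted probabilities for six transactions.
toy_true = [0, 0, 1, 1, 1, 0]
toy_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

f1_default = f1_at_threshold(toy_true, toy_score, threshold=0.5)
f1_lower = f1_at_threshold(toy_true, toy_score, threshold=0.3)
print(f"{f1_default:0.4f} {f1_lower:0.4f}")
```

Lowering the threshold in this toy example recovers the missed fraud case and raises F1, which shows that the 0.5 cutoff is a choice rather than a fixed part of the model.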

We could do further hyperparameter tuning and feature engineering to improve the model's accuracy. However, since the primary goal of this example is to walk through the deployment of decision tree-based ML models like XGBoost on Triton in SageMaker, we save our trained model and move on to the [second notebook](2_triton_xgb_fil_ensemble.ipynb).

### Save Trained Model

In [None]:
model_path = "./xgboost.json"
model.save_model(model_path)

## Next Step

Please open the [second notebook](2_triton_xgb_fil_ensemble.ipynb) to learn how to deploy this XGBoost model and other similar decision tree-based ML models on Triton in SageMaker.

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/sagemaker-triton|fil_ensemble|1_prep_rapids_train_xgb.ipynb)
