# SageMaker Payment Classification 


---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

---


## Background 

This notebook demonstrates how to train and deploy a machine learning model that classifies payment transactions, enriching each financial transaction with its category. This can serve as an intermediate step in fraud detection, personalization, or anomaly detection, and as a way to give end users (e.g. customers at a bank) an overview of their spending habits. Amazon SageMaker is used to train and deploy an XGBoost model and to provision the required underlying infrastructure. This notebook uses a generated dataset in which a payment consists mainly of an amount, sender, receiver, and timestamp.


## Notebook overview 

This notebook consists of seven parts. First, we import and configure the required libraries. Next, we prepare the data used in this example and create the feature store. With the newly created features we train an XGBoost model and create an endpoint to host it. Finally, we evaluate the performance of the model and clean up the resources we used.

## Dataset 

For this notebook we use a synthetic dataset. This dataset has the following features:

* __transaction_category__: the category of the transaction, which is one of the following 19 options:

 'Uncategorized', 'Entertainment', 'Education',
 'Shopping', 'Personal Care', 'Health and Fitness',
 'Food and Dining', 'Gifts and Donations', 'Investments',
 'Bills and Utilities', 'Auto and Transport', 'Travel',
 'Fees and Charges', 'Business Services', 'Personal Services',
 'Taxes', 'Gambling', 'Home',
 'Pension and insurances'


* __receiver_id__: an identifier for the receiving party. The identifier consists of 16 digits.
* __sender_id__: an identifier for the sending party. The identifier consists of 16 digits.
* __amount__: the amount which is transferred.
* __timestamp__: the timestamp of the transaction in YYYY-MM-DD HH:MM:SS format.


### 1. Setup 

Before we start, we need to update the SageMaker library.

In [None]:
import sys

!{sys.executable} -m pip install --upgrade pip --quiet # upgrade pip to the latest version
!{sys.executable} -m pip install --upgrade sagemaker --quiet # upgrade SageMaker to the latest version

Now that we have the latest version we can import the libraries that we'll use in this notebook

In [None]:
import boto3
import io
import sagemaker
import time
import os

from time import sleep
from sklearn.metrics import classification_report
from sagemaker.feature_store.feature_group import FeatureGroup

import pandas as pd
import numpy as np

Let's set the session variables to ensure that SageMaker is configured correctly.

In [None]:
region = sagemaker.Session().boto_region_name
sm_client = boto3.client("sagemaker")
boto_session = boto3.Session(region_name=region)
sagemaker_session = sagemaker.session.Session(boto_session=boto_session, sagemaker_client=sm_client)
role = sagemaker.get_execution_role()
bucket_prefix = "payment-classification"
s3_bucket = sagemaker_session.default_bucket()

We define the factorize key, which is used to map the `transaction_category` values to numeric labels.

In [None]:
factorize_key = {
 "Uncategorized": 0,
 "Entertainment": 1,
 "Education": 2,
 "Shopping": 3,
 "Personal Care": 4,
 "Health and Fitness": 5,
 "Food and Dining": 6,
 "Gifts and Donations": 7,
 "Investments": 8,
 "Bills and Utilities": 9,
 "Auto and Transport": 10,
 "Travel": 11,
 "Fees and Charges": 12,
 "Business Services": 13,
 "Personal Services": 14,
 "Taxes": 15,
 "Gambling": 16,
 "Home": 17,
 "Pension and insurances": 18,
}
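Because the model works with numeric labels, it can be handy to keep the inverse mapping around so that predictions can be turned back into category names. The following is a minimal sketch on a shortened copy of the key; `factorize_key_demo`, `inverse_key` and `decode` are illustrative names that do not appear elsewhere in this notebook.

```python
# Shortened copy of the mapping above, for illustration only.
factorize_key_demo = {"Uncategorized": 0, "Entertainment": 1, "Education": 2, "Shopping": 3}

# Invert the mapping: numeric label -> category name.
inverse_key = {label: name for name, label in factorize_key_demo.items()}


def decode(labels):
    """Translate a list of numeric labels back into category names."""
    return [inverse_key[label] for label in labels]


print(decode([3, 0, 1]))  # ['Shopping', 'Uncategorized', 'Entertainment']
```

The inversion only works because every category maps to a distinct number.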

### 2. Data preparation 

We download the simulated data from the public SageMaker example-files S3 bucket:

In [None]:
s3 = boto3.client("s3")
s3.download_file(
 f"sagemaker-example-files-prod-{region}",
 "datasets/tabular/synthetic_financial/financial_transactions_mini.csv",
 "financial_transactions_mini.csv",
)

Let's start by loading the dataset from our csv file into a Pandas dataframe

In [None]:
data = pd.read_csv(
 "financial_transactions_mini.csv",
 parse_dates=["timestamp"],
 infer_datetime_format=True,
 dtype={"transaction_category": "string"},
)

The dataframe looks as follows:

| | transaction_category | receiver_id | sender_id | amount | timestamp |
|------:|:-----------------------|-----------------:|-----------------:|---------:|:--------------------|
| 39733 | Shopping | 4258863736072564 | 4630246970548037 | 91.58 | 2021-03-10 01:28:23 |
| 27254 | Shopping | 4356269497886716 | 4752313573239323 | 115.17 | 2021-01-22 23:28:24 |
| 30628 | Shopping | 4233636409552058 | 4635766441812956 | 90.98 | 2021-02-05 03:24:10 |
| 46614 | Shopping | 4054967431278644 | 4823810986511227 | 86.74 | 2021-04-02 14:42:45 |
| 37957 | Shopping | 4831814582525664 | 4254514582909482 | 123.27 | 2021-03-17 11:17:18 |
| 46878 | Shopping | 4425943481448900 | 4349267977109013 | 65.53 | 2021-03-17 15:47:49 |
| 81350 | Auto and Transport | 4146116413442105 | 4062723166078919 | 91.67 | 2021-03-29 13:23:44 |
| 10613 | Entertainment | 4788727923958282 | 4485838385631386 | 76.22 | 2021-02-11 17:45:53 |
| 46715 | Shopping | 4702782703461430 | 4944181591271506 | 86.67 | 2021-03-20 15:37:17 |
| 69110 | Investments | 4180233446952120 | 4702069426390603 | 530.39 | 2021-04-21 08:28:13 |

In [None]:
data.sample(10)

Next, we extract the year, month, day, hour, minute, second from the timestamp and remove the timestamp

In [None]:
data["year"] = data["timestamp"].dt.year
data["month"] = data["timestamp"].dt.month
data["day"] = data["timestamp"].dt.day
data["hour"] = data["timestamp"].dt.hour
data["minute"] = data["timestamp"].dt.minute
data["second"] = data["timestamp"].dt.second

del data["timestamp"]

We'll transform the transaction categories to numeric targets for the classification by factorization.

In [None]:
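# Map category names to integer labels; an unmapped category becomes NaN and makes astype(int) fail loudly.
data["transaction_category"] = data["transaction_category"].map(factorize_key).astype(int)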

### 3. Create feature store 

To enrich the dataset, we will use the [Feature Store](https://aws.amazon.com/sagemaker/feature-store/).

Before creating the feature store itself, we need to set a name for the feature group and for the record identifier.

In [None]:
feature_group_name = "feature-group-payment-classification"
record_identifier_feature_name = "identifier"

With the name we defined we create the feature group, runtime and session

In [None]:
feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sagemaker_session)

featurestore_runtime = boto_session.client(
 service_name="sagemaker-featurestore-runtime", region_name=region
)

feature_store_session = sagemaker.Session(
 boto_session=boto_session,
 sagemaker_client=sm_client,
 sagemaker_featurestore_runtime_client=featurestore_runtime,
)

Once we have defined our feature store we need to put some data in it. We create a Pandas dataframe with the columns mean_amount, count, identifier and event time to store in the feature store

In [None]:
columns = ["mean_amount", "count", "identifier", "EventTime"]
feature_store_data = pd.DataFrame(columns=columns, dtype=object)

feature_store_data["identifier"] = range(19)
feature_store_data["mean_amount"] = 0.0
feature_store_data["count"] = 1
feature_store_data["EventTime"] = time.time()

Using the created dataframe we set the feature definitions

In [None]:
feature_group.load_feature_definitions(data_frame=feature_store_data)

With these definitions ready we can create the feature group itself

In [None]:
feature_group.create(
 s3_uri=f"s3://{s3_bucket}/{bucket_prefix}",
 record_identifier_name=record_identifier_feature_name,
 event_time_feature_name="EventTime",
 role_arn=role,
 enable_online_store=True,
)

It takes a couple of minutes for the feature group to be created. We need to wait for this to finish before ingesting data into the feature store.

In [None]:
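status = feature_group.describe().get("FeatureGroupStatus")
while status == "Creating":
    print("Waiting for Feature Group to be Created")
    time.sleep(5)
    status = feature_group.describe().get("FeatureGroupStatus")
# Any status other than "Created" (e.g. "CreateFailed") means creation did not succeed.
if status != "Created":
    raise RuntimeError(f"FeatureGroup {feature_group.name} creation ended with status {status}")
print(f"FeatureGroup {feature_group.name} successfully created.")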
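The polling loop above waits for as long as the status stays "Creating". A sketch of a variant with an upper bound on the waiting time is shown below. `wait_for_status` and its parameters are hypothetical names introduced for this example; the status lookup is passed in as a function so the helper can be tried without AWS access.

```python
import time


def wait_for_status(get_status, pending="Creating", timeout_s=600, poll_s=5, sleep=time.sleep):
    """Poll get_status() until it stops returning `pending` or the timeout expires."""
    waited = 0
    status = get_status()
    while status == pending:
        if waited >= timeout_s:
            raise TimeoutError(f"Still '{pending}' after {timeout_s} seconds")
        sleep(poll_s)
        waited += poll_s
        status = get_status()
    return status


# Simulated status sequence standing in for the real describe() call.
fake_statuses = iter(["Creating", "Creating", "Created"])
print(wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None))  # Created
```

In this notebook the lookup would be `lambda: feature_group.describe().get("FeatureGroupStatus")`, and the returned status should still be compared against "Created" before ingesting data.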

Once the feature group is created we can ingest data into it

In [None]:
feature_group.ingest(data_frame=feature_store_data, max_workers=3, wait=True)

To retrieve data from our feature store we define a function that gets the current values from the feature store

In [None]:
def get_feature_store_values():
    response = featurestore_runtime.batch_get_record(
        Identifiers=[
            {
                "FeatureGroupName": feature_group_name,
                "RecordIdentifiersValueAsString": [str(i) for i in range(19)],
            }
        ]
    )

    columns = ["mean_amount", "count", "identifier", "EventTime"]

    # Look up each value by feature name rather than relying on the order of the returned features.
    records = [
        {feature["FeatureName"]: feature["ValueAsString"] for feature in resp["Record"]}
        for resp in response["Records"]
    ]
    feature_store_resp = pd.DataFrame(records, columns=columns)
    feature_store_resp["identifier"] = feature_store_resp["identifier"].astype(int)
    feature_store_resp["count"] = feature_store_resp["count"].astype(int)
    feature_store_resp["mean_amount"] = feature_store_resp["mean_amount"].astype(float)
    feature_store_resp["EventTime"] = feature_store_resp["EventTime"].astype(float)
    feature_store_resp = feature_store_resp.sort_values(by="identifier")

    return feature_store_resp


feature_store_resp = get_feature_store_values()

We update the values in the feature store with the real values of our data

In [None]:
feature_store_data = pd.DataFrame()
feature_store_data["mean_amount"] = data.groupby(["transaction_category"]).mean()["amount"]
feature_store_data["count"] = data.groupby(["transaction_category"]).count()["amount"]
feature_store_data["identifier"] = feature_store_data.index
feature_store_data["EventTime"] = time.time()

feature_store_data["mean_amount"] = (
 pd.concat([feature_store_resp, feature_store_data])
 .groupby("identifier")
 .apply(lambda x: np.average(x["mean_amount"], weights=x["count"]))
)
feature_store_data["count"] = (
 pd.concat([feature_store_resp, feature_store_data]).groupby("identifier").sum()["count"]
)

feature_group.ingest(data_frame=feature_store_data, max_workers=3, wait=True)
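The update above merges the stored statistics with the new batch by taking a count-weighted average of the means and summing the counts. The arithmetic for a single category can be sketched as follows; `merge_stats` is a hypothetical helper and the numbers are made up.

```python
def merge_stats(old_mean, old_count, new_mean, new_count):
    """Combine two (mean, count) pairs into the mean and count of the union."""
    total = old_count + new_count
    merged_mean = (old_mean * old_count + new_mean * new_count) / total
    return merged_mean, total


# Placeholder record (mean 0.0, count 1) merged with a batch of 4 payments averaging 50.0.
print(merge_stats(0.0, 1, 50.0, 4))  # (40.0, 5)
```

Note that the placeholder row ingested earlier (mean 0.0, count 1) behaves like one extra zero-valued payment, so the merged mean is pulled slightly toward zero.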

We then retrieve the updated values from the feature store and display them.

In [None]:
feature_store_data = get_feature_store_values()
feature_store_data

We use the feature store values to calculate, for every category, the distance between the category's average amount and the amount of the current transaction.

In [None]:
additional_features = pd.pivot_table(
 feature_store_data, values=["mean_amount"], index=["identifier"]
).T.add_suffix("_dist")
additional_features_columns = list(additional_features.columns)
data = pd.concat([data, pd.DataFrame(columns=additional_features_columns, dtype=object)])
data[additional_features_columns] = additional_features.values[0]
for col in additional_features_columns:
 data[col] = abs(data[col] - data["amount"])

data
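To make the new columns concrete, here is a small sketch of the same computation for a single payment, written in plain Python so it runs without the feature store. The category means are made up and `distance_features` is a hypothetical helper.

```python
# Made-up per-category means, keyed by numeric category identifier.
category_means = {0: 20.0, 1: 75.0, 2: 500.0}


def distance_features(amount, means):
    """Return {"<identifier>_dist": |mean - amount|} for every category."""
    return {f"{identifier}_dist": abs(mean - amount) for identifier, mean in means.items()}


print(distance_features(90.0, category_means))
# {'0_dist': 70.0, '1_dist': 15.0, '2_dist': 410.0}
```

A payment of 90.0 is closest to the mean of category 1, which is the kind of signal the classifier can pick up from these features.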

### 4. Create model 
In this notebook we will be using the [Extreme Gradient Boosting](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) (XGBoost) implementation of the gradient boosted trees algorithm. This model is selected for its relatively fast training time and explainable properties. It can be substituted at will with a different [SageMaker estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) or a [model of your choosing](https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/).



Now that we have the dataset we can start preparing the model. First, we create a training, validation and testing split.

In [None]:
# Randomly sort the data then split out first 70%, second 20%, and last 10%
train_data, validation_data, test_data = np.split(
 data.sample(frac=1, random_state=42), [int(0.7 * len(data)), int(0.9 * len(data))]
)
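The split logic itself, independent of pandas and NumPy, can be sketched as follows: shuffle once with a fixed seed, then cut at the 70% and 90% marks. `split_70_20_10` is a hypothetical helper operating on a plain list of rows.

```python
import random


def split_70_20_10(rows, seed=42):
    """Shuffle rows and cut them at the 70% and 90% marks."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = int(0.7 * len(shuffled))
    second_cut = int(0.9 * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]


train, validation, test = split_70_20_10(range(100))
print(len(train), len(validation), len(test))  # 70 20 10
```

Every row ends up in exactly one of the three sets, and the fixed seed makes the split reproducible.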

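We save these sets to CSV files without a header or index. The built-in XGBoost algorithm expects the label in the first column, which here is `transaction_category`.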

In [None]:
train_data.to_csv("train.csv", index=False, header=False)
validation_data.to_csv("validation.csv", index=False, header=False)
test_data.to_csv("test.csv", index=False, header=False)

We then upload the training and validation files to our S3 bucket.

In [None]:
boto3.Session().resource("s3").Bucket(s3_bucket).Object(
 os.path.join(bucket_prefix, "train/train.csv")
).upload_file("train.csv")
boto3.Session().resource("s3").Bucket(s3_bucket).Object(
 os.path.join(bucket_prefix, "validation/validation.csv")
).upload_file("validation.csv")

Get the XGBoost SageMaker image.

In [None]:
container = sagemaker.image_uris.retrieve(region=region, framework="xgboost", version="1.2-2")

Wrap the uploaded data in SageMaker training inputs.

In [None]:
s3_input_train = sagemaker.inputs.TrainingInput(
 s3_data="s3://{}/{}/train".format(s3_bucket, bucket_prefix), content_type="csv"
)
s3_input_validation = sagemaker.inputs.TrainingInput(
 s3_data="s3://{}/{}/validation/".format(s3_bucket, bucket_prefix), content_type="csv"
)

We define the XGBoost model

In [None]:
xgb = sagemaker.estimator.Estimator(
 container,
 role,
 instance_count=1,
 instance_type="ml.m4.xlarge",
 output_path="s3://{}/{}/output".format(s3_bucket, bucket_prefix),
 sagemaker_session=sagemaker_session,
)

Set the parameters

In [None]:
xgb.set_hyperparameters(
 max_depth=5,
 eta=0.2,
 gamma=4,
 min_child_weight=6,
 subsample=0.8,
 objective="multi:softprob",
 num_class=19,
 verbosity=0,
 num_round=100,
)

And train the model

In [None]:
xgb.fit({"train": s3_input_train, "validation": s3_input_validation})

### 5. Using the endpoint 

Deploy the model to an endpoint

In [None]:
xgb_predictor = xgb.deploy(
 initial_instance_count=1,
 instance_type="ml.m4.xlarge",
 serializer=sagemaker.serializers.CSVSerializer(),
)

### 6. Evaluate performance 

Run the model on our test data

In [None]:
def predict(data, predictor):
    predictions = []
    confidences = []
    for row in data:
        # The endpoint returns the class probabilities as text; skip the first character before parsing.
        response = np.fromstring(predictor.predict(row).decode("utf-8")[1:], sep=",")
        pred = response.argmax()
        confidence = max(response)
        predictions.append(pred)
        confidences.append(confidence)

    return predictions, confidences
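With the `multi:softprob` objective the model returns one probability per class, and the function above takes the index of the largest probability as the prediction and that probability as the confidence. The selection step can be sketched on a made-up probability vector; `top_class` is a hypothetical helper.

```python
def top_class(probabilities):
    """Return (index, probability) of the most likely class."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best_index, probabilities[best_index]


# Made-up probability vector for a 4-class problem.
print(top_class([0.05, 0.10, 0.80, 0.05]))  # (2, 0.8)
```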

Running it on the first 3 rows in our dataset results in the following:

In [None]:
pred, conf = predict(test_data.drop(["transaction_category"], axis=1).to_numpy()[:3], xgb_predictor)
print(
 f"The predictions for the first 3 entries are {pred}, the confidence for these predictions are {conf}"
)

Now we run the predictions on the complete test set.

In [None]:
predictions, confidences = predict(
 test_data.drop(["transaction_category"], axis=1).to_numpy(), xgb_predictor
)

And report the prediction results

In [None]:
print(
 classification_report(
 test_data["transaction_category"].to_list(), predictions, target_names=factorize_key
 )
)

You should see results similar to this:

```
 precision recall f1-score support

 Uncategorized 1.00 0.92 0.96 51
 Entertainment 0.81 0.89 0.85 1486
 Education 1.00 0.94 0.97 80
 Shopping 0.86 0.94 0.90 3441
 Personal Care 1.00 0.98 0.99 132
 Health and Fitness 0.99 0.89 0.94 443
 Food and Dining 0.99 0.82 0.90 918
 Gifts and Donations 1.00 0.95 0.97 275
 Investments 0.99 0.97 0.98 88
 Bills and Utilities 1.00 0.99 1.00 332
 Auto and Transport 0.94 0.84 0.88 1967
 Travel 0.96 0.84 0.90 120
 Fees and Charges 1.00 0.94 0.97 106
 Business Services 1.00 0.99 1.00 146
 Personal Services 1.00 0.96 0.98 75
 Taxes 0.98 0.94 0.96 47
 Gambling 1.00 1.00 1.00 15
 Home 0.98 0.89 0.93 168
Pension and insurances 0.99 1.00 1.00 110

 accuracy 0.90 10000
 macro avg 0.97 0.93 0.95 10000
 weighted avg 0.91 0.90 0.90 10000
```

### 7. Clean up 

Remove the feature group and endpoint to clean up

In [None]:
feature_group.delete()
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)
