{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "01b5c703",
   "metadata": {},
   "source": [
    "# SageMaker Payment Classification \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6498f087",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c2e49281",
   "metadata": {},
   "source": [
    "\n",
    "## Background <a class=\"anchor\" id=\"Background\"></a>\n",
    "\n",
    "This notebook demonstrates how you can train and deploy a machine learning model to classify payment transactions. Enriching financial transactions with the category of the transaction. This can be used as an intermediate step in fraud detection, personalization or anomaly detection. As well as a method to provide end users (e.g. customers at a bank) with an overview of their spending habits. Amazon SageMaker can be used to train and deploy a XGBoost model, as well as the required underlying infrastructure. For this notebook a generated dataset is used where a payment consists of mostly an amount, sender, receiver and timestamp.\n",
    "\n",
    "\n",
    "## Notebook overview <a class=\"anchor\" id=\"Notebook-overview\"></a>\n",
    "\n",
    "This notebook consists of seven parts. First, we import and configure the required libraries. After that we prepare the data used in this example and create the feature store. With the newly created features we create a XGBoost model. An endpoint is created to host this model. We evaluate the performance of the model and end by cleaning up the used resources.\n",
    "\n",
    "## Dataset <a class=\"anchor\" id=\"Dataset\"></a>\n",
    "\n",
    "For this notebook we use a synthetic dataset. This dataset has the following features \n",
    "\n",
    "* __transaction_category__: The category of the transaction, this is one of the next 19 options.\n",
    "\n",
    "               'Uncategorized', 'Entertainment', 'Education',\n",
    "                    'Shopping', 'Personal Care', 'Health and Fitness',\n",
    "             'Food and Dining', 'Gifts and Donations', 'Investments',\n",
    "         'Bills and Utilities', 'Auto and Transport', 'Travel',\n",
    "            'Fees and Charges', 'Business Services', 'Personal Services',\n",
    "                       'Taxes', 'Gambling', 'Home',\n",
    "      'Pension and insurances'\n",
    "\n",
    "\n",
    "* __receiver_id__: an identifier for the receiving party. The identifier consist of 16 numbers.\n",
    "* __sender_id__: an identifier for the sending party. The identifier consist of 16 numbers.\n",
    "* __amount__: the amount which is transferred.\n",
    "* __timestamp__: the timestamp of the transaction in YYYY-MM-DD HH:MM:SS format.\n",
    "\n",
    "\n",
    "### 1. Setup <a class=\"anchor\" id=\"Setup\"></a>\n",
    "\n",
    "Before we start we need to update the sagemaker library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fff19d6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "\n",
    "!{sys.executable} -m pip install --upgrade pip       --quiet # upgrade pip to the latest vesion\n",
    "!{sys.executable} -m pip install --upgrade sagemaker --quiet # upgrade SageMaker to the latest vesion"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1b17a94d",
   "metadata": {},
   "source": [
    "Now that we have the latest version we can import the libraries that we'll use in this notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42c5d6d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import io\n",
    "import sagemaker\n",
    "import time\n",
    "import os\n",
    "\n",
    "from time import sleep\n",
    "from sklearn.metrics import classification_report\n",
    "from sagemaker.feature_store.feature_group import FeatureGroup\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3af7c33d",
   "metadata": {},
   "source": [
    "Let's set the session variables to ensure that SageMaker is configured correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0e4db17",
   "metadata": {},
   "outputs": [],
   "source": [
    "region = sagemaker.Session().boto_region_name\n",
    "sm_client = boto3.client(\"sagemaker\")\n",
    "boto_session = boto3.Session(region_name=region)\n",
    "sagemaker_session = sagemaker.session.Session(boto_session=boto_session, sagemaker_client=sm_client)\n",
    "role = sagemaker.get_execution_role()\n",
    "bucket_prefix = \"payment-classification\"\n",
    "s3_bucket = sagemaker_session.default_bucket()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4fe6a975",
   "metadata": {},
   "source": [
    "We define the factorize key which is used to map the '__transaction_category__' to numeric values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43946b9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "factorize_key = {\n",
    "    \"Uncategorized\": 0,\n",
    "    \"Entertainment\": 1,\n",
    "    \"Education\": 2,\n",
    "    \"Shopping\": 3,\n",
    "    \"Personal Care\": 4,\n",
    "    \"Health and Fitness\": 5,\n",
    "    \"Food and Dining\": 6,\n",
    "    \"Gifts and Donations\": 7,\n",
    "    \"Investments\": 8,\n",
    "    \"Bills and Utilities\": 9,\n",
    "    \"Auto and Transport\": 10,\n",
    "    \"Travel\": 11,\n",
    "    \"Fees and Charges\": 12,\n",
    "    \"Business Services\": 13,\n",
    "    \"Personal Services\": 14,\n",
    "    \"Taxes\": 15,\n",
    "    \"Gambling\": 16,\n",
    "    \"Home\": 17,\n",
    "    \"Pension and insurances\": 18,\n",
    "}"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5e3dc3c4",
   "metadata": {},
   "source": [
    "### 2. Data preparation <a class=\"anchor\" id=\"Data-preparation\"></a>\n",
    "\n",
    "We ingest the simulated data from the public SageMaker S3 training database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ff0d280",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3 = boto3.client(\"s3\")\n",
    "s3.download_file(\n",
    "    f\"sagemaker-example-files-prod-{region}\",\n",
    "    \"datasets/tabular/synthetic_financial/financial_transactions_mini.csv\",\n",
    "    \"financial_transactions_mini.csv\",\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "08578d93",
   "metadata": {},
   "source": [
    "Let's start by loading the dataset from our csv file into a Pandas dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a477abd7",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\n",
    "    \"financial_transactions_mini.csv\",\n",
    "    parse_dates=[\"timestamp\"],\n",
    "    infer_datetime_format=True,\n",
    "    dtype={\"transaction_category\": \"string\"},\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cf6be447",
   "metadata": {},
   "source": [
    "The dataframe looks as follows:\n",
    "\n",
    "| | transaction_category | receiver_id | sender_id | amount | timestamp |\n",
    "|------:|:-----------------------|-----------------:|-----------------:|---------:|:--------------------|\n",
    "| 39733 | Shopping | 4258863736072564 | 4630246970548037 | 91.58 | 2021-03-10 01:28:23 |\n",
    "| 27254 | Shopping | 4356269497886716 | 4752313573239323 | 115.17 | 2021-01-22 23:28:24 |\n",
    "| 30628 | Shopping | 4233636409552058 | 4635766441812956 | 90.98 | 2021-02-05 03:24:10 |\n",
    "| 46614 | Shopping | 4054967431278644 | 4823810986511227 | 86.74 | 2021-04-02 14:42:45 |\n",
    "| 37957 | Shopping | 4831814582525664 | 4254514582909482 | 123.27 | 2021-03-17 11:17:18 |\n",
    "| 46878 | Shopping | 4425943481448900 | 4349267977109013 | 65.53 | 2021-03-17 15:47:49 |\n",
    "| 81350 | Auto and Transport | 4146116413442105 | 4062723166078919 | 91.67 | 2021-03-29 13:23:44 |\n",
    "| 10613 | Entertainment | 4788727923958282 | 4485838385631386 | 76.22 | 2021-02-11 17:45:53 |\n",
    "| 46715 | Shopping | 4702782703461430 | 4944181591271506 | 86.67 | 2021-03-20 15:37:17 |\n",
    "| 69110 | Investments | 4180233446952120 | 4702069426390603 | 530.39 | 2021-04-21 08:28:13 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "558fa01c",
   "metadata": {},
   "outputs": [],
   "source": [
    "data.sample(10)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b5492919",
   "metadata": {},
   "source": [
    "Next, we extract the year, month, day, hour, minute, second from the timestamp and remove the timestamp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24f6090e",
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"year\"] = data[\"timestamp\"].dt.year\n",
    "data[\"month\"] = data[\"timestamp\"].dt.month\n",
    "data[\"day\"] = data[\"timestamp\"].dt.day\n",
    "data[\"hour\"] = data[\"timestamp\"].dt.hour\n",
    "data[\"minute\"] = data[\"timestamp\"].dt.minute\n",
    "data[\"second\"] = data[\"timestamp\"].dt.second\n",
    "\n",
    "del data[\"timestamp\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f7314f8a",
   "metadata": {},
   "source": [
    "We'll transform the transaction categories to numeric targets for the classification by factorization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea2ebdd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "data[\"transaction_category\"] = data[\"transaction_category\"].replace(factorize_key)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9b9ed583",
   "metadata": {},
   "source": [
    "### 3. Create feature store <a class=\"anchor\" id=\"Create-feature-store\"></a>\n",
    "\n",
    "To enrich dataset we will use the [Feature Store](https://aws.amazon.com/sagemaker/feature-store/). "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "233b862a",
   "metadata": {},
   "source": [
    "Before creating the feature store itself we need to set a name for the feature group and identifier used"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9df53a9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_group_name = \"feature-group-payment-classification\"\n",
    "record_identifier_feature_name = \"identifier\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8d9b663f",
   "metadata": {},
   "source": [
    "With the name we defined we create the feature group, runtime and session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1aef7f91",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_group = FeatureGroup(name=feature_group_name, sagemaker_session=sagemaker_session)\n",
    "\n",
    "featurestore_runtime = boto_session.client(\n",
    "    service_name=\"sagemaker-featurestore-runtime\", region_name=region\n",
    ")\n",
    "\n",
    "feature_store_session = sagemaker.Session(\n",
    "    boto_session=boto_session,\n",
    "    sagemaker_client=sm_client,\n",
    "    sagemaker_featurestore_runtime_client=featurestore_runtime,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3f3d69f5",
   "metadata": {},
   "source": [
    "Once we have defined our feature store we need to put some data in it. We create a Pandas dataframe with the columns mean_amount, count, identifier and event time to store in the feature store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1a250da",
   "metadata": {},
   "outputs": [],
   "source": [
    "columns = [\"mean_amount\", \"count\", \"identifier\", \"EventTime\"]\n",
    "feature_store_data = pd.DataFrame(columns=columns, dtype=object)\n",
    "\n",
    "feature_store_data[\"identifier\"] = range(19)\n",
    "feature_store_data[\"mean_amount\"] = 0.0\n",
    "feature_store_data[\"count\"] = 1\n",
    "feature_store_data[\"EventTime\"] = time.time()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dea2565b",
   "metadata": {},
   "source": [
    "Using the created dataframe we set the feature definitions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "292571c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_group.load_feature_definitions(data_frame=feature_store_data)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2a845e07",
   "metadata": {},
   "source": [
    "With these definitions ready we can create the feature group itself"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d046eeb4",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_group.create(\n",
    "    s3_uri=f\"s3://{s3_bucket}/{bucket_prefix}\",\n",
    "    record_identifier_name=record_identifier_feature_name,\n",
    "    event_time_feature_name=\"EventTime\",\n",
    "    role_arn=role,\n",
    "    enable_online_store=True,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "95ad73b1",
   "metadata": {},
   "source": [
    "It takes a couple of minutes for the feature group to be created, we need to wait for this to be done before trying to ingest data in the feature store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "530865ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
    "while status == \"Creating\":\n",
    "    print(\"Waiting for Feature Group to be Created\")\n",
    "    time.sleep(5)\n",
    "    status = feature_group.describe().get(\"FeatureGroupStatus\")\n",
    "print(f\"FeatureGroup {feature_group.name} successfully created.\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3df88321",
   "metadata": {},
   "source": [
    "Once the feature group is created we can ingest data into it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8168ebd5",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_group.ingest(data_frame=feature_store_data, max_workers=3, wait=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "485d1906",
   "metadata": {},
   "source": [
    "To retrieve data from our feature store we define a function that gets the current values from the feature store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f36a576",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_feature_store_values():\n",
    "    response = featurestore_runtime.batch_get_record(\n",
    "        Identifiers=[\n",
    "            {\n",
    "                \"FeatureGroupName\": feature_group_name,\n",
    "                \"RecordIdentifiersValueAsString\": [str(i) for i in range(19)],\n",
    "            }\n",
    "        ]\n",
    "    )\n",
    "\n",
    "    columns = [\"mean_amount\", \"count\", \"identifier\", \"EventTime\"]\n",
    "\n",
    "    feature_store_resp = pd.DataFrame(\n",
    "        data=[\n",
    "            [resp[\"Record\"][i][\"ValueAsString\"] for i in range(len(columns))]\n",
    "            for resp in response[\"Records\"]\n",
    "        ],\n",
    "        columns=columns,\n",
    "    )\n",
    "    feature_store_resp[\"identifier\"] = feature_store_resp[\"identifier\"].astype(int)\n",
    "    feature_store_resp[\"count\"] = feature_store_resp[\"count\"].astype(int)\n",
    "    feature_store_resp[\"mean_amount\"] = feature_store_resp[\"mean_amount\"].astype(float)\n",
    "    feature_store_resp[\"EventTime\"] = feature_store_resp[\"EventTime\"].astype(float)\n",
    "    feature_store_resp = feature_store_resp.sort_values(by=\"identifier\")\n",
    "\n",
    "    return feature_store_resp\n",
    "\n",
    "\n",
    "feature_store_resp = get_feature_store_values()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b5e4834e",
   "metadata": {},
   "source": [
    "We update the values in the feature store with the real values of our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb025e68",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_store_data = pd.DataFrame()\n",
    "feature_store_data[\"mean_amount\"] = data.groupby([\"transaction_category\"]).mean()[\"amount\"]\n",
    "feature_store_data[\"count\"] = data.groupby([\"transaction_category\"]).count()[\"amount\"]\n",
    "feature_store_data[\"identifier\"] = feature_store_data.index\n",
    "feature_store_data[\"EventTime\"] = time.time()\n",
    "\n",
    "feature_store_data[\"mean_amount\"] = (\n",
    "    pd.concat([feature_store_resp, feature_store_data])\n",
    "    .groupby(\"identifier\")\n",
    "    .apply(lambda x: np.average(x[\"mean_amount\"], weights=x[\"count\"]))\n",
    ")\n",
    "feature_store_data[\"count\"] = (\n",
    "    pd.concat([feature_store_resp, feature_store_data]).groupby(\"identifier\").sum()[\"count\"]\n",
    ")\n",
    "\n",
    "feature_group.ingest(data_frame=feature_store_data, max_workers=3, wait=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e2f6395f",
   "metadata": {},
   "source": [
    "And display them after getting them from the feature store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10b23bf6",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_store_data = get_feature_store_values()\n",
    "feature_store_data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cf148985",
   "metadata": {},
   "source": [
    "We use the feature store to calculate the distance between the average of every category and the current amount"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a3e85de",
   "metadata": {},
   "outputs": [],
   "source": [
    "additional_features = pd.pivot_table(\n",
    "    feature_store_data, values=[\"mean_amount\"], index=[\"identifier\"]\n",
    ").T.add_suffix(\"_dist\")\n",
    "additional_features_columns = list(additional_features.columns)\n",
    "data = pd.concat([data, pd.DataFrame(columns=additional_features_columns, dtype=object)])\n",
    "data[additional_features_columns] = additional_features.values[0]\n",
    "for col in additional_features_columns:\n",
    "    data[col] = abs(data[col] - data[\"amount\"])\n",
    "\n",
    "data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "289eeca6",
   "metadata": {},
   "source": [
    "### 4. Create model <a class=\"anchor\" id=\"Create-model\"></a>\n",
    "In this notebook we will be using the [Extreme Gradient Boosting](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) (XGBoost) implementation of the gradient boosted trees algorithm. This model is selected due to it relatively fast training time and explainable properties. The model can be substituted at will a different [SageMaker estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) or a [model of your choosing](https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/).\n",
    "\n",
    "\n",
    "\n",
    "Now that we have the dataset we can start preparing the model. First, we create a training, validation and testing split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb4bdd8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Randomly sort the data then split out first 70%, second 20%, and last 10%\n",
    "train_data, validation_data, test_data = np.split(\n",
    "    data.sample(frac=1, random_state=42), [int(0.7 * len(data)), int(0.9 * len(data))]\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f81f65b9",
   "metadata": {},
   "source": [
    "We save these sets to a file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f849a7a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data.to_csv(\"train.csv\", index=False, header=False)\n",
    "validation_data.to_csv(\"validation.csv\", index=False, header=False)\n",
    "test_data.to_csv(\"test.csv\", index=False, header=False)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "de669936",
   "metadata": {},
   "source": [
    "And upload these files to our s3 bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1ca2543",
   "metadata": {},
   "outputs": [],
   "source": [
    "boto3.Session().resource(\"s3\").Bucket(s3_bucket).Object(\n",
    "    os.path.join(bucket_prefix, \"train/train.csv\")\n",
    ").upload_file(\"train.csv\")\n",
    "boto3.Session().resource(\"s3\").Bucket(s3_bucket).Object(\n",
    "    os.path.join(bucket_prefix, \"validation/validation.csv\")\n",
    ").upload_file(\"validation.csv\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "22de532f",
   "metadata": {},
   "source": [
    "Get the XGBoost sagemaker image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a41b6a7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "container = sagemaker.image_uris.retrieve(region=region, framework=\"xgboost\", version=\"1.2-2\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "66cae2a9",
   "metadata": {},
   "source": [
    "Transform our data to a sagemaker input for training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e51c917a",
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_input_train = sagemaker.inputs.TrainingInput(\n",
    "    s3_data=\"s3://{}/{}/train\".format(s3_bucket, bucket_prefix), content_type=\"csv\"\n",
    ")\n",
    "s3_input_validation = sagemaker.inputs.TrainingInput(\n",
    "    s3_data=\"s3://{}/{}/validation/\".format(s3_bucket, bucket_prefix), content_type=\"csv\"\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6f2985d8",
   "metadata": {},
   "source": [
    "We define the XGBoost model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92c1fe8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb = sagemaker.estimator.Estimator(\n",
    "    container,\n",
    "    role,\n",
    "    instance_count=1,\n",
    "    instance_type=\"ml.m4.xlarge\",\n",
    "    output_path=\"s3://{}/{}/output\".format(s3_bucket, bucket_prefix),\n",
    "    sagemaker_session=sagemaker_session,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ecafdfe8",
   "metadata": {},
   "source": [
    "Set the parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "582adc6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb.set_hyperparameters(\n",
    "    max_depth=5,\n",
    "    eta=0.2,\n",
    "    gamma=4,\n",
    "    min_child_weight=6,\n",
    "    subsample=0.8,\n",
    "    objective=\"multi:softprob\",\n",
    "    num_class=19,\n",
    "    verbosity=0,\n",
    "    num_round=100,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b36463dd",
   "metadata": {},
   "source": [
    "And train the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c24e06fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb.fit({\"train\": s3_input_train, \"validation\": s3_input_validation})"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8b716cd7",
   "metadata": {},
   "source": [
    "### 5. Using the endpoint <a class=\"anchor\" id=\"Using-the-endpoint\"></a>\n",
    "\n",
    "Deploy the model to an endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84c6a5b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_predictor = xgb.deploy(\n",
    "    initial_instance_count=1,\n",
    "    instance_type=\"ml.m4.xlarge\",\n",
    "    serializer=sagemaker.serializers.CSVSerializer(),\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "712f4d35",
   "metadata": {},
   "source": [
    "### 6. Evaluate performance <a class=\"anchor\" id=\"Evaluate-performance\"></a>\n",
    "\n",
    "Run the model on our test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca4f7e49",
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(data, predictor):\n",
    "    predictions = []\n",
    "    confidences = []\n",
    "    for row in data:\n",
    "        response = np.fromstring(predictor.predict(row).decode(\"utf-8\")[1:], sep=\",\")\n",
    "        pred = response.argmax()\n",
    "        confidence = max(response)\n",
    "        predictions.extend([pred])\n",
    "        confidences.extend([confidence])\n",
    "\n",
    "    return predictions, confidences"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "37c255b1",
   "metadata": {},
   "source": [
    "Running it on the first 3 rows in our dataset results in the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc80ee89",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred, conf = predict(test_data.drop([\"transaction_category\"], axis=1).to_numpy()[:3], xgb_predictor)\n",
    "print(\n",
    "    f\"The predictions for the first 3 entries are {pred}, the confidence for these predictions are {conf}\"\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "91dff2af",
   "metadata": {},
   "source": [
    "Now we run the predictions on the complete dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1207448a",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions, confidences = predict(\n",
    "    test_data.drop([\"transaction_category\"], axis=1).to_numpy(), xgb_predictor\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d04c57fd",
   "metadata": {},
   "source": [
    "And report the prediction results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "895d2f35",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    classification_report(\n",
    "        test_data[\"transaction_category\"].to_list(), predictions, target_names=factorize_key\n",
    "    )\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "98d0b67e",
   "metadata": {},
   "source": [
    "You should see results similar to this:\n",
    "\n",
    "```\n",
    "                        precision    recall  f1-score   support\n",
    "\n",
    "         Uncategorized       1.00      0.92      0.96        51\n",
    "         Entertainment       0.81      0.89      0.85      1486\n",
    "             Education       1.00      0.94      0.97        80\n",
    "              Shopping       0.86      0.94      0.90      3441\n",
    "         Personal Care       1.00      0.98      0.99       132\n",
    "    Health and Fitness       0.99      0.89      0.94       443\n",
    "       Food and Dining       0.99      0.82      0.90       918\n",
    "   Gifts and Donations       1.00      0.95      0.97       275\n",
    "           Investments       0.99      0.97      0.98        88\n",
    "   Bills and Utilities       1.00      0.99      1.00       332\n",
    "    Auto and Transport       0.94      0.84      0.88      1967\n",
    "                Travel       0.96      0.84      0.90       120\n",
    "      Fees and Charges       1.00      0.94      0.97       106\n",
    "     Business Services       1.00      0.99      1.00       146\n",
    "     Personal Services       1.00      0.96      0.98        75\n",
    "                 Taxes       0.98      0.94      0.96        47\n",
    "              Gambling       1.00      1.00      1.00        15\n",
    "                  Home       0.98      0.89      0.93       168\n",
    "Pension and insurances       0.99      1.00      1.00       110\n",
    "\n",
    "              accuracy                           0.90     10000\n",
    "             macro avg       0.97      0.93      0.95     10000\n",
    "          weighted avg       0.91      0.90      0.90     10000\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "49fdc82d",
   "metadata": {},
   "source": [
    "### 7. Clean up <a class=\"anchor\" id=\"Clean-up\"></a>\n",
    "\n",
    "Remove the feature group and endpoint to clean up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f79b1164",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_group.delete()\n",
    "xgb_predictor.delete_endpoint(delete_endpoint_config=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e04b6fa6",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/use-cases|financial_payment_classification|financial_payment_classification.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science 2.0)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/sagemaker-data-science-38"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}