{ "cells": [ { "cell_type": "markdown", "metadata": { "_cell_guid": "02adde30-2633-41f1-a672-af8ff83a1b02", "_uuid": "22754d0cc0847be93bf947c7998d7fb65a2817d7" }, "source": [ "# Text Feature Engineering Notebook\n", " \n", "This notebook demonstrates feature engineering for text features in a dataset. It runs locally to show the iterative process of feature engineering.\n", "\n", "The objective is to accurately predict if a pacs.008 XML payment message will be successfully processed without exception, thereby resulting in a payment transfer from the debtor to the creditor. The target label is `Success` (i.e. 1) or `Failure` (i.e. 0).\n", "\n", "This notebook shows how to create new features from a text feature, combining the new features with other numerical or categorical features in the dataset. We will also check along the way whether these new derived features help us in prediction. \n", "\n", "We'll start with the following steps:\n", "1. Select features from the full labeled dataset for training.\n", "1. Perform some basic data analysis and visualization of the selected features.\n", "1. Clean the data, e.g. impute missing values.\n", "1. Dive deep into the feature engineering of a text feature.\n", "\n", "Let's start." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install Python Packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!pip install nltk\n", "!conda install nltk -y\n", "#!pip install xgboost\n", "!conda install xgboost -y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Labeled Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sm_client = boto3.Session().client('sagemaker')\n", "sm_session = sagemaker.Session()\n", "region = boto3.session.Session().region_name\n", "\n", "role = get_execution_role()\n", "print(\"Notebook is running with assumed role {}\".format(role))\n", "print(\"Working with AWS services in the {} region\".format(region))\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Working directory for the notebook\n", "WORKDIR = os.getcwd()\n", "BASENAME = os.path.dirname(WORKDIR)\n", "print(f\"WORKDIR: {WORKDIR}\")\n", "print(f\"BASENAME: {BASENAME}\")\n", "\n", "# Create a directory for storing local data\n", "iso20022_data_path = 'iso20022-data'\n", "if not os.path.exists(iso20022_data_path):\n", " # Create a new directory because it does not exist \n", " os.makedirs(iso20022_data_path)\n", "\n", "# Store all prototype assets in this bucket\n", "s3_bucket_name = 'iso20022-prototype-t3'\n", "s3_bucket_uri = 's3://' + s3_bucket_name\n", "\n", "# Prefix for all files in this prototype\n", "prefix = 'iso20022'\n", "\n", "pacs008_prefix = prefix + '/pacs008'\n", "raw_data_prefix = pacs008_prefix + '/raw-data'\n", "labeled_data_prefix = pacs008_prefix + '/labeled-data'\n", "\n", "\n", "labeled_data_location = s3_bucket_uri + '/' + labeled_data_prefix\n", "print(f\"Raw labeled data location = {labeled_data_location}\")\n", "\n", "# Download labeled raw dataset from S3\n", "s3_client = boto3.client('s3')\n", "s3_client.download_file(s3_bucket_name, labeled_data_prefix + '/labeled_data.csv', 'iso20022-data/labeled_data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Select Features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {
"_cell_guid": "b31f62fb-bde8-410e-972c-3c092f22d497", "_uuid": "0fcdf81ce439d2215892af58f839edfc0ca80a91" }, "outputs": [], "source": [ "import numpy as np # linear algebra\n", "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import nltk\n", "nltk.download('stopwords')\n", "from nltk.corpus import stopwords\n", "import string\n", "import xgboost as xgb\n", "from xgboost import XGBClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer\n", "from sklearn.decomposition import TruncatedSVD\n", "from sklearn import ensemble, metrics, model_selection, naive_bayes\n", "\n", "color = sns.color_palette()\n", "\n", "%matplotlib inline\n", "\n", "eng_stopwords = set(stopwords.words(\"english\"))\n", "pd.options.mode.chained_assignment = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select Features from Labeled Raw Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "add1b71c-e802-408f-8f62-7ea71ed155cf", "_uuid": "ae86f9515b3be4956e223db4c2472c2c0c40d9fb", "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "## Read the train and test dataset and check the top few lines ##\n", "labeled_raw_df = pd.read_csv(\"iso20022-data/labeled_data.csv\")\n", "\n", "fts=[\n", " 'y_target', \n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForCdtrAgt_InstrInf',\n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForCdtrAgt_Cd', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_Nm',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_PstCd', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_StrtNm',\n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_TwnNm',\n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_Nm',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_PstCd', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_StrtNm', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_TwnNm', \n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd', \n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry', \n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Nm',\n", "# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Tp', \n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", "]\n", "\n", "# New data frame with selected features\n", "selected_df = labeled_raw_df[fts]\n", " \n", "selected_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rename columns\n", "selected_df = selected_df.rename(columns={\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf': 'InstrForNxtAgt',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry': 'Dbtr_PstlAdr_Ctry',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry': 'Cdtr_PstlAdr_Ctry',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd': 'RgltryRptg_DbtCdtRptgInd',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry': 'RgltryRptg_Authrty_Ctry',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd': 'RgltryRptg_Dtls_Cd'\n", 
"})\n", "\n", "selected_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into Training and Test Datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(selected_df, selected_df['y_target'], test_size=0.20, random_state=299, shuffle=True)\n", "train_df = X_train\n", "test_df = X_test\n", "\n", "print(\"Number of rows in train dataset : \",train_df.shape[0])\n", "print(\"Number of rows in test dataset : \",test_df.shape[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "35e22f81-77f9-438e-ba50-803aea15ad14", "_uuid": "83a9dfc6af30753f82f07ef6162d3b9ba155b06d" }, "outputs": [], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basic Data Analysis" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "5bee9e9a-5fbf-4ee6-9e92-0c77821fe77d", "_uuid": "d7682e0812990e3409db6076e7570ce984e466a1" }, "source": [ "We can check the number of occurrence of each debtor country to see if the classes are balanced. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "95529569-3380-4794-8248-caf5e8c67f6a", "_uuid": "045fe5712b26a72855dad61f05c947ab829a295b", "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "dbtr_cntry_count = train_df['Dbtr_PstlAdr_Ctry'].value_counts()\n", "\n", "plt.figure(figsize=(8,4))\n", "sns.barplot(dbtr_cntry_count.index, dbtr_cntry_count.values, alpha=0.8)\n", "plt.ylabel('Number of Occurrences', fontsize=12)\n", "plt.xlabel('Debtor Country Name', fontsize=12)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cdtr_cntry_count = train_df['Cdtr_PstlAdr_Ctry'].value_counts()\n", "\n", "plt.figure(figsize=(8,4))\n", "sns.barplot(cdtr_cntry_count.index, cdtr_cntry_count.values, alpha=0.8)\n", "plt.ylabel('Number of Occurrences', fontsize=12)\n", "plt.xlabel('Creditor Country Name', fontsize=12)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "bff5de3a-b92f-4192-b142-60d0367e1bb2", "_uuid": "7b1ddc8bf782ae87ed5c24fdb6edbff986255d60" }, "source": [ "**Check missing values:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's do a quick check on a few samples of text feature `InstrForNxtAgt - Instructions for Next Agent` by each debtor and creditor country:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped_df = train_df.groupby('Dbtr_PstlAdr_Ctry')\n", "for name, group in grouped_df:\n", " print(\"Debtor Country name : \", name)\n", " cnt = 0\n", " for ind, row in group.iterrows():\n", " print(row[\"InstrForNxtAgt\"])\n", " cnt += 1\n", " if cnt == 5:\n", " break\n", " print(\"\\n\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "6c85f84f-d0ee-444e-8b5f-2bf62452624a", "_uuid": "cc6b841404940a0f84481c76f5c2328b373416e1", "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "grouped_df = train_df.groupby('Cdtr_PstlAdr_Ctry')\n", "for name, group in grouped_df:\n", " print(\"Creditor Country name : \", name)\n", " cnt = 0\n", " for 
ind, row in group.iterrows():\n", " print(row[\"InstrForNxtAgt\"])\n", " cnt += 1\n", " if cnt == 5:\n", " break\n", " print(\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few special characters present in the text data primarily to indicate special service (/SVC/) or regulatory text (/REG/). Hence count of these special characters might be a good feature." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "84398b75-325f-4259-90ea-bb6293cc5235", "_uuid": "d5739c4a5967d2bcb9326d39529f84791c09dfb8" }, "source": [ "# Feature Engineering\n", "\n", "### Impute missing values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Perform simple Imputation for missing values in training dataset\n", "# Traing dataset\n", "train_df['InstrForNxtAgt'].fillna('none', inplace=True)\n", "train_df['RgltryRptg_DbtCdtRptgInd'].fillna('none',inplace=True)\n", "train_df['RgltryRptg_Authrty_Ctry'].fillna('none',inplace=True)\n", "train_df['RgltryRptg_Dtls_Cd'].fillna('none',inplace=True)\n", "\n", "train_df.isna().sum()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test dataset\n", "test_df['InstrForNxtAgt'].fillna('none', inplace=True)\n", "test_df['RgltryRptg_DbtCdtRptgInd'].fillna('none',inplace=True)\n", "test_df['RgltryRptg_Authrty_Ctry'].fillna('none',inplace=True)\n", "test_df['RgltryRptg_Dtls_Cd'].fillna('none',inplace=True)\n", "\n", "test_df.isna().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assign Datatypes \n", "\n", "**Pandas Datatypes**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Categorical data transformation.\n", "\n", "categorical_fts=[\n", "# 'InstrForCdtrAgt_Cd',\n", " 'Dbtr_PstlAdr_Ctry', \n", " 'Cdtr_PstlAdr_Ctry',\n", " 'RgltryRptg_DbtCdtRptgInd', \n", " 'RgltryRptg_Authrty_Ctry', \n", " 'RgltryRptg_Dtls_Cd'\n", "]\n", "# Convert categorical features to categorical data type.\n", "for col in categorical_fts:\n", " train_df[col] = pd.Categorical(train_df[col])\n", "\n", "# Convert from original feature values to categorical values\n", "for col in categorical_fts:\n", " train_df[col] = train_df[col].cat.codes\n", "\n", "train_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Encode Target Label**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "label_encoder = LabelEncoder()\n", "train_df['y_target'] = label_encoder.fit_transform(train_df['y_target'])\n", "\n", "mapping = dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))\n", "print(f\"Label Mapping: {mapping}\")\n", "\n", "train_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test dataset\n", "categorical_fts=[\n", "# 'InstrForCdtrAgt_Cd',\n", " 'Dbtr_PstlAdr_Ctry', \n", " 'Cdtr_PstlAdr_Ctry',\n", " 'RgltryRptg_DbtCdtRptgInd', \n", " 'RgltryRptg_Authrty_Ctry', \n", " 'RgltryRptg_Dtls_Cd'\n", "]\n", "# Convert categorical features to categorical data type.\n", "for col in categorical_fts:\n", " test_df[col] = pd.Categorical(test_df[col])\n", "\n", "# Convert from original feature values to categorical 
values\n", "for col in categorical_fts:\n", " test_df[col] = test_df[col].cat.codes\n", "\n", "test_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test dataset\n", "\n", "label_encoder = LabelEncoder()\n", "test_df['y_target'] = label_encoder.fit_transform(test_df['y_target'])\n", "test_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"No. of train_df Columns: {len(train_df.columns)}\")\n", "print(train_df.columns)\n", "\n", "print(f\"No. of test_df Columns: {len(test_df.columns)}\")\n", "print(test_df.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Text Feature Engineering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a couple of approaches to feature engineering text:\n", "\n", " 1. Meta features - features based on statistics of the text, such as the number of words, mean word length, number of stop words, number of punctuations, number of upper case words and so on. \n", " 1. Text based features - features derived directly from the content of the text, i.e. the words themselves, e.g. term frequency, tf-idf, word2vec/fasttext/blazingtext etc.\n", "\n", "We will create both types of features below. \n", "\n", "### Meta Features\n", "\n", "The meta features are:\n", "1. Number of words in the text\n", "1. Number of unique words in the text\n", "1. Mean word length\n", "1. Number of characters in the text\n", "1. Number of stopwords \n", "1. Number of punctuations\n", "1. Number of upper case words\n", "\n", "We use [nltk](https://www.nltk.org/) stop words to compute some of these meta features. For more info see the [nltk book](https://www.nltk.org/book/ch02.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "5758c1ff-ca4d-4c66-8c85-ac1725bd631b", "_uuid": "ec015064f318cfd7d0a405bd28cd002f72724bbe" }, "outputs": [], "source": [ "## Number of words in the text ##\n", "train_df[\"num_words\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x).split()))\n", "test_df[\"num_words\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x).split()))\n", "\n", "## Number of unique words in the text ##\n", "train_df[\"num_unique_words\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: len(set(str(x).split())))\n", "test_df[\"num_unique_words\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: len(set(str(x).split())))\n", "\n", "## Number of characters in the text ##\n", "train_df[\"num_chars\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x)))\n", "test_df[\"num_chars\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x)))\n", "\n", "## Number of stopwords in the text ##\n", "train_df[\"num_stopwords\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))\n", "test_df[\"num_stopwords\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))\n", "\n", "## Number of punctuations in the text ##\n", "train_df[\"num_punctuations\"] = train_df['InstrForNxtAgt'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))\n", "test_df[\"num_punctuations\"] = test_df['InstrForNxtAgt'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))\n", "\n", "## Average length of the words in the text ##\n", "train_df[\"mean_word_len\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "test_df[\"mean_word_len\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "\n", "## Number of upper case words in the text ##\n", "train_df[\"num_words_upper\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))\n", "test_df[\"num_words_upper\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))\n", "\n", "print(f\"No. of train_df Columns: {len(train_df.columns)}\")\n", "print(train_df.columns)\n", "\n", "print(f\"No. of test_df Columns: {len(test_df.columns)}\")\n", "print(test_df.columns)" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "164b0d54-3d9e-4b58-ae7c-20c447efdc69", "_uuid": "d79f96166a7fe609fd8431b341b2d5ad189ca07a" }, "source": [ "Let us now plot a couple of these new features to see if they will be helpful in predictions. We use a [violin plot](https://en.wikipedia.org/wiki/Violin_plot)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "fe9a91d9-3df7-4e1b-9e69-f701cc714a15", "_uuid": "51962b52791977deefab8b7991d59a51211c8a5e" }, "outputs": [], "source": [ "train_df['num_words'].loc[train_df['num_words']>8] = 8 #truncation for better visuals\n", "plt.figure(figsize=(12,8))\n", "sns.violinplot(x='y_target', y='num_words', data=train_df)\n", "plt.xlabel('y_target', fontsize=12)\n", "plt.ylabel('Number of words in text', fontsize=12)\n", "plt.title(\"Number of words by y_target\", fontsize=15)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "66833874-08f2-48b8-9561-b54de997071d", "_uuid": "abe5dcadd42cd45169e44e3e299c303d5197b8ab" }, "source": [ "This is highly text dependent; it might have some benefit, but we are not sure yet." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "91514e74-d734-4642-a9dd-470cb8a65733", "_uuid": "1c4c0950d33e29d270fbc021aa8c5ff0e7ab205b" }, "outputs": [], "source": [ "train_df['num_chars'].loc[train_df['num_chars']>15] = 15 #truncation for better visuals\n", "plt.figure(figsize=(12,8))\n", "sns.violinplot(x='y_target', y='num_chars', data=train_df)\n", "plt.xlabel('y_target', fontsize=12)\n", "plt.ylabel('Number of characters in text', fontsize=12)\n", "plt.title(\"Number of characters by y_target\", fontsize=15)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "3a622bfc-da8b-4f13-b9d8-a63dd79a5218", "_uuid": "240a9ac16cf5d342e6d01cb020f1d8cea670ec84" }, "source": [ "We are not sure if this feature helps either.\n", "\n", "Here we build an XGBoost model to check how much these meta features are helping. \n", "\n", "**Drop** the `InstrForNxtAgt` column before training as it has been feature engineered."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "d97fc321-32a9-429e-9135-ae52300332ec", "_uuid": "c27a85339ef5cf5ac3c87527249f878b615d3996" }, "outputs": [], "source": [ "### Prepare the data for modeling ###\n", "#train_y = train_df['y_target']\n", "#train_id = train_df['id'].values\n", "#test_id = test_df['id'].values\n", "\n", "### Recompute the truncated variables ###\n", "train_df[\"num_words\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x).split()))\n", "test_df[\"num_words\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x).split()))\n", "train_df[\"num_chars\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x)))\n", "test_df[\"num_chars\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: len(str(x)))\n", "train_df[\"mean_word_len\"] = train_df[\"InstrForNxtAgt\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "test_df[\"mean_word_len\"] = test_df[\"InstrForNxtAgt\"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))\n", "\n", "# Drop the `InstrForNxtAgt` column before training as it has been feature engineered\n", "cols_to_drop = ['y_target','InstrForNxtAgt']\n", "\n", "train_X = train_df.drop(cols_to_drop, axis=1)\n", "train_y = train_df['y_target']\n", "\n", "test_X = test_df.drop(cols_to_drop, axis=1)\n", "test_y = test_df['y_target']\n", "\n", "print(f\"Shape of train_X: {train_X.shape}\")\n", "print(train_X.columns)\n", "\n", "print(f\"Shape of train_y: {train_y.shape}\")\n", "\n", "print(f\"Shape of test_X: {test_X.shape}\")\n", "print(test_X.columns)\n", "\n", "print(f\"Shape of test_y: {test_y.shape}\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "# Init classifier\n", "xgb_cl = xgb.XGBClassifier()\n", "\n", "# Fit\n", "xgb_cl.fit(train_X, train_y)\n", "\n", "# Predict\n", "preds = xgb_cl.predict(test_X)\n", "\n", "# Score\n", "accuracy_score(test_y, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check feature importance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the feature importance\n", "fig, ax = plt.subplots(figsize=(12,12))\n", "xgb.plot_importance(xgb_cl, max_num_features=50, height=0.8, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "c4198d3f-5fad-41b1-9bcf-3b2faef2287f", "_uuid": "1fddaea1621cba593f86d78851a26e1645f4954e" }, "source": [ "As shown above, a simple XGBoost model can be trained with these meta features alone." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "739ce74e-af56-4fd6-a7b3-fe809f33bc54", "_uuid": "25225b8d3a5ed46ec90308478b3788760eb5ea1d" }, "source": [ "### Text Based Features\n", "\n", "Now let's use `InstrForNxtAgt` to create some text based features.\n", "\n", "One of the basic features we could create is the tf-idf values of the words present in the text, so we can start with that one.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "dd89dcab-b7a2-4b11-9564-ac8558f938e0", "_uuid": "41b4430c7e9699bd7f430c471009082bf7449928" }, "outputs": [], "source": [ "### Fit transform the tfidf vectorizer ###\n", "tfidf_vectr = TfidfVectorizer(stop_words='english', ngram_range=(1,3))\n", "full_tfidf_vec = tfidf_vectr.fit_transform(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())\n", "train_tfidf_vec = tfidf_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())\n", "test_tfidf_vec = tfidf_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "fb46183a-43b6-4785-823a-f017a0821b78", "_uuid": "9c51efb2c3ff436baf8811a298b699b430b812bf" }, "source": [ "We now have the tfidf vector, but it is a sparse matrix, so if we want to use it with the other dense features we have to find a way to summarize the information it contains. Common approaches for this are: \n", "1. Take the top 'n' features (depending on the system config) from the tfidf vectorizer, convert them into dense format and concatenate them with the other features (a minimal sketch of this option follows below).\n", "1. Build a model using just the sparse features and then use its predictions as one of the features along with the other dense features. Simple models such as Naive Bayes can be used, or deep learning NLP models like word2vec or fasttext can be used to create word embeddings (word vectors) for the sentence.\n", "\n", "Based on the dataset, one might perform better than the other. Here we will use the second approach, with a Multinomial Naive Bayes model on the tfidf vector (new features) to predict the `Success` or `Failure` of a payment message. [Multinomial Naive Bayes models](https://scikit-learn.org/stable/modules/naive_bayes.html) are fast to train, and are commonly used in text classification problems."
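] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is a minimal, optional sketch of the first option: keep only a small number of tf-idf columns in dense form so they could be concatenated with the other dense features. The `max_features=10` value and the `tfidf_top_` column prefix are arbitrary choices for illustration; the resulting frame is only displayed and is not merged into `train_df`/`test_df`, so the rest of the notebook is unchanged." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## (Optional sketch) Option 1: keep a small number of tf-idf features in dense form.\n", "## max_features=10 and the 'tfidf_top_' prefix are arbitrary; nothing is merged back into train_df/test_df.\n", "topn_vectr = TfidfVectorizer(stop_words='english', ngram_range=(1,3), max_features=10)\n", "topn_vectr.fit(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())\n", "train_topn = topn_vectr.transform(train_df['InstrForNxtAgt'].values.tolist()).toarray()\n", "train_topn_df = pd.DataFrame(train_topn, index=train_df.index).add_prefix('tfidf_top_')\n", "# These dense columns could then be concatenated with the other features, e.g.:\n", "# pd.concat([train_df.drop(['y_target', 'InstrForNxtAgt'], axis=1), train_topn_df], axis=1)\n", "train_topn_df.head()"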
] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "a409c693-6d31-485d-b05f-d98008a1c185", "_uuid": "f3ef5dd66afde39552eaa8f59929b263b831b190" }, "source": [ "#### Naive Bayes on Word Tfidf Vectorizer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "61063815-507a-45f7-8866-5298da11e62d", "_uuid": "c0ed8394d5fbcfb833efa2836509583a2ec79de1" }, "outputs": [], "source": [ "from sklearn.model_selection import KFold, cross_val_score\n", "\n", "# Train Naive Bayes classifier\n", "model = naive_bayes.MultinomialNB()\n", "# Train dataset\n", "model.fit(train_tfidf_vec, train_y)\n", "\n", "# Get prediction for classes: Failure(0) or Success (1)\n", "# predicted_class = model.predict(test_word_count_vec)\n", "# print(predicted_class)\n", "\n", "kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=199)\n", "# Train CV scores\n", "train_cv_scores = cross_val_score(model, train_tfidf_vec, train_y, cv=kf)\n", "print(\"Mean Train CV score : \", train_cv_scores.mean())\n", "\n", "#print(train_tfidf_vec.shape)\n", "#print(train_tfidf_vec)\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "train_y_pred_proba = model.predict_proba(train_tfidf_vec)\n", "#print(f\"train_y_pred_proba type: {type(train_y_pred_proba)}\")\n", "#print(f\"Shape train_y_pred_proba: {train_y_pred_proba.shape}\")\n", "#print(f\"train_y_predictions:{train_y_pred_proba}\")\n", "train_df[[\"nb_tfidf_word_failure\", \"nb_tfidf_word_success\"]] = train_y_pred_proba\n", "\n", "# Test CV scores\n", "test_cv_scores = cross_val_score(model, test_tfidf_vec, test_y, cv=kf)\n", "print(\"Mean Test CV score : \", test_cv_scores.mean())\n", "\n", "full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.\n", "print(\"Full pred CV score : \", full_test_preds_cv_score)\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "test_y_pred_proba = model.predict_proba(test_tfidf_vec)\n", "#print(f\"test_y_pred_proba type: {type(test_y_pred_proba)}\")\n", "#print(f\"Shape test_y_pred_proba: {test_y_pred_proba.shape}\")\n", "test_df[[\"nb_tfidf_word_failure\", \"nb_tfidf_word_success\"]] = test_y_pred_proba\n", "\n", "train_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "586f6708-8cc1-4422-90dd-6e4bef3cf2c0", "_kg_hide-input": true, "_uuid": "2bd4a5be4f492f1eb75836a7fbaed8c4374701cf" }, "outputs": [], "source": [ "### Function to create confusion matrix ###\n", "import itertools\n", "from sklearn.metrics import confusion_matrix\n", "\n", "### From http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py #\n", "def plot_confusion_matrix(cm, classes,\n", " normalize=False,\n", " title='Confusion matrix',\n", " cmap=plt.cm.Blues):\n", " \"\"\"\n", " This function prints and plots the confusion matrix.\n", " Normalization can be applied by setting `normalize=True`.\n", " \"\"\"\n", " if normalize:\n", " cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n", " #print(\"Normalized confusion matrix\")\n", " #else:\n", " # print('Confusion matrix, without normalization')\n", "\n", " #print(cm)\n", "\n", " plt.imshow(cm, interpolation='nearest', cmap=cmap)\n", " plt.title(title)\n", " plt.colorbar()\n", " tick_marks = np.arange(len(classes))\n", " 
plt.xticks(tick_marks, classes, rotation=45)\n", " plt.yticks(tick_marks, classes)\n", "\n", " fmt = '.2f' if normalize else 'd'\n", " thresh = cm.max() / 2.\n", " for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n", " plt.text(j, i, format(cm[i, j], fmt),\n", " horizontalalignment=\"center\",\n", " color=\"white\" if cm[i, j] > thresh else \"black\")\n", "\n", " plt.tight_layout()\n", " plt.ylabel('True label')\n", " plt.xlabel('Predicted label')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "800547fa-c070-4bdf-ad1a-1f97e7db3fdf", "_kg_hide-input": true, "_uuid": "b0792565a816242c8e29a242264c5e77ec90a7eb" }, "outputs": [], "source": [ "# Predict using the TF-IDF Naive Bayes model\n", "pred_test_y = model.predict(test_tfidf_vec)\n", "\n", "# Confusion Matrix\n", "cnf_matrix = confusion_matrix(test_y, pred_test_y)\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix (class order follows sorted labels: 0 = Failure, 1 = Success)\n", "plt.figure(figsize=(8,8))\n", "plot_confusion_matrix(cnf_matrix, classes=['0', '1'],\n", " title='Confusion matrix, without normalization')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "404695a0-c33b-4cba-9dd1-bf2a5e8c12fc", "_uuid": "e16c6d867d6bf1bb4e2661cc28841c672091fe0a" }, "source": [ "It seems the tfidf features are useful in predicting `Success` or `Failure` outcomes.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "77309c44-f947-46f4-814c-7d7a01768293", "_uuid": "468ef624b3f6d752737b9c112ae6f0592a929da2" }, "source": [ "#### Naive Bayes on Word Count Vectorizer:\n", "\n", "We also train a Multinomial Naive Bayes model on word counts in the text and check if word counts help in predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "92ab9f7c-8c2c-4233-a6dc-eaa12ca7d144", "_uuid": "9cd43c20764c81a3216b8371c477ae5552e24c7d" }, "outputs": [], "source": [ "### Fit transform the count vectorizer ###\n", "word_count_vectr = CountVectorizer(stop_words='english', ngram_range=(1,3))\n", "word_count_vectr.fit(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())\n", "# Get word counts\n", "train_word_count_vec = word_count_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())\n", "test_word_count_vec = word_count_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "48d68a84-6c6d-40b3-9069-d4c324355fc6", "_uuid": "2e05fb42bcd850e74c767961f58400b0dac463ec" }, "source": [ "Now let us build a Multinomial NB model using count vectorizer based features."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "b7927348-b6b1-4960-b76c-ebf14fae37f0", "_uuid": "4f9354473820d59076bf991b5110fc4dd309c204" }, "outputs": [], "source": [ "# Train Naive Bayes classifier\n", "model = naive_bayes.MultinomialNB()\n", "# Train dataset\n", "model.fit(train_word_count_vec, train_y)\n", "\n", "# clf_results = model.predict(test_word_count_vec)\n", "# print(clf_results)\n", "\n", "kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=199)\n", "# Train CV scores\n", "train_cv_scores = cross_val_score(model, train_word_count_vec, train_y, cv=kf)\n", "print(train_cv_scores)\n", "print(\"Mean Train CV score : \", train_cv_scores.mean())\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "train_y_pred_proba = model.predict_proba(train_word_count_vec)\n", "train_df[[\"nb_wcvec_failure\", \"nb_wcvec_success\"]] = train_y_pred_proba\n", "\n", "# Test CV scores\n", "test_cv_scores = cross_val_score(model, test_word_count_vec, test_y, cv=kf)\n", "print(\"Mean Test CV score : \", test_cv_scores.mean())\n", "\n", "full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.\n", "print(\"Full pred CV score : \", full_test_preds_cv_score)\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "test_y_pred_proba = model.predict_proba(test_word_count_vec)\n", "test_df[[\"nb_wcvec_failure\", \"nb_wcvec_success\"]] = test_y_pred_proba\n", "\n", "train_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "5d412e2f-8988-4895-9434-33df93c263db", "_kg_hide-input": true, "_uuid": "75658bb392011311ffefbfc562483ac5e11a0cab" }, "outputs": [], "source": [ "# Predict using the word count vectorizer Naive Bayes model\n", "pred_test_y = model.predict(test_word_count_vec)\n", "\n", "cnf_matrix = confusion_matrix(test_y, pred_test_y)\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix\n", "plt.figure(figsize=(8,8))\n", "plot_confusion_matrix(cnf_matrix, classes=['0', '1'],\n", " title='Confusion matrix of NB on word count, without normalization')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The word count features do worse than the tfidf features but still help." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "5a5d31ea-278c-432a-9840-ca9209410ff7", "_uuid": "fb099e48fde5f3960bc2409d323749e855b2e269" }, "source": [ "#### Naive Bayes on Character Count Vectorizer:\n", "\n", "In the `InstrForNxtAgt` text, counting the characters might help. We use the count vectorizer at character level to get some features. Again we can run Multinomial NB on top of it."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "aed10880-e717-4a04-85b9-a77f8f58aee9", "_uuid": "d26ba4563285c10418479465e10de172f527f2a5" }, "outputs": [], "source": [ "### Fit transform the character count vectorizer ###\n", "cc_vectr = CountVectorizer(ngram_range=(1,7), analyzer='char')\n", "cc_vectr.fit(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())\n", "train_char_count_vec = cc_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())\n", "test_char_count_vec = cc_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())\n", "\n", "# Train NB model\n", "model = naive_bayes.MultinomialNB()\n", "model.fit(train_char_count_vec, train_y)\n", "\n", "kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=199)\n", "train_cv_scores = cross_val_score(model, train_char_count_vec, train_y, cv=kf)\n", "print(train_cv_scores)\n", "print(\"Mean Train CV score : \", train_cv_scores.mean())\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "train_y_pred_proba = model.predict_proba(train_char_count_vec)\n", "train_df[[\"nb_ccvec_failure\", \"nb_ccvec_success\"]] = train_y_pred_proba\n", "\n", "# Test dataset CV\n", "test_cv_scores = cross_val_score(model, test_char_count_vec, test_y, cv=kf)\n", "print(\"Mean Test CV score : \", test_cv_scores.mean())\n", "\n", "full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.\n", "print(\"Full pred CV score : \", full_test_preds_cv_score)\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "test_y_pred_proba = model.predict_proba(test_char_count_vec)\n", "test_df[[\"nb_ccvec_failure\", \"nb_ccvec_success\"]] = test_y_pred_proba\n", "\n", "train_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "55e95b9a-807b-4e99-bf95-069a11e07a64", "_uuid": "227e565a4a8a6aa04dcfb648688bf0deb0853910" }, "source": [ "#### Naive Bayes on Character Tfidf Vectorizer:\n", "\n", "As before, let's train a Multinomial Naive Bayes model and get predictions on the character tfidf vectorizer."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "a15097b8-f9c7-4b3b-815e-a8f4a5f2558e", "_uuid": "bfeaa62ae81e2af3fdb595422f08f560506c1f42" }, "outputs": [], "source": [ "### Fit transform the character tfidf vectorizer ###\n", "tfidf_vectr = TfidfVectorizer(ngram_range=(1,5), analyzer='char')\n", "full_tfidf_char_vec = tfidf_vectr.fit_transform(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())\n", "train_tfidf_char_vec = tfidf_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())\n", "test_tfidf_char_vec = tfidf_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())\n", "\n", "model = naive_bayes.MultinomialNB()\n", "model.fit(train_tfidf_char_vec, train_y)\n", "\n", "# Train CV Scores\n", "train_cv_scores = cross_val_score(model, train_tfidf_char_vec, train_y, cv=kf)\n", "print(\"Mean Train CV score : \", train_cv_scores.mean())\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "train_y_pred_proba = model.predict_proba(train_tfidf_char_vec)\n", "train_df[[\"nb_tfidf_char_failure\", \"nb_tfidf_char_success\"]] = train_y_pred_proba\n", "\n", "# Test dataset CV\n", "test_cv_scores = cross_val_score(model, test_tfidf_char_vec, test_y, cv=kf)\n", "print(\"Mean Test CV score : \", test_cv_scores.mean())\n", "\n", "full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.\n", "print(\"Full pred CV score : \", full_test_preds_cv_score)\n", "\n", "# Add the prediction probabilities for Failure or Success from text as new features\n", "test_y_pred_proba = model.predict_proba(test_tfidf_char_vec)\n", "test_df[[\"nb_tfidf_char_failure\", \"nb_tfidf_char_success\"]] = test_y_pred_proba\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "c807a434-c6d8-4a6c-a592-76c70bd043fe", "_uuid": "af6e2880a16f96304bd6556ed67d9eac5491a3bc" }, "source": [ "\n", "The cross-validation scores on the character tfidf vector are better than the scores on the word tfidf vector. This indicates that the characters in the text are helping, so we will use these features as well.\n", "\n", "Let's print the feature engineered dataframes so far." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_df" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "365abbc9-d630-4075-a2aa-c025d146d6aa", "_uuid": "0577ac7e16cd158944d1ee3c27ca92c588e2d10a" }, "source": [ "### Train an XGBoost Model with Feature Engineered Dataset\n", "\n", "We now have a feature engineered dataset in which the text feature `InstrForNxtAgt` is replaced with new features: the meta features, plus the prediction probabilities of Multinomial Naive Bayes models trained on the word count, character count, word TF-IDF and character TF-IDF vectors generated from the text feature. With these new features, let's train an XGBoost model and evaluate the results."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cols_to_drop = ['y_target', 'InstrForNxtAgt']\n", "train_X = train_df.drop(cols_to_drop, axis=1)\n", "test_X = test_df.drop(cols_to_drop, axis=1)\n", "\n", "train_X" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_X" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "# Init classifier\n", "xgb_cl = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=True, eval_metric='error')\n", "\n", "# Fit\n", "xgb_cl.fit(train_X, train_y)\n", "\n", "# Predict\n", "preds = xgb_cl.predict(test_X)\n", "\n", "# Score\n", "accuracy_score(test_y, preds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The new XGBoost model with the new features indeed performs better than the earlier model.\n", "\n", "#### Check Feature Importance\n", "Let's plot the feature importance provided by XGBoost to determine which features have a higher impact on predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "ed36de2b-e0bb-42e6-bed8-bbbdeb513ba5", "_uuid": "ddefa6111b588449df9f2abbc541c0ca051174ab" }, "outputs": [], "source": [ "### Plot the important variables ###\n", "fig, ax = plt.subplots(figsize=(12,12))\n", "xgb.plot_importance(xgb_cl, max_num_features=50, height=0.8, ax=ax)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "c4b258fe-ac27-4d5b-908c-eda43670fbf7", "_uuid": "c818e95a21559bdb4def3aed522b6616eb1a9de8" }, "source": [ "As before, `Creditor Postal Address` and `Debtor Postal Address` have the most impact in this dataset. But the newly added features, such as the Naive Bayes predictions on the character tfidf, character count, word tfidf and word count vectors, improve predictions. \n", "\n", "#### Evaluate Model\n", "Let's get the confusion matrix to evaluate this new model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "_cell_guid": "2f3aaa7b-396c-426b-89b3-1e66385946fd", "_uuid": "c195596cd7dff2de98bddec1453e30d8d5511169" }, "outputs": [], "source": [ "cnf_matrix = confusion_matrix(test_y, preds)\n", "np.set_printoptions(precision=2)\n", "\n", "# Plot non-normalized confusion matrix\n", "plt.figure(figsize=(8,8))\n", "plot_confusion_matrix(cnf_matrix, classes=['0', '1'],\n", " title='Confusion matrix of XGB, without normalization')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text feature engineering did help to make the model perform better." ] }, { "cell_type": "markdown", "metadata": { "_cell_guid": "855a4e2b-d377-4a91-bac9-4202119c8666", "_uuid": "5e40512bdaf05ffbd7a1914735c46784d7c33329" }, "source": [ "## Further improvements\n", "**Feature Engineering**\n", "* Using word embedding based features such as fasttext, blazingtext\n", "\n", "**Model Building & Training**\n", "* Parameter tuning for TFIDF and Count Vectorizers (a minimal sketch follows after this list).\n", "* Hyperparameter tuning for Multinomial Naive Bayes and XGBoost models.\n", "* Try out the SageMaker Linear Learner algorithm.\n", "* Try out Ensembling and Stacking with other models."
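] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a starting point for the tuning ideas above, the next cell is a minimal, optional sketch of a grid search over the word tfidf + Multinomial Naive Bayes step. The parameter grid values are arbitrary examples, and the result is only printed; nothing feeds back into the models trained above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## (Optional sketch) Grid search over the word tfidf + Multinomial Naive Bayes step.\n", "## The parameter grid is an arbitrary example; results are only printed.\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.pipeline import Pipeline\n", "\n", "nb_pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')), ('nb', naive_bayes.MultinomialNB())])\n", "param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)], 'nb__alpha': [0.1, 0.5, 1.0]}\n", "grid_search = GridSearchCV(nb_pipeline, param_grid, cv=5, scoring='accuracy')\n", "grid_search.fit(train_df['InstrForNxtAgt'].values.tolist(), train_y)\n", "print('Best params: ', grid_search.best_params_)\n", "print('Best CV accuracy: ', grid_search.best_score_)"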
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }