# Text Feature Engineering Notebook
 
This notebook demonstrates feature engineering for text features in a dataset. It runs locally to show the iterative process of feature engineering.

The objective is to accurately predict whether a pacs.008 XML payment message will be processed successfully without exception, resulting in a payment transfer from the debtor to the creditor. The target label is `Success` (1) or `Failure` (0).

This notebook shows how to create new features from a text feature and combine them with the other numerical or categorical features in the dataset. Along the way, we check whether the derived features help prediction.

We'll go through the following steps:
1. Select features from the full labeled dataset for training.
1. Perform some basic data analysis and visualization of the selected features.
1. Clean the data, for example by imputing missing values.
1. Dive deep into the feature engineering of a text feature.

Let's start.

## Install Python Packages

In [None]:
#!pip install nltk
!conda install nltk -y
#!pip install xgboost
!conda install xgboost -y

## Download Labeled Dataset

In [None]:
import os
import boto3
import sagemaker
from sagemaker import get_execution_role

sm_client = boto3.Session().client('sagemaker')
sm_session = sagemaker.Session()
region = boto3.session.Session().region_name

role = get_execution_role()
print("Notebook is running with assumed role {}".format(role))
print("Working with AWS services in the {} region".format(region))


In [None]:
# Working directory for the notebook
WORKDIR = os.getcwd()
BASENAME = os.path.dirname(WORKDIR)
print(f"WORKDIR: {WORKDIR}")
print(f"BASENAME: {BASENAME}")

# Create a directory storing local data
iso20022_data_path = 'iso20022-data'
if not os.path.exists(iso20022_data_path):
 # Create a new directory because it does not exist 
 os.makedirs(iso20022_data_path)

# Store all prototype assets in this bucket
s3_bucket_name = 'iso20022-prototype-t3'
s3_bucket_uri = 's3://' + s3_bucket_name

# Prefix for all files in this prototype
prefix = 'iso20022'

pacs008_prefix = prefix + '/pacs008'
raw_data_prefix = pacs008_prefix + '/raw-data'
labeled_data_prefix = pacs008_prefix + '/labeled-data'


labeled_data_location = s3_bucket_uri + '/' + labeled_data_prefix
print(f"Raw labeled data location = {labeled_data_location}")

# Download labeled raw dataset from S3
s3_client = boto3.client('s3')
s3_client.download_file(s3_bucket_name, labeled_data_prefix + '/labeled_data.csv', 'iso20022-data/labeled_data.csv')

## Select Features

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import string
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import ensemble, metrics, model_selection, naive_bayes

color = sns.color_palette()

%matplotlib inline

eng_stopwords = set(stopwords.words("english"))
pd.options.mode.chained_assignment = None

### Select Features from Labeled Raw Dataset

In [None]:
## Read the labeled dataset; the top few rows are checked after feature selection ##
labeled_raw_df = pd.read_csv("iso20022-data/labeled_data.csv")

fts=[
 'y_target', 
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForCdtrAgt_InstrInf',
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForCdtrAgt_Cd', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_Nm',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_PstCd', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_StrtNm',
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_TwnNm',
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_Nm',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_PstCd', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_StrtNm', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_TwnNm', 
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd', 
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry', 
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Nm',
# 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Tp', 
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
]

# New data frame with selected features
selected_df = labeled_raw_df[fts]
 
selected_df.head()

In [None]:
# Rename columns
selected_df = selected_df.rename(columns={
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf': 'InstrForNxtAgt',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry': 'Dbtr_PstlAdr_Ctry',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry': 'Cdtr_PstlAdr_Ctry',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd': 'RgltryRptg_DbtCdtRptgInd',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry': 'RgltryRptg_Authrty_Ctry',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd': 'RgltryRptg_Dtls_Cd'
})

selected_df.head()

### Split into Training and Test Datasets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(selected_df, selected_df['y_target'], test_size=0.20, random_state=299, shuffle=True)
train_df = X_train
test_df = X_test

print("Number of rows in train dataset : ",train_df.shape[0])
print("Number of rows in test dataset : ",test_df.shape[0])
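Because the goal is to predict `Success` versus `Failure`, it can help to keep the label ratio the same in the training and test partitions. The following is a minimal sketch of a stratified split using the `stratify` argument of `train_test_split`; the labels and the `X`/`y` arrays are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 successes (1) and 10 failures (0)
y = np.array([1] * 90 + [0] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=299, shuffle=True, stratify=y
)
print(f"train success rate: {y_tr.mean():.2f}, test success rate: {y_te.mean():.2f}")
```

Without stratification, a rare class can end up under-represented in the test partition purely by chance.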

In [None]:
train_df.head()

## Basic Data Analysis

We can count the occurrences of each debtor country to see whether the countries are evenly represented.

In [None]:
dbtr_cntry_count = train_df['Dbtr_PstlAdr_Ctry'].value_counts()

plt.figure(figsize=(8,4))
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.barplot(x=dbtr_cntry_count.index, y=dbtr_cntry_count.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Debtor Country Name', fontsize=12)
plt.show()

In [None]:
cdtr_cntry_count = train_df['Cdtr_PstlAdr_Ctry'].value_counts()

plt.figure(figsize=(8,4))
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.barplot(x=cdtr_cntry_count.index, y=cdtr_cntry_count.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Creditor Country Name', fontsize=12)
plt.show()

**Check missing values:**

In [None]:
train_df.isna().sum()

In [None]:
test_df.isna().sum()

**Let's do a quick check on a few samples of the text feature `InstrForNxtAgt` (Instructions for Next Agent) for each debtor and creditor country:**

In [None]:
grouped_df = train_df.groupby('Dbtr_PstlAdr_Ctry')
for name, group in grouped_df:
    print("Debtor Country name : ", name)
    cnt = 0
    # Print at most five sample texts per country
    for ind, row in group.iterrows():
        print(row["InstrForNxtAgt"])
        cnt += 1
        if cnt == 5:
            break
    print("\n")

In [None]:
grouped_df = train_df.groupby('Cdtr_PstlAdr_Ctry')
for name, group in grouped_df:
    print("Creditor Country name : ", name)
    cnt = 0
    # Print at most five sample texts per country
    for ind, row in group.iterrows():
        print(row["InstrForNxtAgt"])
        cnt += 1
        if cnt == 5:
            break
    print("\n")

The text data contains a few special characters, primarily to indicate a special service (`/SVC/`) or regulatory text (`/REG/`). Hence the count of these special characters might be a good feature.
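As a minimal sketch of that idea, the markers can be counted with a regular expression. The pattern assumes markers are upper-case code words wrapped in slashes, and the sample string and the `count_markers` helper are hypothetical, not part of the dataset or the notebook.

```python
import re

# Matches slash-delimited code words such as /SVC/ or /REG/ (assumed format)
MARKER_PATTERN = re.compile(r"/([A-Z]+)/")

def count_markers(text):
    """Return a dict mapping each marker code to its number of occurrences."""
    counts = {}
    for code in MARKER_PATTERN.findall(str(text)):
        counts[code] = counts.get(code, 0) + 1
    return counts

# Made-up example text
sample = "/SVC/Pay on arrival /REG/Report to authority /REG/Second note"
marker_counts = count_markers(sample)
print(marker_counts)
```

Per-marker counts like these could be added as separate numeric columns, alongside the overall punctuation count.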

## Feature Engineering

### Impute missing values

In [None]:
# Perform simple imputation for missing values
# Training dataset
# Assign the result back: fillna(inplace=True) on a selected column may act on a copy
train_df['InstrForNxtAgt'] = train_df['InstrForNxtAgt'].fillna('none')
train_df['RgltryRptg_DbtCdtRptgInd'] = train_df['RgltryRptg_DbtCdtRptgInd'].fillna('none')
train_df['RgltryRptg_Authrty_Ctry'] = train_df['RgltryRptg_Authrty_Ctry'].fillna('none')
train_df['RgltryRptg_Dtls_Cd'] = train_df['RgltryRptg_Dtls_Cd'].fillna('none')

train_df.isna().sum()


In [None]:
train_df

In [None]:
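# Test dataset
# Assign the result back: fillna(inplace=True) on a selected column may act on a copy
test_df['InstrForNxtAgt'] = test_df['InstrForNxtAgt'].fillna('none')
test_df['RgltryRptg_DbtCdtRptgInd'] = test_df['RgltryRptg_DbtCdtRptgInd'].fillna('none')
test_df['RgltryRptg_Authrty_Ctry'] = test_df['RgltryRptg_Authrty_Ctry'].fillna('none')
test_df['RgltryRptg_Dtls_Cd'] = test_df['RgltryRptg_Dtls_Cd'].fillna('none')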

test_df.isna().sum()

In [None]:
test_df

### Assign Datatypes 

**Pandas Datatypes**

In [None]:
# Categorical data transformation.

categorical_fts=[
# 'InstrForCdtrAgt_Cd',
 'Dbtr_PstlAdr_Ctry', 
 'Cdtr_PstlAdr_Ctry',
 'RgltryRptg_DbtCdtRptgInd', 
 'RgltryRptg_Authrty_Ctry', 
 'RgltryRptg_Dtls_Cd'
]
# Convert categorical features to the categorical data type and remember the
# training categories so the test dataset can reuse the same code mapping.
train_categories = {}
for col in categorical_fts:
    train_df[col] = pd.Categorical(train_df[col])
    train_categories[col] = train_df[col].cat.categories

# Convert from original feature values to integer category codes
for col in categorical_fts:
    train_df[col] = train_df[col].cat.codes

train_df

**Encode Target Label**

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
train_df['y_target'] = label_encoder.fit_transform(train_df['y_target'])

mapping = dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))
print(f"Label Mapping: {mapping}")

train_df

In [None]:
# Test dataset
# Reuse the categories learned from the training dataset so that a given value
# maps to the same code in both datasets; values unseen in training become -1.
for col in categorical_fts:
    test_df[col] = pd.Categorical(test_df[col], categories=train_categories[col]).codes

test_df

In [None]:
# Test dataset
# Reuse the encoder fitted on the training labels so both datasets share one mapping
test_df['y_target'] = label_encoder.transform(test_df['y_target'])
test_df

In [None]:
print(f"No. of train_df Columns: {len(train_df.columns)}")
print(train_df.columns)

print(f"No. of test_df Columns: {len(test_df.columns)}")
print(test_df.columns)

## Text Feature Engineering

There are two common approaches to feature engineering for text:

 1. Meta features: features based on statistics of the text, such as the number of words, mean word length, number of stop words, number of punctuation characters, number of upper-case words and so on.
 1. Text-based features: features derived directly from the content of the text, i.e. its words, using common techniques such as term frequency, tf-idf, or word embeddings (word2vec, fastText, BlazingText).

### Meta Features

The meta features are:
1. Number of words in the text
1. Number of unique words in the text
1. Mean word length
1. Number of characters in the text
1. Number of stopwords 
1. Number of punctuations
1. Number of upper case words

We use [nltk](https://www.nltk.org/) stop words to compute some of these meta features. For more info see the [nltk book](https://www.nltk.org/book/ch02.html).

In [None]:
## Number of words in the text ##
train_df["num_words"] = train_df["InstrForNxtAgt"].apply(lambda x: len(str(x).split()))
test_df["num_words"] = test_df["InstrForNxtAgt"].apply(lambda x: len(str(x).split()))

## Number of unique words in the text ##
train_df["num_unique_words"] = train_df["InstrForNxtAgt"].apply(lambda x: len(set(str(x).split())))
test_df["num_unique_words"] = test_df["InstrForNxtAgt"].apply(lambda x: len(set(str(x).split())))

## Number of characters in the text ##
train_df["num_chars"] = train_df["InstrForNxtAgt"].apply(lambda x: len(str(x)))
test_df["num_chars"] = test_df["InstrForNxtAgt"].apply(lambda x: len(str(x)))

## Number of stopwords in the text ##
train_df["num_stopwords"] = train_df["InstrForNxtAgt"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))
test_df["num_stopwords"] = test_df["InstrForNxtAgt"].apply(lambda x: len([w for w in str(x).lower().split() if w in eng_stopwords]))

## Number of punctuations in the text ##
train_df["num_punctuations"] =train_df['InstrForNxtAgt'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
test_df["num_punctuations"] =test_df['InstrForNxtAgt'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

## Average length of the words in the text ##
train_df["mean_word_len"] = train_df["InstrForNxtAgt"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["mean_word_len"] = test_df["InstrForNxtAgt"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

## Number of upper case words in the text ##
train_df["num_words_upper"] = train_df["InstrForNxtAgt"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df["num_words_upper"] = test_df["InstrForNxtAgt"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

print(f"No. of train_df Columns: {len(train_df.columns)}")
print(train_df.columns)

print(f"No. of test_df Columns: {len(test_df.columns)}")
print(test_df.columns)
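The seven meta features can also be gathered into a single helper that is applied once per text value. The following is a minimal sketch under stated assumptions: `text_meta_features` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for the nltk English list used in this notebook.

```python
import string

# Tiny illustrative stopword list (assumption); the notebook uses nltk's English list
DEMO_STOPWORDS = {"the", "to", "on", "for", "and", "of"}

def text_meta_features(text, stopwords=DEMO_STOPWORDS):
    """Compute the seven meta features for a single text value."""
    text = str(text)
    words = text.split()
    return {
        "num_words": len(words),
        "num_unique_words": len(set(words)),
        "num_chars": len(text),
        "num_stopwords": sum(w.lower() in stopwords for w in words),
        "num_punctuations": sum(c in string.punctuation for c in text),
        # Guard against empty text, where a plain mean would be undefined
        "mean_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "num_words_upper": sum(w.isupper() for w in words),
    }

# Made-up example text
features = text_meta_features("/REG/ Pay to the creditor")
print(features)
```

Keeping the logic in one function makes it easier to apply exactly the same transformation to the training and test datasets.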

Let us now plot a couple of these new features to see if they will be helpful in predictions. We use a [violin plot](https://en.wikipedia.org/wiki/Violin_plot).

In [None]:
train_df.loc[train_df['num_words'] > 8, 'num_words'] = 8 # truncation for better visuals
plt.figure(figsize=(12,8))
sns.violinplot(x='y_target', y='num_words', data=train_df)
plt.xlabel('y_target', fontsize=12)
plt.ylabel('Number of words in text', fontsize=12)
plt.title("Number of words by y_target", fontsize=15)
plt.show()

The number of words depends heavily on the text itself. It might have some benefit, but we are not sure yet.

In [None]:
train_df.loc[train_df['num_chars'] > 15, 'num_chars'] = 15 # truncation for better visuals
plt.figure(figsize=(12,8))
sns.violinplot(x='y_target', y='num_chars', data=train_df)
plt.xlabel('y_target', fontsize=12)
plt.ylabel('Number of characters in text', fontsize=12)
plt.title("Number of characters by y_target", fontsize=15)
plt.show()

We are not sure if this feature helps either.
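When a plot is inconclusive, a per-class summary can put a number on how far apart the two classes are. The following is a minimal sketch on made-up data; `toy_df` is a hypothetical stand-in for `train_df`, and the values are chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df (values are made up)
toy_df = pd.DataFrame({
    "y_target": [0, 0, 0, 1, 1, 1],
    "num_words": [2, 3, 4, 6, 7, 8],
})

# Per-class summary statistics of the candidate feature
summary = toy_df.groupby("y_target")["num_words"].agg(["mean", "median", "std"])

# Pearson correlation with a binary label (the point-biserial correlation)
corr = toy_df["num_words"].corr(toy_df["y_target"])

print(summary)
print(f"correlation with target: {corr:.2f}")
```

A correlation near zero together with similar per-class means would suggest the feature carries little signal on its own, although a tree model may still find it useful in combination with other features.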

Here we build an XGBoost model to check how much these meta features help.

**Drop** the `InstrForNxtAgt` column before training as it has been feature engineered.

In [None]:
## Prepare the data for modeling ###
#train_y = train_df['y_target']
#train_id = train_df['id'].values
#test_id = test_df['id'].values

### Recompute the truncated variables ###
train_df["num_words"] = train_df["InstrForNxtAgt"].apply(lambda x: len(str(x).split()))
test_df["num_words"] = test_df["InstrForNxtAgt"].apply(lambda x: len(str(x).split()))
train_df["num_chars"] = train_df["InstrForNxtAgt"].apply(lambda x: len(str(x)))
test_df["num_chars"] = test_df["InstrForNxtAgt"].apply(lambda x: len(str(x)))
train_df["mean_word_len"] = train_df["InstrForNxtAgt"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["mean_word_len"] = test_df["InstrForNxtAgt"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

# Drop the `InstrForNxtAgt` column before training as it has been feature engineered
cols_to_drop = ['y_target','InstrForNxtAgt']

train_X = train_df.drop(cols_to_drop, axis=1)
train_y = train_df['y_target']

test_X = test_df.drop(cols_to_drop, axis=1)
test_y = test_df['y_target']

print(f"Shape of train_X: {train_X.shape}")
print(train_X.columns)

print(f"Shape of train_y: {train_y.shape}")

print(f"Shape of test_X Columns: {test_X.shape}")
print(test_X.columns)

print(f"Shape of test_y: {test_y.shape}")


In [None]:
from sklearn.metrics import accuracy_score

# Init classifier
xgb_cl = xgb.XGBClassifier()

# Fit
xgb_cl.fit(train_X, train_y)

# Predict
preds = xgb_cl.predict(test_X)

# Score
accuracy_score(test_y, preds)

Check feature importance:

In [None]:
# Plot the feature importance
fig, ax = plt.subplots(figsize=(12,12))
xgb.plot_importance(xgb_cl, max_num_features=50, height=0.8, ax=ax)
plt.show()

We can train a simple XGBoost model with these meta features alone.
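An accuracy score is easier to interpret next to a trivial baseline. The following is a minimal sketch, assuming made-up imbalanced labels, that measures the accuracy of always predicting the most frequent class using scikit-learn's `DummyClassifier`; the arrays here are hypothetical and not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 80% success (1), 20% failure (0)
y_train = np.array([1] * 80 + [0] * 20)
y_test = np.array([1] * 16 + [0] * 4)

# The dummy model ignores the features, so zeros are enough here
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"Majority-class baseline accuracy: {baseline_acc:.2f}")
```

A model is only useful to the extent that it beats this number, so the gap to the baseline is a more informative figure than raw accuracy.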

### Text Based Features

Now let's use `InstrForNxtAgt` to create some text-based features.

One of the basic features we can create is the tf-idf value of each word present in the text, so we start with that.


In [None]:
### Fit transform the tfidf vectorizer ###
tfidf_vectr = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
full_tfidf_vec = tfidf_vectr.fit_transform(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())
train_tfidf_vec = tfidf_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())
test_tfidf_vec = tfidf_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())
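To make the tf-idf weighting concrete, the following minimal sketch computes it by hand in plain Python. It uses the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the variant scikit-learn documents as the `TfidfVectorizer` default; the mini-corpus is made up, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Hypothetical mini-corpus
docs = [
    "pay creditor today",
    "pay debtor",
    "report creditor",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    term_counts = Counter(tokens)
    weights = {t: c * smoothed_idf(t) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first_doc = tfidf_vector(tokenized[0])
print(first_doc)
```

In the first document, `today` receives a higher weight than `pay` or `creditor` because it appears in fewer documents of the corpus.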

The tf-idf output is a sparse matrix, so to use it together with the other dense features we have to find a way to summarize the information it holds. Common approaches are:
1. Take the top 'n' features (depending on the system configuration) from the tf-idf vectorizer, convert them into dense format and concatenate them with the other features.
1. Build a model using just the sparse features and then use its predictions as features alongside the other dense features. Simple models such as Naive Bayes can be used, or NLP models such as word2vec or fastText can be used to create word embeddings (word vectors) for the sentence.
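The first approach can be sketched as follows, assuming made-up texts and a single hypothetical dense column. `max_features` restricts the vocabulary to the terms with the highest frequency across the corpus, which keeps the dense conversion small.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts and an existing dense feature
texts = ["/REG/ report to authority", "/SVC/ urgent payment", "report payment today"]
dense_df = pd.DataFrame({"num_words": [4, 3, 3]})

# Keep only the n most frequent terms in the corpus
top_n = 4
vectorizer = TfidfVectorizer(max_features=top_n)
tfidf_sparse = vectorizer.fit_transform(texts)

# vocabulary_ maps each kept term to its column index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Convert the (now small) sparse matrix to a dense frame and join the other features
tfidf_df = pd.DataFrame(
    tfidf_sparse.toarray(),
    columns=[f"tfidf_{term}" for term in terms],
    index=dense_df.index,
)
combined_df = pd.concat([dense_df, tfidf_df], axis=1)
print(combined_df.shape)
```

The trade-off is that every term outside the top `n` is discarded, which is one reason to prefer the model-based summary when rare terms matter.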

Based on the dataset, one might perform better than the other. Here we will use the second approach but with a Multinomial Naive Bayes model on tfidf vector (new features) to predict `Success` or `Failure` of payment message. [Multinomial Naive Bayes models](https://scikit-learn.org/stable/modules/naive_bayes.html) are fast to train, and are commonly used in text classification problems.
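One caveat when using model predictions as features: probabilities predicted for rows the model was trained on tend to look more confident than predictions for unseen rows. A common remedy is to use out-of-fold predictions for the training rows. The following is a minimal sketch with scikit-learn's `cross_val_predict` on made-up texts and labels; it illustrates the technique and is not the code used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical texts and labels (1 = success, 0 = failure)
texts = [
    "/REG/ report to authority", "/SVC/ urgent payment", "/REG/ missing details",
    "/SVC/ standard payment", "/REG/ report missing", "/SVC/ payment today",
    "/REG/ authority details", "/SVC/ urgent today",
]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# The pipeline refits the vectorizer inside each fold, so held-out rows
# never influence the vocabulary or the Naive Bayes counts
nb_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Each row's probabilities come from a model that did not see that row
oof_proba = cross_val_predict(nb_pipeline, texts, labels, cv=cv, method="predict_proba")
print(oof_proba.shape, np.allclose(oof_proba.sum(axis=1), 1.0))
```

The two columns of `oof_proba` correspond to the failure and success probabilities and could be attached to the training frame in place of in-sample predictions. For the test dataset, predictions from a model fitted on the full training set are appropriate.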

#### Naive Bayes on Word Tfidf Vectorizer:

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# Train Naive Bayes classifier
model = naive_bayes.MultinomialNB()
# Train dataset
model.fit(train_tfidf_vec, train_y)

# Get prediction for classes: Failure(0) or Success (1)
# predicted_class = model.predict(test_word_count_vec)
# print(predicted_class)

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=199)
# Train CV scores
train_cv_scores = cross_val_score(model, train_tfidf_vec, train_y, cv=kf)
print("Mean Train CV score : ", train_cv_scores.mean())

#print(train_tfidf_vec.shape)
#print(train_tfidf_vec)

# Add the prediction probabilities for Failure or Success from text as new features
train_y_pred_proba = model.predict_proba(train_tfidf_vec)
#print(f"train_y_pred_proba type: {type(train_y_pred_proba)}")
#print(f"Shape train_y_pred_proba: {train_y_pred_proba.shape}")
#print(f"train_y_predictions:{train_y_pred_proba}")
train_df[["nb_tfidf_word_failure", "nb_tfidf_word_success"]] = train_y_pred_proba

# Test CV scores
test_cv_scores = cross_val_score(model, test_tfidf_vec, test_y, cv=kf)
print("Mean Test CV score : ", test_cv_scores.mean())

full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.
print("Full pred CV score : ", full_test_preds_cv_score)

# Add the prediction probabilities for Failure or Success from text as new features
test_y_pred_proba = model.predict_proba(test_tfidf_vec)
#print(f"test_y_pred_proba type: {type(test_y_pred_proba)}")
#print(f"Shape test_y_pred_proba: {test_y_pred_proba.shape}")
test_df[["nb_tfidf_word_failure", "nb_tfidf_word_success"]] = test_y_pred_proba

train_df

In [None]:
test_df

In [None]:
### Function to create confusion matrix ###
import itertools
from sklearn.metrics import confusion_matrix

### From http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py #
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        #print("Normalized confusion matrix")
    #else:
    #    print('Confusion matrix, without normalization')

    #print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Predict using the TF-IDF Naive Bayes model
pred_test_y = model.predict(test_tfidf_vec)

# Confusion Matrix
cnf_matrix = confusion_matrix(test_y, pred_test_y)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure(figsize=(8,8))
# confusion_matrix orders labels ascending: 0 (Failure) first, then 1 (Success)
plot_confusion_matrix(cnf_matrix, classes=['0', '1'],
 title='Confusion matrix, without normalization')
plt.show()


It seems tfidf features are useful in predicting `Success` or `Failure` outcomes.
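The counts in a confusion matrix can be condensed into precision, recall and F1. The following is a minimal sketch on a made-up 2x2 matrix in scikit-learn's layout, where rows are true labels and columns are predicted labels, both ordered `[0, 1]`.

```python
# Hypothetical confusion matrix: rows are true labels, columns are predicted labels
cm = [[30, 10],
      [5, 55]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # of predicted successes, how many were right
recall = tp / (tp + fn)      # of actual successes, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Here class 1 (`Success`) is treated as the positive class; swapping the roles gives the corresponding figures for `Failure`.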


In [None]:
train_df.shape

In [None]:
train_df.head()

#### Naive Bayes on Word Count Vectorizer:

We also train a Multinomial Naive Bayes model on word counts in the text and check whether word counts help predictions.

In [None]:
### Fit transform the count vectorizer ###
word_count_vectr = CountVectorizer(stop_words='english', ngram_range=(1,3))
word_count_vectr.fit(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())
# Get word counts
train_word_count_vec = word_count_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())
test_word_count_vec = word_count_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())

Now let us build a Multinomial NB model using the count vectorizer based features.

In [None]:
# Train Naive Bayes classifier
model = naive_bayes.MultinomialNB()
# Train dataset
model.fit(train_word_count_vec, train_y)

# clf_results = model.predict(test_word_count_vec)
# print(clf_results)

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=199)
# Train CV scores
train_cv_scores = cross_val_score(model, train_word_count_vec, train_y, cv=kf)
print(train_cv_scores)
print("Mean Train cv score : ", train_cv_scores.mean())

# Add the prediction probabilities for Failure or Success from text as new features
train_y_pred_proba = model.predict_proba(train_word_count_vec)
train_df[["nb_wcvec_failure", "nb_wcvec_success"]] = train_y_pred_proba

# Test CV scores
test_cv_scores = cross_val_score(model, test_word_count_vec, test_y, cv=kf)
print("Mean Test CV score : ", test_cv_scores.mean())

full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.
print("Full pred CV score : ", full_test_preds_cv_score)

# Add the prediction probabilities for Failure or Success from text as new features
test_y_pred_proba = model.predict_proba(test_word_count_vec)
test_df[["nb_wcvec_failure", "nb_wcvec_success"]] = test_y_pred_proba

train_df

In [None]:
test_df

In [None]:
# predict using Word Count Vectorizer Naive Bayes Model
pred_test_y = model.predict(test_word_count_vec)

cnf_matrix = confusion_matrix(test_y, pred_test_y)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure(figsize=(8,8))
# confusion_matrix orders labels ascending: 0 (Failure) first, then 1 (Success)
plot_confusion_matrix(cnf_matrix, classes=['0', '1'],
 title='Confusion matrix of NB on word count, without normalization')
plt.show()

Word count feature does worse than tfidf feature but still helps.

#### Naive Bayes on Character Count Vectorizer:

In the `InstrForNxtAgt` text, counting the characters might help. We use the count vectorizer at character level to get some features. Again we can run Multinomial NB on top of it.
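To illustrate what a character-level vectorizer counts, the following minimal sketch lists the character n-grams of a short marker string. `char_ngrams` is a hypothetical helper written for this illustration; unlike `CountVectorizer`, it does not lowercase the text or normalize whitespace.

```python
def char_ngrams(text, min_n, max_n):
    """Return all character n-grams of length min_n..max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

grams = char_ngrams("/REG/", 1, 3)
print(grams)
```

Because n-grams such as `/RE` and `EG/` include the slashes, character-level features can pick up the special markers that a word-level tokenizer would strip.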

In [None]:
### Fit the character-level count vectorizer ###
cc_vectr = CountVectorizer(ngram_range=(1,7), analyzer='char')
cc_vectr.fit(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())
train_char_count_vec = cc_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())
test_char_count_vec = cc_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())

# Train NB model
model = naive_bayes.MultinomialNB()
model.fit(train_char_count_vec, train_y)

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=199)
train_cv_scores = cross_val_score(model, train_char_count_vec, train_y, cv=kf)
print(train_cv_scores)
print("Mean Train cv score : ", train_cv_scores.mean())

# Add the prediction probabilities for Failure or Success from text as new features
train_y_pred_proba = model.predict_proba(train_char_count_vec)
train_df[["nb_ccvec_failure", "nb_ccvec_success"]] = train_y_pred_proba

# Test dataset CV
test_cv_scores = cross_val_score(model, test_char_count_vec, test_y, cv=kf)
print("Mean Test cv score : ", test_cv_scores.mean())

full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.
print("Full pred CV score : ", full_test_preds_cv_score)

# Add the prediction probabilities for Failure or Success from text as new features
test_y_pred_proba = model.predict_proba(test_char_count_vec)
test_df[["nb_ccvec_failure", "nb_ccvec_success"]] = test_y_pred_proba

train_df

In [None]:
test_df

#### Naive Bayes on Character Tfidf Vectorizer:

As before, let's train a multinomial naive bayes model and get predictions on the character tfidf vectorizer.

In [None]:
### Fit transform the tfidf vectorizer ###
tfidf_vectr = TfidfVectorizer(ngram_range=(1,5), analyzer='char')
full_tfidf_char_vec = tfidf_vectr.fit_transform(train_df['InstrForNxtAgt'].values.tolist() + test_df['InstrForNxtAgt'].values.tolist())
train_tfidf_char_vec = tfidf_vectr.transform(train_df['InstrForNxtAgt'].values.tolist())
test_tfidf_char_vec = tfidf_vectr.transform(test_df['InstrForNxtAgt'].values.tolist())

model = naive_bayes.MultinomialNB()
model.fit(train_tfidf_char_vec, train_y)

# Train CV Scores
train_cv_scores = cross_val_score(model, train_tfidf_char_vec, train_y, cv=kf)
print("Mean Train cv score : ", train_cv_scores.mean())

# Add the prediction probabilities for Failure or Success from text as new features
train_y_pred_proba = model.predict_proba(train_tfidf_char_vec)
train_df[["nb_tfidf_char_failure", "nb_tfidf_char_success"]] = train_y_pred_proba

# Test dataset CV
test_cv_scores = cross_val_score(model, test_tfidf_char_vec, test_y, cv=kf)
print("Mean Test cv score : ", test_cv_scores.mean())

full_test_preds_cv_score = (train_cv_scores.mean() + test_cv_scores.mean())/2.
print("Full pred CV score : ", full_test_preds_cv_score)

# Add the prediction probabilities for Failure or Success from text as new features
test_y_pred_proba = model.predict_proba(test_tfidf_char_vec)
test_df[["nb_tfidf_char_failure", "nb_tfidf_char_success"]] = test_y_pred_proba




The cross-validation scores on the character tfidf vector are better than the scores on the word tfidf vector. This indicates that characters in the text are helping, so we will use these features as well.

Let's print the feature engineered dataframe so far.

In [None]:
train_df

In [None]:
test_df

### Train an XGBoost Model with Feature Engineered Dataset

Now we have a feature engineered dataset in which the text feature `InstrForNxtAgt` is replaced by new features derived with text feature engineering techniques: meta features, and probabilities from Multinomial Naive Bayes models trained on word and character count and TF-IDF vectors generated from the text feature. With these new features, let's train an XGBoost model and evaluate the results.

In [None]:
cols_to_drop = ['y_target', 'InstrForNxtAgt']
train_X = train_df.drop(cols_to_drop, axis=1)
test_X = test_df.drop(cols_to_drop, axis=1)

train_X

In [None]:
test_X

In [None]:
from sklearn.metrics import accuracy_score

# Init classifier
# Labels are already integer-encoded, so the deprecated use_label_encoder option is omitted
xgb_cl = xgb.XGBClassifier(objective='binary:logistic', eval_metric='error')

# Fit
xgb_cl.fit(train_X, train_y)

# Predict
preds = xgb_cl.predict(test_X)

# Score
accuracy_score(test_y, preds)

The new XGBoost model with new features indeed performs better than earlier model.

#### Check Feature Importance
Let's plot feature importance provided by XGBoost to determine which features have higher impact on predictions.

In [None]:
### Plot the important variables ###
fig, ax = plt.subplots(figsize=(12,12))
xgb.plot_importance(xgb_cl, max_num_features=50, height=0.8, ax=ax)
plt.show()

As before, `Creditor Postal Address` and `Debtor Postal Address` have the most impact in this dataset, but the newly added features, such as the Naive Bayes predictions on character tfidf, character count, word tfidf and word counts, also improve predictions.

#### Evaluate Model
Let's get the confusion matrix to evaluate this new model.

In [None]:
cnf_matrix = confusion_matrix(test_y, preds)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure(figsize=(8,8))
# confusion_matrix orders labels ascending: 0 (Failure) first, then 1 (Success)
plot_confusion_matrix(cnf_matrix, classes=['0', '1'],
 title='Confusion matrix of XGB, without normalization')
plt.show()

The text feature engineering did help to make the model perform better.

## Further improvements
**Feature Engineering**
* Use word-embedding-based features such as fastText or BlazingText.

**Model Building & Training**
* Parameter tuning for TFIDF and Count Vectorizers.
* Hyperparameter tuning for Multinomial Naive Bayes and XGBoost models.
* Try out SageMaker Linear Learner algorithm.
* Try out Ensembling and Stacking with other models.
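
As a starting point for the tuning items above, the vectorizer and the Naive Bayes model can be searched jointly by wrapping them in a pipeline. The following is a minimal sketch using scikit-learn's `GridSearchCV` on made-up texts and labels; the grid values are illustrative assumptions, not recommendations for this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical texts and labels (1 = success, 0 = failure)
texts = [
    "/REG/ report to authority", "/SVC/ urgent payment", "/REG/ missing details",
    "/SVC/ standard payment", "/REG/ report missing", "/SVC/ payment today",
    "/REG/ authority details", "/SVC/ urgent today",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char")),
    ("nb", MultinomialNB()),
])

# Small illustrative grid; real ranges would depend on the dataset
param_grid = {
    "tfidf__ngram_range": [(1, 3), (1, 5)],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Searching over the whole pipeline means the vectorizer is refitted inside each fold, so the cross-validation score reflects the combined choice of n-gram range and smoothing.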