# Responsible AI - Final Project

Build a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).

### Final Project Problem: Loan Approval

__Problem Definition:__
Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan will be repaid or not (that is, whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\geq$ 25 yrs and $<$ 25 yrs).

In the banking industry, regulations govern the use of sensitive features (e.g., age, ethnicity, marital status, ...). Under those regulations, age must not play a significant role in the model: loans should be approved or denied regardless of an individual's age.

For example, certain laws declare it unlawful for creditors to discriminate against any applicant on the basis of age (or other sensitive attributes). For more details, have a look at this paper:

``` 
F. Kamiran and T. Calders, "Data Preprocessing Techniques for Classification without Discrimination," Knowledge and Information Systems, 2012
```

__Table of contents__

1. Read the datasets (Given) 
2. Data Processing (Implement)
 * Exploratory Data Analysis
 * Select features to build the model (Suggested)
 * Train - Validation - Test Datasets
 * Data Processing with Pipeline
3. Train (and Tune) a Classifier on the Training Dataset (Implement)
4. Make Predictions on the Test Dataset (Implement)
5. Evaluate Results (Given)


__Datasets and Files__


- ```german_credit_training.csv```: Training data with loan applicant features such as credit history, dependents, savings, account status, and age group (and more). The label is __credit_risk__.

- ```german_credit_test.csv```: Test data with the same features as above but without the label. Predictions are made on this data to emulate a production environment.


This notebook assumes an installation of the SageMaker kernel `conda_pytorch_p39`. In addition, libraries from a requirements.txt need to be installed:

In [None]:
!pip install --no-deps -U -q -r ../../requirements.txt

In [None]:
# Reshaping/basic libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Operational libraries
import sys

sys.path.append("..")

# Jupyter(lab) libraries
import warnings

warnings.filterwarnings("ignore")

## 1. Read the datasets (Given)

First, we read the __training__ and __test__ datasets into dataframes using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data.

In [None]:
training_data = pd.read_csv("../../data/final_project/german_credit_training.csv")
test_data = pd.read_csv("../../data/final_project/german_credit_test.csv")

print("The shape of the training dataset is:", training_data.shape)
print("The shape of the test dataset is:", test_data.shape)

## 2. Data Processing (Implement)

### 2.1 Exploratory Data Analysis

We look at the number of rows and columns and at some simple statistics of the dataset.

In [None]:
training_data.head()

In [None]:
test_data.head()

In [None]:
# Implement more EDA here
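
As a starting point, a few common checks are sketched below. This is only an illustration under stated assumptions: `toy_df` is a small made-up DataFrame that reuses some column names from this notebook, and in practice you would run the same calls on `training_data`. The sketch looks at the shape, missing values, summary statistics, the target distribution, and the target rate per age group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for training_data (all values are made up)
toy_df = pd.DataFrame(
    {
        "credit_amount": [1200, 5400, np.nan, 800, 3100, 2500],
        "credit_duration_months": [12, 48, 24, 6, 36, 18],
        "age_groups": [0, 1, 1, 0, 1, 1],
        "credit_risk": [1, 0, 0, 0, 1, 0],
    }
)

# Number of rows and columns
print("Shape:", toy_df.shape)

# Missing values per column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Summary statistics of the numerical columns
print(toy_df.describe())

# Target distribution as fractions (shows whether the classes are skewed)
target_share = toy_df["credit_risk"].value_counts(normalize=True)
print(target_share)

# Mean of the label within each age group (a first look at group differences)
risk_by_age = toy_df.groupby("age_groups")["credit_risk"].mean()
print(risk_by_age)
```

Comparing the label rate across `age_groups` early on gives a baseline for the fairness evaluation at the end of the notebook.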

### 2.2 Select features to build the model 

For a quick start, we recommend using only a few of the numerical and categorical features. However, feel free to explore other fields. In this case, we do not need to cast our features to numerical/object types. Be mindful of some of the feature names: they suggest numerical values, but upon inspection it becomes clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).


In [None]:
# Grab model features/inputs and target/output
categorical_features = ["job_status", "employed_since_years", "savings", "age_groups"]

numerical_features = ["credit_amount", "credit_duration_months"]

Separate features and the model target.

In [None]:
model_target = "credit_risk"
model_features = categorical_features + numerical_features

print("Model features: ", model_features)
print("Model target: ", model_target)

### 2.3 Train - Validation Datasets

We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. 

To produce a validation set for evaluating model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. The validation data created here is used later in section 3 to tune your classifier.
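
One possible way to do the split is sketched below on made-up data. The 80/20 ratio, the `random_state` value, and the use of `stratify` are assumptions chosen for illustration, and `toy_df` is a hypothetical stand-in for `training_data`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for training_data (all values are made up)
toy_df = pd.DataFrame(
    {
        "credit_amount": list(range(100, 2100, 100)),  # 20 rows
        "credit_risk": [0, 0, 0, 1] * 5,  # 15 zeros and 5 ones
    }
)

# Hold out 20% for validation; stratify keeps the label ratio the same in both subsets
train_df, val_df = train_test_split(
    toy_df,
    test_size=0.2,
    shuffle=True,
    stratify=toy_df["credit_risk"],
    random_state=23,
)

print("Train shape:", train_df.shape)
print("Validation shape:", val_df.shape)
```

Stratifying on the label is helpful when the target is skewed, because a purely random split of a small dataset can leave the validation set with very few examples of the minority class.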

In [None]:
# Implement here

### 2.4 Data processing with Pipeline

Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values, scale the numerical features, and encode the categorical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the processed dataset.
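
A minimal sketch of such a pipeline follows. It is one possible layout, not the reference solution: the step names, the imputation strategies, the toy values, and the integer category codes are all hypothetical. The one-hot encoding step is included on the assumption that categorical features have to be turned into numbers before logistic regression can use them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numerical_features = ["credit_amount", "credit_duration_months"]
categorical_features = ["savings", "age_groups"]
features = numerical_features + categorical_features

# Hypothetical toy data (made up); categories are stored as made-up integer codes
toy_train = pd.DataFrame(
    {
        "credit_amount": [1200, 5400, np.nan, 800, 3100, 2500, 4700, 900],
        "credit_duration_months": [12, 48, 24, 6, 36, 18, 42, 9],
        "savings": [1, 3, 1, np.nan, 3, 1, 3, 1],
        "age_groups": [0, 1, 1, 0, 1, 1, 0, 1],
        "credit_risk": [0, 1, 0, 0, 1, 0, 1, 0],
    }
)

# Numerical columns: fill missing values with the mean, then scale to [0, 1]
numerical_processor = Pipeline(
    [
        ("num_imputer", SimpleImputer(strategy="mean")),
        ("num_scaler", MinMaxScaler()),
    ]
)

# Categorical columns: fill missing values with the mode, then one-hot encode
categorical_processor = Pipeline(
    [
        ("cat_imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Route each group of columns to its own processor
data_processor = ColumnTransformer(
    [
        ("numerical_processing", numerical_processor, numerical_features),
        ("categorical_processing", categorical_processor, categorical_features),
    ]
)

# Chain the preprocessing and the classifier into a single estimator
pipeline = Pipeline(
    [
        ("data_processing", data_processor),
        ("lg", LogisticRegression(max_iter=1000)),
    ]
)

pipeline.fit(toy_train[features], toy_train["credit_risk"])
train_predictions = pipeline.predict(toy_train[features])
print(train_predictions)
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training subset only and are then reused unchanged on validation and test data.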


In [None]:
# Implement here

## 3. Train (and Tune) a Classifier (Implement)

Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies and different scaling methods, and compare the results on the validation set.
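
The tuning loop can be as simple as the sketch below, which tries each combination of imputation strategy and scaler and scores it on a held-out validation set. Everything here is an assumption for illustration: the data is synthetic (generated with `make_classification`, with missing values injected at random), the step names are hypothetical, and F1 is just one reasonable choice of selection metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical synthetic data standing in for the numerical features (made up)
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject roughly 5% missing values

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline(
    [
        ("imputer", SimpleImputer()),
        ("scaler", MinMaxScaler()),
        ("lg", LogisticRegression(max_iter=1000)),
    ]
)

# Try every combination and record the validation F1 score
results = []
for strategy in ["mean", "median", "most_frequent"]:
    for scaler in [MinMaxScaler(), StandardScaler()]:
        # set_params swaps the imputation strategy and replaces the scaler step
        pipeline.set_params(imputer__strategy=strategy, scaler=scaler)
        pipeline.fit(X_train, y_train)
        val_f1 = f1_score(y_val, pipeline.predict(X_val))
        results.append((strategy, type(scaler).__name__, val_f1))

best_strategy, best_scaler, best_f1 = max(results, key=lambda r: r[2])
print(best_strategy, best_scaler, round(best_f1, 3))
```

Because this project also has a fairness constraint, it can be worth recording a fairness metric for each configuration alongside the validation score rather than selecting on performance alone.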

In [None]:
# Implement here

## 4. Make Predictions on the Test Dataset (Implement)

Use the trained classifier to predict the labels on the test set. Section 5 provides a code snippet that evaluates these predictions for disparate impact (DI).
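
The prediction step might look like the sketch below. The toy frames, `toy_pipeline`, and the two-feature model are hypothetical stand-ins; in the notebook you would call `predict` on your own fitted pipeline with `test_data[model_features]`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

model_features = ["credit_amount", "credit_duration_months"]

# Hypothetical toy data (made up); the toy test set has an ID but no label
toy_train = pd.DataFrame(
    {
        "credit_amount": [1200, 5400, 800, 3100, 2500, 4700],
        "credit_duration_months": [12, 48, 6, 36, 18, 42],
        "credit_risk": [0, 1, 0, 1, 0, 1],
    }
)
toy_test = pd.DataFrame(
    {
        "ID": [101, 102, 103],
        "credit_amount": [1000, 5000, 2600],
        "credit_duration_months": [10, 45, 20],
    }
)

# Fit on the training data only
toy_pipeline = Pipeline([("scaler", MinMaxScaler()), ("lg", LogisticRegression())])
toy_pipeline.fit(toy_train[model_features], toy_train["credit_risk"])

# Predict using exactly the columns the model was trained on
test_predictions = toy_pipeline.predict(toy_test[model_features])

# Predictions come back in row order, so they line up with the ID column
print(dict(zip(toy_test["ID"], test_predictions)))
```

The test data is never used for fitting; the pipeline only applies the transformations it learned from the training data.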

In [None]:
# Implement here

# Get test data to test the classifier
# ! test data should come from german_credit_test.csv !
# ...

# Use the trained model to make predictions on the test dataset
# test_predictions = ...

## 5. Evaluate Results (Given)

In [None]:
result_df = pd.DataFrame(columns=["ID", "credit_risk_pred"])
result_df["ID"] = test_data["ID"].tolist()
result_df["credit_risk_pred"] = test_predictions

result_df.to_csv("../../data/final_project/project_day1_result.csv", index=False)

### Final Evaluation on Test Data - Disparate Impact
To evaluate the fairness of the model predictions, we calculate the disparate impact (DI) metric. For more details about DI, see the [Amazon SageMaker Clarify documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html).
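
Written out for the two age groups used here, the quantity that `calculate_di` computes in the cell below can be expressed as follows. The symbols are introduced only for illustration: $n_{g}$ is the number of test applicants in group $g$, and $n^{0}_{g}$ is the number of them predicted as 0 (not a credit risk), which is the favorable outcome.

$$
\mathrm{DI} = \frac{n^{0}_{<25} \,/\, n_{<25}}{n^{0}_{\geq 25} \,/\, n_{\geq 25}}
$$

A value of 1 means both age groups receive favorable predictions at the same rate, while a value below 1 means applicants under 25 receive them less often than applicants aged 25 or older. As a rough guide, values below 0.8 are often treated as a warning sign (the so-called four-fifths rule), although the appropriate threshold depends on the context.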

In [None]:
def calculate_di(test_data, pred_df, pred_col="credit_risk_pred"):
    """
    Calculate the Disparate Impact (DI) metric using the results from this notebook.
    Returns None if the inputs are invalid or only one age group is present.
    """
    try:
        # Merge predictions with original test data to compute outcomes per group
        di_df = pred_df.merge(test_data, on="ID")
        # Group with members less than 25y old
        less25 = di_df[di_df["age_groups"] == 0]
        # Group with members greater than or equal to 25y old
        geq25 = di_df[di_df["age_groups"] == 1]
        # Check that both groups are present
        if len(less25) == 0 or len(geq25) == 0:
            print("There is only one group present in the data.")
            return None
        # Favorable outcome is a prediction of 0 ('0 credit risk' == 'not credit risk').
        # Counting with a boolean mask also works when a group has no 0 predictions.
        pos_rate_less25 = (less25[pred_col] == 0).sum() / len(less25)
        pos_rate_geq25 = (geq25[pred_col] == 0).sum() / len(geq25)
        return pos_rate_less25 / pos_rate_geq25
    except (KeyError, TypeError, ValueError, AttributeError):
        print("Wrong inputs provided.")
        return None

In [None]:
calculate_di(test_data, result_df, "credit_risk_pred")

### Final Evaluation on Test Data - Accuracy & F1 Score
In addition to the fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed, so we use the F1 score in addition to accuracy.
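
The small example below illustrates why accuracy alone can mislead on a skewed target. The labels are made up: a trivial "model" that always predicts the majority class still reaches a high accuracy, while its F1 score for the minority class is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels (made up): 8 of 10 samples belong to class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A trivial "model" that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 without a warning when no positives are predicted
f1 = f1_score(y_true, y_pred, zero_division=0)

print("Accuracy:", acc)  # 0.8
print("F1 score:", f1)  # 0.0
```

Note that `f1_score` treats label 1 as the positive class by default, so here it measures how well the minority class is detected.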

In [None]:
accuracy_score(
    pd.read_csv("../../data/final_project/german_credit_test_labels.csv")[
        "credit_risk"
    ],
    result_df["credit_risk_pred"],
)

In [None]:
f1_score(
    pd.read_csv("../../data/final_project/german_credit_test_labels.csv")[
        "credit_risk"
    ],
    result_df["credit_risk_pred"],
)

This is the end of the notebook.