![MLU Logo](../../data/MLU_Logo.png)

# <a name="0">Responsible AI - Final Project Solution</a>

Build a fair [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).

### Final Project Problem: Loan Approval

__Problem Definition:__
Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan will be repaid (that is, whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\geq$ 25 yrs and $<$ 25 yrs).

In the banking industry, there are regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be acceptable for age to play a significant role in the model (loans should be approved or denied regardless of an individual's age).


``` 
F. Kamiran and T. Calders, "Data Preprocessing Techniques for Classification without Discrimination," Knowledge and Information Systems, 2012
```

1. <a href="#1">Read the datasets</a> (Given) 
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Feature transformation</a>
    * <a href="#24">Train - Validation Datasets</a>
    * <a href="#25">Data processing with Pipeline</a>
3. <a href="#3">Train (and Tune) a Classifier</a> (Implement)
4. <a href="#4">Make Predictions on the Test Dataset</a> (Implement)
5. <a href="#5">Evaluate Results</a> (Given)


__Datasets and Files:__


- ```german_credit_training.csv```: Training data with loan applicants' features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.

- ```german_credit_test.csv```: Test data with the same features as above but without the label. We make predictions on this data to emulate a production environment.

This notebook assumes an installation of the SageMaker kernel `conda_pytorch_p39`. In addition, the libraries listed in `requirements.txt` need to be installed:

In [None]:
!pip install --no-deps -U -q -r ../../requirements.txt

In [None]:
%%capture

# Reshaping/basic libraries
import pandas as pd
import numpy as np

# Plotting libraries
import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns

sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Operational libraries
import sys

sys.path.insert(1, "..")

# Fairness libraries
from folktables.acs import *
from folktables.folktables import *
from folktables.load_acs import *
from fairlearn.reductions import EqualizedOdds
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.metrics import MetricFrame, selection_rate

# Jupyter(lab) libraries
import warnings

warnings.filterwarnings("ignore")

## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)

Then, we read the __training__ and __test__ datasets into dataframes, using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data.

In [None]:
training_data = pd.read_csv("../../data/final_project/german_credit_training.csv")
test_data = pd.read_csv("../../data/final_project/german_credit_test.csv")

print("The shape of the training dataset is:", training_data.shape)
print("The shape of the test dataset is:", test_data.shape)

## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>) 

### 2.1 <a name="21">Exploratory Data Analysis</a>
(<a href="#2">Go to Data Processing</a>)

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [None]:
training_data.head()

In [None]:
test_data.head()

In [None]:
# Implement more EDA here
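
As a starting point for further EDA, the sketch below checks the label distribution, the missing values per column, and the share of favorable labels per age group. It runs on a small made-up frame (`toy_data`) rather than on `training_data`; the column `credit_amount` and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for training_data; all values are made up
toy_data = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6, 7, 8],
        "age_groups": [0, 0, 0, 1, 1, 1, 1, 1],
        "credit_amount": [1200.0, None, 800.0, 5000.0, 2500.0, None, 3100.0, 950.0],
        "credit_risk": [1, 1, 0, 0, 0, 1, 0, 0],
    }
)

# Share of each label value: shows whether the target is skewed
target_share = toy_data["credit_risk"].value_counts(normalize=True)

# Number of missing values per column: informs the imputation step
missing_counts = toy_data.isna().sum()

# Rows per sensitive group
group_sizes = toy_data["age_groups"].value_counts()

# Share of label 0 ("not a credit risk") within each age group
favorable_rate = (toy_data["credit_risk"] == 0).groupby(toy_data["age_groups"]).mean()

print(target_share, missing_counts, group_sizes, favorable_rate, sep="\n\n")
```

The same calls can be applied to `training_data` once the column names have been confirmed with `training_data.head()`.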

### 2.2 <a name="22">Select features to build the model</a> 
(<a href="#2">Go to Data Processing</a>)

Let's use all the features. Below you see a snippet of code that separates categorical and numerical columns based on their data type. This should only be used if we are sure that the data types are correctly assigned (check during EDA). Be mindful of some of the feature names: they suggest numerical values, but upon inspection it should become clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).


In [None]:
# Grab model features/inputs and target/output
categorical_features = (
    training_data.drop("credit_risk", axis=1)
    .select_dtypes(include="object")
    .columns.tolist()
)
print("Categorical columns:", categorical_features)

print("")

numerical_features = (
    training_data.drop("credit_risk", axis=1)
    .select_dtypes(include=np.number)
    .columns.tolist()
)
print("Numerical columns:", numerical_features)
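
If EDA shows that a binned feature is stored as integer codes, a dtype-based split would treat it as numerical. One possible remedy, sketched here on made-up data, is to cast such columns to a categorical dtype before splitting. The list `coded_categoricals` and all values are assumptions; whether any column in the real data needs this has to be checked during EDA.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the bin codes in employed_since_years are made up
toy_data = pd.DataFrame(
    {
        "employed_since_years": [1, 3, 2, 4],
        "credit_amount": [1200.0, 800.0, 5000.0, 2500.0],
        "purpose": ["car", "tv", "car", "repairs"],
    }
)

# Assumption: EDA showed that these integer-coded columns are binned categories
coded_categoricals = ["employed_since_years"]
toy_data[coded_categoricals] = toy_data[coded_categoricals].astype("category")

# Numerical columns by dtype; everything else is treated as categorical
numerical = toy_data.select_dtypes(include=np.number).columns.tolist()
categorical = [col for col in toy_data.columns if col not in numerical]

print("Categorical columns:", categorical)
print("Numerical columns:", numerical)
```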

We notice that `ID` is identified as a numerical column. IDs should never be used as features for training because they are unique for each row. Let's drop the ID from the model features after we have separated target and features. Also make sure to remove the sensitive feature so it does not end up as an input for training.

In [None]:
sensitive_feature = "age_groups"

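try:
    numerical_features.remove(sensitive_feature)
except ValueError:
    # The sensitive feature is not in this list
    pass

try:
    categorical_features.remove(sensitive_feature)
except ValueError:
    # The sensitive feature is not in this list
    pass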

model_target = "credit_risk"
model_features = categorical_features + numerical_features

print("Model features: ", model_features)
print("\n")
print("Model target: ", model_target)

In [None]:
to_remove = "ID"

# Drop 'ID' feature from the respective list(s)
if to_remove in model_features:
    model_features.remove(to_remove)
if to_remove in categorical_features:
    categorical_features.remove(to_remove)
if to_remove in numerical_features:
    numerical_features.remove(to_remove)

Let's also remove `age_years` as this is an obvious proxy for the age groups.

In [None]:
to_remove = "age_years"

# Drop 'age_years' feature from the respective list(s)
if to_remove in model_features:
    model_features.remove(to_remove)
if to_remove in categorical_features:
    categorical_features.remove(to_remove)
if to_remove in numerical_features:
    numerical_features.remove(to_remove)

### 2.3 <a name="23">Feature transformation</a>
(<a href="#2">Go to Data Processing</a>)

Here, you have different options. You can use Reweighing, Disparate Impact Remover or Suppression. However, in this notebook you should try to implement Equalized Odds postprocessing. Therefore, no transformation is required at this point.

### 2.4 <a name="24">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. 

To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. 

In [None]:
# Implement here
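
A minimal sketch of such a split, shown on a made-up frame instead of `training_data`. Stratifying on the label is an assumption that tends to help when the target is skewed; the split size and `random_state` are arbitrary choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for training_data
toy_data = pd.DataFrame(
    {
        "ID": list(range(20)),
        "age_groups": [0, 1] * 10,
        "credit_risk": [0, 0, 0, 1] * 5,
    }
)

# Hold out 20% for validation and keep the label ratio equal in both subsets
train_data, val_data = train_test_split(
    toy_data,
    test_size=0.2,
    shuffle=True,
    stratify=toy_data["credit_risk"],
    random_state=23,
)

print("Train shape:", train_data.shape)
print("Validation shape:", val_data.shape)
```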

### 2.5 <a name="25">Data processing with Pipeline</a>
(<a href="#2">Go to Data Processing</a>)

Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that imputes the missing values and scales the numerical features, and then trains a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the imputed and scaled dataset.


In [None]:
# Implement here
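
One way to structure such a pipeline is sketched below on toy data. The column names, the one-hot encoding of the categorical column, and the imputation strategies are assumptions rather than the required solution; in the notebook, the column lists would be `numerical_features` and `categorical_features`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data with one missing value per column
X = pd.DataFrame(
    {
        "credit_amount": [1200.0, np.nan, 800.0, 5000.0, 2500.0, 3100.0, 950.0, 4000.0],
        "purpose": pd.Series(
            ["car", "tv", np.nan, "repairs", "tv", "car", "repairs", "tv"], dtype=object
        ),
    }
)
y = [0, 1, 0, 1, 0, 1, 0, 1]

# Numerical columns: impute with the mean, then scale to [0, 1]
numerical_processor = Pipeline(
    [("num_imputer", SimpleImputer(strategy="mean")), ("num_scaler", MinMaxScaler())]
)
# Categorical columns: impute with the most frequent value, then one-hot encode
categorical_processor = Pipeline(
    [
        ("cat_imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)
# Route each column list to its processor
data_processor = ColumnTransformer(
    [
        ("numerical_processing", numerical_processor, ["credit_amount"]),
        ("categorical_processing", categorical_processor, ["purpose"]),
    ]
)
# Chain preprocessing and the classifier into a single estimator
pipeline = Pipeline(
    [("data_processing", data_processor), ("lg", LogisticRegression(random_state=0))]
)

pipeline.fit(X, y)
train_predictions = pipeline.predict(X)
print(train_predictions)
```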

## 3. <a name="3">Train (and Tune) a Classifier</a> (Implement)
(<a href="#0">Go to top</a>)

Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies and different scaling methods.

In [None]:
# Implement here
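
Trying different imputation strategies and scalers can be automated with a grid search. The sketch below uses synthetic data and made-up column names; the grid, the `f1` scoring, and the number of folds are assumptions, not prescribed settings.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic numerical data with some missing values
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "credit_amount": rng.normal(3000, 1000, 60),
        "duration_months": rng.integers(6, 48, 60).astype(float),
    }
)
X.iloc[::10, 0] = np.nan
y = (X["duration_months"] > 24).astype(int)

pipeline = Pipeline(
    [
        ("imputer", SimpleImputer()),
        ("scaler", MinMaxScaler()),
        ("lg", LogisticRegression()),
    ]
)

# Candidate imputation strategies and scalers; a whole step can be swapped
# by listing alternative estimators under the step name
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "scaler": [MinMaxScaler(), StandardScaler()],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

After the search, `search.best_estimator_` is the pipeline refit on all the data passed to `fit` with the best parameter combination.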

## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained classifier to predict the labels on the test set. The evaluation section that follows provides a code snippet that computes the disparate impact (DI) of these predictions.

In [None]:
# Implement here

# Get test data to test the classifier
# ! test data should come from german_credit_test.csv !
# ...

# Use the trained model to make predictions on the test dataset
# test_predictions = ...

## 5. <a name="5">Evaluate Results</a> (Given)
(<a href="#0">Go to top</a>)

In [None]:
result_df = pd.DataFrame(columns=["ID", "credit_risk_pred"])
result_df["ID"] = test_data["ID"].tolist()
result_df["credit_risk_pred"] = test_predictions

result_df.to_csv("../../data/final_project/project_day3_result.csv", index=False)

### Final Evaluation on Test Data - Disparate Impact
To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html).
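
As a reading aid, the DI computed by the `calculate_di` function below can be written as follows. This assumes, as in the comments of that function, that `age_groups == 0` marks applicants under 25 and that a prediction of 0 ("not a credit risk") is the favorable outcome; $\hat{y}$ denotes the predicted label.

$$
\mathrm{DI} = \frac{P(\hat{y} = 0 \mid \text{age} < 25)}{P(\hat{y} = 0 \mid \text{age} \geq 25)}
$$

For example, with made-up counts: if 30 of 50 applicants under 25 and 80 of 100 applicants aged 25 or older are predicted as 0, then $\mathrm{DI} = (30/50) / (80/100) = 0.6 / 0.8 = 0.75$. A value of 1 means both groups receive the favorable prediction at the same rate.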

In [None]:
def calculate_di(test_data, pred_df, pred_col="credit_risk_pred"):
    """
    Calculate the Disparate Impact metric using the results from this notebook.
    """
    try:
        # Merge predictions with original test data to evaluate per group
        di_df = pred_df.merge(test_data, on="ID")
        # Group with members less than 25y old
        less25 = di_df[di_df["age_groups"] == 0]
        # Group with members greater than or equal to 25y old
        geq25 = di_df[di_df["age_groups"] == 1]
        # Check that both groups are present
        if len(less25) == 0 or len(geq25) == 0:
            print("There is only one group present in the data.")
            return None
        # Positive outcome is a prediction of 0 ('not credit risk');
        # comparing with 0 also works if a group has no such predictions
        rate_less25 = (less25[pred_col] == 0).mean()
        rate_geq25 = (geq25[pred_col] == 0).mean()
        return rate_less25 / rate_geq25
    except Exception as err:
        print("Wrong inputs provided:", err)

In [None]:
calculate_di(test_data, result_df, "credit_risk_pred")

### Final Evaluation on Test Data - Accuracy & F1 Score
In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy.
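
To see why accuracy alone can mislead on a skewed target, consider this small illustration with made-up labels. A trivial model that always predicts the majority class reaches the same accuracy as a model that actually finds some minority cases, but the F1 scores differ.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical skewed labels: 8 of 10 samples belong to class 0
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A trivial model that always predicts the majority class
y_majority = [0] * 10

# A model that finds one of the two class-1 cases and raises one false alarm
y_model = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc_majority = accuracy_score(y_true, y_majority)  # 0.8
f1_majority = f1_score(y_true, y_majority, zero_division=0)  # 0.0
acc_model = accuracy_score(y_true, y_model)  # 0.8
f1_model = f1_score(y_true, y_model)  # 0.5

print("Majority baseline: accuracy", acc_majority, "F1", f1_majority)
print("Model:             accuracy", acc_model, "F1", f1_model)
```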

In [None]:
accuracy_score(
    pd.read_csv("../../data/final_project/german_credit_test_labels.csv")[
        "credit_risk"
    ],
    result_df["credit_risk_pred"],
)

In [None]:
f1_score(
    pd.read_csv("../../data/final_project/german_credit_test_labels.csv")[
        "credit_risk"
    ],
    result_df["credit_risk_pred"],
)

This is the end of the notebook.