{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![MLU Logo](../../data/MLU_Logo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a name=\"0\">Responsible AI - Final Project Solution</a>\n",
    "\n",
    "Build a fair [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that predicts the __credit_risk__ field (whether some is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n",
    "\n",
    "### Final Project Problem: Loan Approval\n",
    "\n",
    "__Problem Definition:__\n",
    "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...) predict whether a loan is repaid or not (is customer a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n",
    "\n",
    "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be okay if age played a significant role in the model (loans should be approved/denied regardless of an individuals' age).\n",
    "\n",
    "\n",
    "``` \n",
    "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n",
    "```\n",
    "\n",
    "1. <a href=\"#1\">Read the datasets</a> (Given) \n",
    "2. <a href=\"#2\">Data Processing</a> (Implement)\n",
    "    * <a href=\"#21\">Exploratory Data Analysis</a>\n",
    "    * <a href=\"#22\">Select features to build the model</a> (Suggested)\n",
    "    * <a href=\"#23\">Train - Validation - Test Datasets</a>\n",
    "    * <a href=\"#24\">Feature transformation</a>\n",
    "3. <a href=\"#3\">Train a Classifier on the Training Dataset</a> (Implement)\n",
    "4. <a href=\"#4\">Make Predictions on the Test Dataset</a> (Implement)\n",
    "5. <a href=\"#5\">Evaluate Results</a> (Given)\n",
    "\n",
    "\n",
    "__Datasets and Files:__\n",
    "\n",
    "\n",
    "- ```german_credit_training.csv```: Training data with loan applicants features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n",
    "\n",
    "- ```german_credit_test.csv```: Test data with same features as above apart from label. This will be the data to make predictions for to emulate a production environment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook assumes an installation of the SageMaker kernel `conda_pytorch_p39`. In addition, libraries from a requirements.txt need to be installed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install --no-deps -U -q -r ../../requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%capture\n",
    "\n",
    "# Reshaping/basic libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Plotting libraries\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
    "\n",
    "# ML libraries\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# Operational libraries\n",
    "import sys\n",
    "\n",
    "sys.path.append(\"..\")\n",
    "sys.path.insert(1, \"..\")\n",
    "\n",
    "# Fairness libraries\n",
    "from folktables.acs import *\n",
    "from folktables.folktables import *\n",
    "from folktables.load_acs import *\n",
    "from fairlearn.reductions import EqualizedOdds\n",
    "from fairlearn.postprocessing import ThresholdOptimizer\n",
    "from fairlearn.metrics import MetricFrame, selection_rate\n",
    "\n",
    "# Jupyter(lab) libraries\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. <a name=\"1\">Read the datasets</a> (Given)\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, we read the __training__ and __test__ datasets into dataframes, using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of the training dataset is: (800, 24)\n",
      "The shape of the test dataset is: (200, 23)\n"
     ]
    }
   ],
   "source": [
    "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n",
    "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n",
    "\n",
    "print(\"The shape of the training dataset is:\", training_data.shape)\n",
    "print(\"The shape of the test dataset is:\", test_data.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. <a name=\"2\">Data Processing</a> (Implement)\n",
    "(<a href=\"#0\">Go to top</a>) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 <a name=\"21\">Exploratory Data Analysis</a>\n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "We look at number of rows, columns, and some simple statistics of the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>sex</th>\n",
       "      <th>personal_status</th>\n",
       "      <th>checking_account_status</th>\n",
       "      <th>credit_duration_months</th>\n",
       "      <th>credit_history</th>\n",
       "      <th>credit_purpose</th>\n",
       "      <th>credit_amount</th>\n",
       "      <th>savings</th>\n",
       "      <th>employed_since_years</th>\n",
       "      <th>...</th>\n",
       "      <th>age_years</th>\n",
       "      <th>other_installment_plans</th>\n",
       "      <th>housing</th>\n",
       "      <th>num_existing_credits</th>\n",
       "      <th>job_status</th>\n",
       "      <th>num_people_liable_for</th>\n",
       "      <th>telephone</th>\n",
       "      <th>foreign_worker</th>\n",
       "      <th>credit_risk</th>\n",
       "      <th>age_groups</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>564</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &lt; 0 DM</td>\n",
       "      <td>24</td>\n",
       "      <td>existing credits paid back duly till now</td>\n",
       "      <td>retraining</td>\n",
       "      <td>4712</td>\n",
       "      <td>.. &gt;= 1000 DM</td>\n",
       "      <td>1 &lt;= ... &lt; 4 years</td>\n",
       "      <td>...</td>\n",
       "      <td>37</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>management/self-employed/highly qualified empl...</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>484</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>12</td>\n",
       "      <td>all credits at this bank paid back duly</td>\n",
       "      <td>others</td>\n",
       "      <td>1163</td>\n",
       "      <td>100 &lt;= ... &lt; 500 DM</td>\n",
       "      <td>1 &lt;= ... &lt; 4 years</td>\n",
       "      <td>...</td>\n",
       "      <td>44</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>26</td>\n",
       "      <td>female</td>\n",
       "      <td>single</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>6</td>\n",
       "      <td>delay in paying off in the past</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>426</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>39</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>443</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>12</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>repairs</td>\n",
       "      <td>719</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>41</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>3 or more</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>211</td>\n",
       "      <td>female/male</td>\n",
       "      <td>female (non-single) or male (single)</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>36</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>3835</td>\n",
       "      <td>.. &gt;= 1000 DM</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>45</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    ID          sex                       personal_status  \\\n",
       "0  564         male                       married/widowed   \n",
       "1  484         male                       married/widowed   \n",
       "2   26       female                                single   \n",
       "3  443         male                       married/widowed   \n",
       "4  211  female/male  female (non-single) or male (single)   \n",
       "\n",
       "  checking_account_status  credit_duration_months  \\\n",
       "0              ... < 0 DM                      24   \n",
       "1           ... >= 200 DM                      12   \n",
       "2           ... >= 200 DM                       6   \n",
       "3           ... >= 200 DM                      12   \n",
       "4           ... >= 200 DM                      36   \n",
       "\n",
       "                                credit_history       credit_purpose  \\\n",
       "0     existing credits paid back duly till now           retraining   \n",
       "1      all credits at this bank paid back duly               others   \n",
       "2              delay in paying off in the past  furniture/equipment   \n",
       "3  no credits taken/all credits paid back duly              repairs   \n",
       "4  no credits taken/all credits paid back duly  furniture/equipment   \n",
       "\n",
       "   credit_amount                     savings employed_since_years  ...  \\\n",
       "0           4712               .. >= 1000 DM   1 <= ... < 4 years  ...   \n",
       "1           1163         100 <= ... < 500 DM   1 <= ... < 4 years  ...   \n",
       "2            426  unknown/no savings account        .. >= 7 years  ...   \n",
       "3            719  unknown/no savings account        .. >= 7 years  ...   \n",
       "4           3835               .. >= 1000 DM        .. >= 7 years  ...   \n",
       "\n",
       "  age_years other_installment_plans housing num_existing_credits  \\\n",
       "0        37                    bank    rent                  2-3   \n",
       "1        44                    none    rent                    1   \n",
       "2        39                    none    rent                    1   \n",
       "3        41                    bank    rent                    1   \n",
       "4        45                    none    rent                    1   \n",
       "\n",
       "                                          job_status num_people_liable_for  \\\n",
       "0  management/self-employed/highly qualified empl...                0 to 2   \n",
       "1                          skilled employee/official                0 to 2   \n",
       "2                               unskilled - resident                0 to 2   \n",
       "3                               unskilled - resident             3 or more   \n",
       "4                               unskilled - resident                0 to 2   \n",
       "\n",
       "  telephone foreign_worker credit_risk age_groups  \n",
       "0      True          False           0          1  \n",
       "1      True          False           0          1  \n",
       "2     False          False           0          1  \n",
       "3     False          False           1          1  \n",
       "4      True          False           0          1  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>sex</th>\n",
       "      <th>personal_status</th>\n",
       "      <th>checking_account_status</th>\n",
       "      <th>credit_duration_months</th>\n",
       "      <th>credit_history</th>\n",
       "      <th>credit_purpose</th>\n",
       "      <th>credit_amount</th>\n",
       "      <th>savings</th>\n",
       "      <th>employed_since_years</th>\n",
       "      <th>...</th>\n",
       "      <th>property</th>\n",
       "      <th>age_years</th>\n",
       "      <th>other_installment_plans</th>\n",
       "      <th>housing</th>\n",
       "      <th>num_existing_credits</th>\n",
       "      <th>job_status</th>\n",
       "      <th>num_people_liable_for</th>\n",
       "      <th>telephone</th>\n",
       "      <th>foreign_worker</th>\n",
       "      <th>age_groups</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>963</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>24</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>2397</td>\n",
       "      <td>100 &lt;= ... &lt; 500 DM</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>building society savings agreement/life insurance</td>\n",
       "      <td>35</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>611</td>\n",
       "      <td>female/male</td>\n",
       "      <td>female (non-single) or male (single)</td>\n",
       "      <td>0 &lt;= ... &lt; 200 DM</td>\n",
       "      <td>10</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>others</td>\n",
       "      <td>1240</td>\n",
       "      <td>... &lt; 100 DM</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>real estate</td>\n",
       "      <td>48</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>3 or more</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>106</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>18</td>\n",
       "      <td>critical account/other credits existing (not a...</td>\n",
       "      <td>others</td>\n",
       "      <td>6458</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>real estate</td>\n",
       "      <td>39</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>management/self-employed/highly qualified empl...</td>\n",
       "      <td>3 or more</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>891</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>15</td>\n",
       "      <td>all credits at this bank paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>1829</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>building society savings agreement/life insurance</td>\n",
       "      <td>46</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>342</td>\n",
       "      <td>female</td>\n",
       "      <td>single</td>\n",
       "      <td>... &lt; 0 DM</td>\n",
       "      <td>18</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>3213</td>\n",
       "      <td>100 &lt;= ... &lt; 500 DM</td>\n",
       "      <td>... &lt; 1 year</td>\n",
       "      <td>...</td>\n",
       "      <td>unknown/no property</td>\n",
       "      <td>25</td>\n",
       "      <td>none</td>\n",
       "      <td>for free</td>\n",
       "      <td>1</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    ID          sex                       personal_status  \\\n",
       "0  963         male                       married/widowed   \n",
       "1  611  female/male  female (non-single) or male (single)   \n",
       "2  106         male                       married/widowed   \n",
       "3  891         male                       married/widowed   \n",
       "4  342       female                                single   \n",
       "\n",
       "  checking_account_status  credit_duration_months  \\\n",
       "0           ... >= 200 DM                      24   \n",
       "1       0 <= ... < 200 DM                      10   \n",
       "2           ... >= 200 DM                      18   \n",
       "3           ... >= 200 DM                      15   \n",
       "4              ... < 0 DM                      18   \n",
       "\n",
       "                                      credit_history       credit_purpose  \\\n",
       "0        no credits taken/all credits paid back duly  furniture/equipment   \n",
       "1        no credits taken/all credits paid back duly               others   \n",
       "2  critical account/other credits existing (not a...               others   \n",
       "3            all credits at this bank paid back duly  furniture/equipment   \n",
       "4        no credits taken/all credits paid back duly  furniture/equipment   \n",
       "\n",
       "   credit_amount                     savings employed_since_years  ...  \\\n",
       "0           2397         100 <= ... < 500 DM        .. >= 7 years  ...   \n",
       "1           1240                ... < 100 DM        .. >= 7 years  ...   \n",
       "2           6458  unknown/no savings account        .. >= 7 years  ...   \n",
       "3           1829  unknown/no savings account        .. >= 7 years  ...   \n",
       "4           3213         100 <= ... < 500 DM         ... < 1 year  ...   \n",
       "\n",
       "                                            property age_years  \\\n",
       "0  building society savings agreement/life insurance        35   \n",
       "1                                        real estate        48   \n",
       "2                                        real estate        39   \n",
       "3  building society savings agreement/life insurance        46   \n",
       "4                                unknown/no property        25   \n",
       "\n",
       "  other_installment_plans   housing  num_existing_credits  \\\n",
       "0                    bank      rent                   2-3   \n",
       "1                    none       own                     1   \n",
       "2                    bank      rent                   2-3   \n",
       "3                    none      rent                   2-3   \n",
       "4                    none  for free                     1   \n",
       "\n",
       "                                          job_status num_people_liable_for  \\\n",
       "0                          skilled employee/official                0 to 2   \n",
       "1                               unskilled - resident             3 or more   \n",
       "2  management/self-employed/highly qualified empl...             3 or more   \n",
       "3                          skilled employee/official                0 to 2   \n",
       "4                          skilled employee/official                0 to 2   \n",
       "\n",
       "  telephone foreign_worker age_groups  \n",
       "0      True          False          1  \n",
       "1     False          False          1  \n",
       "2      True          False          1  \n",
       "3      True          False          1  \n",
       "4     False          False          1  \n",
       "\n",
       "[5 rows x 23 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Implement more EDA here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 <a name=\"22\">Select features to build the model</a> \n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "Let's use all the features. Below you see a snippet of code that separates categorical and numerical columns based on their data type. This should only be used if we are sure that the data types are correctly assigned (check during EDA). Mindful with some of the feature names - they suggest numerical values but upon inspection it should become clear that they are actually categoricals (e.g. `employed_since_years` has been binned into groups).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Categorical columns: ['sex', 'personal_status', 'checking_account_status', 'credit_history', 'credit_purpose', 'savings', 'employed_since_years', 'installment_rate', 'other_debtors_guarantors', 'present_residence_since', 'property', 'other_installment_plans', 'housing', 'num_existing_credits', 'job_status', 'num_people_liable_for']\n",
      "\n",
      "Numerical columns: ['ID', 'credit_duration_months', 'credit_amount', 'age_years', 'age_groups']\n"
     ]
    }
   ],
   "source": [
    "# Grab model features/inputs and target/output\n",
    "categorical_features = (\n",
    "    training_data.drop(\"credit_risk\", axis=1)\n",
    "    .select_dtypes(include=\"object\")\n",
    "    .columns.tolist()\n",
    ")\n",
    "print(\"Categorical columns:\", categorical_features)\n",
    "\n",
    "print(\"\")\n",
    "\n",
    "numerical_features = (\n",
    "    training_data.drop(\"credit_risk\", axis=1)\n",
    "    .select_dtypes(include=np.number)\n",
    "    .columns.tolist()\n",
    ")\n",
    "print(\"Numerical columns:\", numerical_features)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We notice that `ID` is identified as numerical column. ID's should never be used as features for training as they are unique by row. Let's drop the ID from the model features after we have separated target and features. Also make sure to remove the sensitive feature so it does not end up as input for training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model features:  ['sex', 'personal_status', 'checking_account_status', 'credit_history', 'credit_purpose', 'savings', 'employed_since_years', 'installment_rate', 'other_debtors_guarantors', 'present_residence_since', 'property', 'other_installment_plans', 'housing', 'num_existing_credits', 'job_status', 'num_people_liable_for', 'ID', 'credit_duration_months', 'credit_amount', 'age_years']\n",
      "\n",
      "\n",
      "Model target:  credit_risk\n"
     ]
    }
   ],
   "source": [
    "sensitive_feature = \"age_groups\"\n",
    "\n",
    "try:\n",
    "    numerical_features.remove(sensitive_feature)\n",
    "except:\n",
    "    pass\n",
    "\n",
    "try:\n",
    "    categorical_features.remove(sensitive_feature)\n",
    "except:\n",
    "    pass\n",
    "\n",
    "model_target = \"credit_risk\"\n",
    "model_features = categorical_features + numerical_features\n",
    "\n",
    "print(\"Model features: \", model_features)\n",
    "print(\"\\n\")\n",
    "print(\"Model target: \", model_target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "to_remove = \"ID\"\n",
    "\n",
    "# Drop 'ID' feature from the respective list(s)\n",
    "if to_remove in model_features:\n",
    "    model_features.remove(to_remove)\n",
    "if to_remove in categorical_features:\n",
    "    categorical_features.remove(to_remove)\n",
    "if to_remove in numerical_features:\n",
    "    numerical_features.remove(to_remove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also remove `age_years` as this is an obvious proxy for the age groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "to_remove = \"age_years\"\n",
    "\n",
    "# Drop 'ID' feature from the respective list(s)\n",
    "if to_remove in model_features:\n",
    "    model_features.remove(to_remove)\n",
    "if to_remove in categorical_features:\n",
    "    categorical_features.remove(to_remove)\n",
    "if to_remove in numerical_features:\n",
    "    numerical_features.remove(to_remove)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 <a name=\"23\">Feature transformation</a>\n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "Here, you have different options. You can use Reweighing, Disparate Impact Remover or Suppression. However, in this notebook you should try to implement Equalized Odds postprocessing. Therefore, no transformation is required at this point."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 <a name=\"24\">Train - Validation Datasets</a>\n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. \n",
    "\n",
    "To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Implement here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train - Test - Validation datasets shapes:  (720, 24) (200, 23) (80, 24)\n"
     ]
    }
   ],
   "source": [
    "# We only need to split between train and val (test is already separate)\n",
    "train_data, val_data = train_test_split(\n",
    "    training_data, test_size=0.1, shuffle=True, random_state=23\n",
    ")\n",
    "\n",
    "# Print the shapes of the Train - Test Datasets\n",
    "print(\n",
    "    \"Train - Test - Validation datasets shapes: \",\n",
    "    train_data.shape,\n",
    "    test_data.shape,\n",
    "    val_data.shape,\n",
    ")\n",
    "\n",
    "train_data.reset_index(inplace=True, drop=True)\n",
    "val_data.reset_index(inplace=True, drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.5 <a name=\"25\">Data processing with Pipeline</a>\n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)  to impute the missing values and scale the numerical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)  on the imputed and scaled dataset. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Implement here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;data_processing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_imputer&#x27;,\n",
       "                                                                   SimpleImputer()),\n",
       "                                                                  (&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;credit_duration_months&#x27;,\n",
       "                                                   &#x27;credit_amount&#x27;]),\n",
       "                                                 (&#x27;categorical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;cat_imputer&#x27;,\n",
       "                                                                   SimpleImputer(fill_value=&#x27;missing&#x27;,\n",
       "                                                                                 strategy=&#x27;constant&#x27;)),\n",
       "                                                                  (&#x27;cat_encoder&#x27;,\n",
       "                                                                   OneHot...))]),\n",
       "                                                  [&#x27;sex&#x27;, &#x27;personal_status&#x27;,\n",
       "                                                   &#x27;checking_account_status&#x27;,\n",
       "                                                   &#x27;credit_history&#x27;,\n",
       "                                                   &#x27;credit_purpose&#x27;, &#x27;savings&#x27;,\n",
       "                                                   &#x27;employed_since_years&#x27;,\n",
       "                                                   &#x27;installment_rate&#x27;,\n",
       "                                                   &#x27;other_debtors_guarantors&#x27;,\n",
       "                                                   &#x27;present_residence_since&#x27;,\n",
       "                                                   &#x27;property&#x27;,\n",
       "                                                   &#x27;other_installment_plans&#x27;,\n",
       "                                                   &#x27;housing&#x27;,\n",
       "                                                   &#x27;num_existing_credits&#x27;,\n",
       "                                                   &#x27;job_status&#x27;,\n",
       "                                                   &#x27;num_people_liable_for&#x27;])])),\n",
       "                (&#x27;lg&#x27;, LogisticRegression(random_state=0))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[(&#x27;data_processing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_imputer&#x27;,\n",
       "                                                                   SimpleImputer()),\n",
       "                                                                  (&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;credit_duration_months&#x27;,\n",
       "                                                   &#x27;credit_amount&#x27;]),\n",
       "                                                 (&#x27;categorical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;cat_imputer&#x27;,\n",
       "                                                                   SimpleImputer(fill_value=&#x27;missing&#x27;,\n",
       "                                                                                 strategy=&#x27;constant&#x27;)),\n",
       "                                                                  (&#x27;cat_encoder&#x27;,\n",
       "                                                                   OneHot...))]),\n",
       "                                                  [&#x27;sex&#x27;, &#x27;personal_status&#x27;,\n",
       "                                                   &#x27;checking_account_status&#x27;,\n",
       "                                                   &#x27;credit_history&#x27;,\n",
       "                                                   &#x27;credit_purpose&#x27;, &#x27;savings&#x27;,\n",
       "                                                   &#x27;employed_since_years&#x27;,\n",
       "                                                   &#x27;installment_rate&#x27;,\n",
       "                                                   &#x27;other_debtors_guarantors&#x27;,\n",
       "                                                   &#x27;present_residence_since&#x27;,\n",
       "                                                   &#x27;property&#x27;,\n",
       "                                                   &#x27;other_installment_plans&#x27;,\n",
       "                                                   &#x27;housing&#x27;,\n",
       "                                                   &#x27;num_existing_credits&#x27;,\n",
       "                                                   &#x27;job_status&#x27;,\n",
       "                                                   &#x27;num_people_liable_for&#x27;])])),\n",
       "                (&#x27;lg&#x27;, LogisticRegression(random_state=0))])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">data_processing: ColumnTransformer</label><div class=\"sk-toggleable__content\"><pre>ColumnTransformer(transformers=[(&#x27;numerical_processing&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;num_imputer&#x27;,\n",
       "                                                  SimpleImputer()),\n",
       "                                                 (&#x27;num_scaler&#x27;,\n",
       "                                                  MinMaxScaler())]),\n",
       "                                 [&#x27;credit_duration_months&#x27;, &#x27;credit_amount&#x27;]),\n",
       "                                (&#x27;categorical_processing&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;cat_imputer&#x27;,\n",
       "                                                  SimpleImputer(fill_value=&#x27;missing&#x27;,\n",
       "                                                                strategy=&#x27;constant&#x27;)),\n",
       "                                                 (&#x27;cat_encoder&#x27;,\n",
       "                                                  OneHotEncoder(handle_unknown=&#x27;ignore&#x27;))]),\n",
       "                                 [&#x27;sex&#x27;, &#x27;personal_status&#x27;,\n",
       "                                  &#x27;checking_account_status&#x27;, &#x27;credit_history&#x27;,\n",
       "                                  &#x27;credit_purpose&#x27;, &#x27;savings&#x27;,\n",
       "                                  &#x27;employed_since_years&#x27;, &#x27;installment_rate&#x27;,\n",
       "                                  &#x27;other_debtors_guarantors&#x27;,\n",
       "                                  &#x27;present_residence_since&#x27;, &#x27;property&#x27;,\n",
       "                                  &#x27;other_installment_plans&#x27;, &#x27;housing&#x27;,\n",
       "                                  &#x27;num_existing_credits&#x27;, &#x27;job_status&#x27;,\n",
       "                                  &#x27;num_people_liable_for&#x27;])])</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">numerical_processing</label><div class=\"sk-toggleable__content\"><pre>[&#x27;credit_duration_months&#x27;, &#x27;credit_amount&#x27;]</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SimpleImputer</label><div class=\"sk-toggleable__content\"><pre>SimpleImputer()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-5\" type=\"checkbox\" ><label for=\"sk-estimator-id-5\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">MinMaxScaler</label><div class=\"sk-toggleable__content\"><pre>MinMaxScaler()</pre></div></div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-6\" type=\"checkbox\" ><label for=\"sk-estimator-id-6\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">categorical_processing</label><div class=\"sk-toggleable__content\"><pre>[&#x27;sex&#x27;, &#x27;personal_status&#x27;, &#x27;checking_account_status&#x27;, &#x27;credit_history&#x27;, &#x27;credit_purpose&#x27;, &#x27;savings&#x27;, &#x27;employed_since_years&#x27;, &#x27;installment_rate&#x27;, &#x27;other_debtors_guarantors&#x27;, &#x27;present_residence_since&#x27;, &#x27;property&#x27;, &#x27;other_installment_plans&#x27;, &#x27;housing&#x27;, &#x27;num_existing_credits&#x27;, &#x27;job_status&#x27;, &#x27;num_people_liable_for&#x27;]</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-7\" type=\"checkbox\" ><label for=\"sk-estimator-id-7\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SimpleImputer</label><div class=\"sk-toggleable__content\"><pre>SimpleImputer(fill_value=&#x27;missing&#x27;, strategy=&#x27;constant&#x27;)</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-8\" type=\"checkbox\" ><label for=\"sk-estimator-id-8\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">OneHotEncoder</label><div class=\"sk-toggleable__content\"><pre>OneHotEncoder(handle_unknown=&#x27;ignore&#x27;)</pre></div></div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-9\" type=\"checkbox\" ><label for=\"sk-estimator-id-9\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LogisticRegression</label><div class=\"sk-toggleable__content\"><pre>LogisticRegression(random_state=0)</pre></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('data_processing',\n",
       "                 ColumnTransformer(transformers=[('numerical_processing',\n",
       "                                                  Pipeline(steps=[('num_imputer',\n",
       "                                                                   SimpleImputer()),\n",
       "                                                                  ('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['credit_duration_months',\n",
       "                                                   'credit_amount']),\n",
       "                                                 ('categorical_processing',\n",
       "                                                  Pipeline(steps=[('cat_imputer',\n",
       "                                                                   SimpleImputer(fill_value='missing',\n",
       "                                                                                 strategy='constant')),\n",
       "                                                                  ('cat_encoder',\n",
       "                                                                   OneHot...))]),\n",
       "                                                  ['sex', 'personal_status',\n",
       "                                                   'checking_account_status',\n",
       "                                                   'credit_history',\n",
       "                                                   'credit_purpose', 'savings',\n",
       "                                                   'employed_since_years',\n",
       "                                                   'installment_rate',\n",
       "                                                   'other_debtors_guarantors',\n",
       "                                                   'present_residence_since',\n",
       "                                                   'property',\n",
       "                                                   'other_installment_plans',\n",
       "                                                   'housing',\n",
       "                                                   'num_existing_credits',\n",
       "                                                   'job_status',\n",
       "                                                   'num_people_liable_for'])])),\n",
       "                ('lg', LogisticRegression(random_state=0))])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### STEP 1 ###\n",
    "##############\n",
    "\n",
    "# Preprocess the numerical features\n",
    "numerical_processor = Pipeline(\n",
    "    [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n",
    ")\n",
    "# Preprocess the categorical features\n",
    "categorical_processor = Pipeline(\n",
    "    [\n",
    "        (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
    "        (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\")),\n",
    "    ]\n",
    ")\n",
    "\n",
    "### STEP 2 ###\n",
    "##############\n",
    "\n",
    "# Combine all data preprocessors from above\n",
    "data_processor = ColumnTransformer(\n",
    "    [\n",
    "        (\"numerical_processing\", numerical_processor, numerical_features),\n",
    "        (\"categorical_processing\", categorical_processor, categorical_features),\n",
    "    ]\n",
    ")\n",
    "\n",
    "\n",
    "### STEP 3 ###\n",
    "##############\n",
    "\n",
    "# Pipeline desired all data transformers, along with an estimator at the end\n",
    "# Later you can set/reach the parameters using the names issued - for hyperparameter tuning, for example\n",
    "pipeline = Pipeline(\n",
    "    [\n",
    "        (\"data_processing\", data_processor),\n",
    "        (\"lg\", LogisticRegression(random_state=0)),\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Visualize the pipeline\n",
    "# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps\n",
    "from sklearn import set_config\n",
    "\n",
    "set_config(display=\"diagram\")\n",
    "pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. <a name=\"3\">Train (and Tune) a Classifier</a> (Implement)\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies, different scaling methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Implement here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>ThresholdOptimizer(constraints=&#x27;equalized_odds&#x27;,\n",
       "                   estimator=LogisticRegression(random_state=0))</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-10\" type=\"checkbox\" ><label for=\"sk-estimator-id-10\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">ThresholdOptimizer</label><div class=\"sk-toggleable__content\"><pre>ThresholdOptimizer(constraints=&#x27;equalized_odds&#x27;,\n",
       "                   estimator=LogisticRegression(random_state=0))</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-11\" type=\"checkbox\" ><label for=\"sk-estimator-id-11\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">estimator: LogisticRegression</label><div class=\"sk-toggleable__content\"><pre>LogisticRegression(random_state=0)</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-12\" type=\"checkbox\" ><label for=\"sk-estimator-id-12\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LogisticRegression</label><div class=\"sk-toggleable__content\"><pre>LogisticRegression(random_state=0)</pre></div></div></div></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "ThresholdOptimizer(constraints='equalized_odds',\n",
       "                   estimator=LogisticRegression(random_state=0))"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get train data to train the classifier\n",
    "X_train = train_data[model_features]\n",
    "y_train = train_data[model_target]\n",
    "\n",
    "# Learn the transformation & extract feature names\n",
    "data_processor.fit(X_train)\n",
    "\n",
    "# To extract feature names we first need to fit the data processor as this will generate the one hot encoding\n",
    "ft_names = numerical_features + list(\n",
    "    data_processor.transformers_[1][1]\n",
    "    .named_steps[\"cat_encoder\"]\n",
    "    .get_feature_names(categorical_features)\n",
    ")\n",
    "\n",
    "# Add column names and convert to data frame\n",
    "X_train_prep = pd.DataFrame(\n",
    "    data_processor.transform(X_train).todense(), columns=ft_names\n",
    ")\n",
    "\n",
    "# Set up ThresholdOptimizer\n",
    "eo_model = ThresholdOptimizer(\n",
    "    estimator=pipeline[-1],\n",
    "    constraints=\"equalized_odds\",\n",
    "    objective=\"accuracy_score\",\n",
    "    grid_size=1000,\n",
    "    flip=False,\n",
    "    prefit=False,\n",
    "    predict_method=\"deprecated\",\n",
    ")\n",
    "\n",
    "# Adjust the results that the classifier would produce by letting ThresholdOptimizer know what the sensitive features are\n",
    "eo_model.fit(X_train_prep, y_train, sensitive_features=train_data[\"age_groups\"].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. <a name=\"4\">Make Predictions on the Test Dataset</a> (Implement)\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates for DI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Implement here\n",
    "\n",
    "# Get test data to test the classifier\n",
    "# ! test data should come from german_credit_test.csv !\n",
    "# ...\n",
    "\n",
    "# Use the trained model to make predictions on the test dataset\n",
    "# test_predictions = ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# You can now use the fitted equalized odds post-processor and use it to create adjusted outcomes (predictions)\n",
    "test_predictions = eo_model.predict(\n",
    "    data_processor.transform(test_data[model_features]).todense(),\n",
    "    sensitive_features=test_data[\"age_groups\"].values,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. <a name=\"5\">Evaluate Results</a> (Given)\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n",
    "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n",
    "result_df[\"credit_risk_pred\"] = test_predictions\n",
    "\n",
    "result_df.to_csv(\"../../data/final_project/project_day3_result.csv\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Final Evaluation on Test Data - Disparate Impact\n",
    "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n",
    "    \"\"\"\n",
    "    Function to calculate Disparate Impact metric using the results from this notebook.\n",
    "    \"\"\"\n",
    "    try:\n",
    "        # Merge predictions with original test data to model per group\n",
    "        di_df = pred_df.merge(test_data, on=\"ID\")\n",
    "        # Count for group with members less than 25y old\n",
    "        pos_outcomes_less25 = di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts()[\n",
    "            0\n",
    "        ]  # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
    "        total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n",
    "        # Count for group with members greater equal 25y old\n",
    "        pos_outcomes_geq25 = di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts()[\n",
    "            0\n",
    "        ]  # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
    "        total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n",
    "        # Check if correct number of gorups\n",
    "        if total_geq25 == 0:\n",
    "            print(\"There is only one group present in the data.\")\n",
    "        elif total_less25 == 0:\n",
    "            print(\"There is only one group present in the data.\")\n",
    "        else:\n",
    "            disparate_impact = (pos_outcomes_less25 / total_less25) / (\n",
    "                pos_outcomes_geq25 / total_geq25\n",
    "            )\n",
    "            return disparate_impact\n",
    "    except:\n",
    "        print(\"Wrong inputs provided.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6343283582089553"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "calculate_di(test_data, result_df, \"credit_risk_pred\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Final Evaluation on Test Data - Accuracy & F1 Score\n",
    "In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.745"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(\n",
    "    pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
    "        \"credit_risk\"\n",
    "    ],\n",
    "    result_df[\"credit_risk_pred\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.5405405405405405"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(\n",
    "    pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
    "        \"credit_risk\"\n",
    "    ],\n",
    "    result_df[\"credit_risk_pred\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the end of the notebook."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p39",
   "language": "python",
   "name": "conda_pytorch_p39"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}