{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![MLU Logo](../../data/MLU_Logo.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <a name=\"0\">Responsible AI - Final Project Solution</a>\n",
    "\n",
    "Build a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n",
    "\n",
    "### Final Project Problem: Loan Approval\n",
    "\n",
    "__Problem Definition:__\n",
    "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan will be repaid or not (i.e., whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n",
    "\n",
    "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be acceptable for age to play a significant role in the model (loans should be approved or denied regardless of an individual's age).\n",
    "\n",
    "For example, certain laws declare it unlawful for creditors to discriminate against any applicant on the basis of age (or other sensitive attributes). For more details, have a look at this paper:\n",
    "\n",
    "``` \n",
    "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n",
    "```\n",
    "\n",
    "__Table of contents__\n",
    "\n",
    "1. <a href=\"#1\">Read the datasets</a> (Given) \n",
    "2. <a href=\"#2\">Data Processing</a> (Implement)\n",
    "    * <a href=\"#21\">Exploratory Data Analysis</a>\n",
    "    * <a href=\"#22\">Select features to build the model</a> (Suggested)\n",
    "    * <a href=\"#23\">Train - Validation Datasets</a>\n",
    "    * <a href=\"#24\">Data Processing with Pipeline</a>\n",
    "3. <a href=\"#3\">Train (and Tune) a Classifier on the Training Dataset</a> (Implement)\n",
    "4. <a href=\"#4\">Make Predictions on the Test Dataset</a> (Implement)\n",
    "5. <a href=\"#5\">Evaluate Results</a> (Given)\n",
    "\n",
    "\n",
    "__Datasets and Files__\n",
    "\n",
    "\n",
    "- ```german_credit_training.csv```: Training data with loan applicants' features, credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n",
    "\n",
    "- ```german_credit_test.csv```: Test data with the same features as above, except for the label. We make predictions on this data to emulate a production environment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook assumes the SageMaker kernel `conda_pytorch_p39`. In addition, the libraries listed in `requirements.txt` need to be installed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --no-deps -U -q -r ../../requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reshaping/basic libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# Plotting libraries\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n",
    "\n",
    "# ML libraries\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# Operational libraries\n",
    "import sys\n",
    "\n",
    "sys.path.append(\"..\")\n",
    "\n",
    "# Jupyter(lab) libraries\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. <a name=\"1\">Read the datasets</a> (Given)\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We read the __training__ and __test__ datasets into dataframes using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The shape of the training dataset is: (800, 24)\n",
      "The shape of the test dataset is: (200, 23)\n"
     ]
    }
   ],
   "source": [
    "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n",
    "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n",
    "\n",
    "print(\"The shape of the training dataset is:\", training_data.shape)\n",
    "print(\"The shape of the test dataset is:\", test_data.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. <a name=\"2\">Data Processing</a> (Implement)\n",
    "(<a href=\"#0\">Go to top</a>) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 <a name=\"21\">Exploratory Data Analysis</a>\n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "We look at the number of rows and columns, and at some simple statistics of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>sex</th>\n",
       "      <th>personal_status</th>\n",
       "      <th>checking_account_status</th>\n",
       "      <th>credit_duration_months</th>\n",
       "      <th>credit_history</th>\n",
       "      <th>credit_purpose</th>\n",
       "      <th>credit_amount</th>\n",
       "      <th>savings</th>\n",
       "      <th>employed_since_years</th>\n",
       "      <th>...</th>\n",
       "      <th>age_years</th>\n",
       "      <th>other_installment_plans</th>\n",
       "      <th>housing</th>\n",
       "      <th>num_existing_credits</th>\n",
       "      <th>job_status</th>\n",
       "      <th>num_people_liable_for</th>\n",
       "      <th>telephone</th>\n",
       "      <th>foreign_worker</th>\n",
       "      <th>credit_risk</th>\n",
       "      <th>age_groups</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>564</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &lt; 0 DM</td>\n",
       "      <td>24</td>\n",
       "      <td>existing credits paid back duly till now</td>\n",
       "      <td>retraining</td>\n",
       "      <td>4712</td>\n",
       "      <td>.. &gt;= 1000 DM</td>\n",
       "      <td>1 &lt;= ... &lt; 4 years</td>\n",
       "      <td>...</td>\n",
       "      <td>37</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>management/self-employed/highly qualified empl...</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>484</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>12</td>\n",
       "      <td>all credits at this bank paid back duly</td>\n",
       "      <td>others</td>\n",
       "      <td>1163</td>\n",
       "      <td>100 &lt;= ... &lt; 500 DM</td>\n",
       "      <td>1 &lt;= ... &lt; 4 years</td>\n",
       "      <td>...</td>\n",
       "      <td>44</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>26</td>\n",
       "      <td>female</td>\n",
       "      <td>single</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>6</td>\n",
       "      <td>delay in paying off in the past</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>426</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>39</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>443</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>12</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>repairs</td>\n",
       "      <td>719</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>41</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>3 or more</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>211</td>\n",
       "      <td>female/male</td>\n",
       "      <td>female (non-single) or male (single)</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>36</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>3835</td>\n",
       "      <td>.. &gt;= 1000 DM</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>45</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 24 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    ID          sex                       personal_status  \\\n",
       "0  564         male                       married/widowed   \n",
       "1  484         male                       married/widowed   \n",
       "2   26       female                                single   \n",
       "3  443         male                       married/widowed   \n",
       "4  211  female/male  female (non-single) or male (single)   \n",
       "\n",
       "  checking_account_status  credit_duration_months  \\\n",
       "0              ... < 0 DM                      24   \n",
       "1           ... >= 200 DM                      12   \n",
       "2           ... >= 200 DM                       6   \n",
       "3           ... >= 200 DM                      12   \n",
       "4           ... >= 200 DM                      36   \n",
       "\n",
       "                                credit_history       credit_purpose  \\\n",
       "0     existing credits paid back duly till now           retraining   \n",
       "1      all credits at this bank paid back duly               others   \n",
       "2              delay in paying off in the past  furniture/equipment   \n",
       "3  no credits taken/all credits paid back duly              repairs   \n",
       "4  no credits taken/all credits paid back duly  furniture/equipment   \n",
       "\n",
       "   credit_amount                     savings employed_since_years  ...  \\\n",
       "0           4712               .. >= 1000 DM   1 <= ... < 4 years  ...   \n",
       "1           1163         100 <= ... < 500 DM   1 <= ... < 4 years  ...   \n",
       "2            426  unknown/no savings account        .. >= 7 years  ...   \n",
       "3            719  unknown/no savings account        .. >= 7 years  ...   \n",
       "4           3835               .. >= 1000 DM        .. >= 7 years  ...   \n",
       "\n",
       "  age_years other_installment_plans housing num_existing_credits  \\\n",
       "0        37                    bank    rent                  2-3   \n",
       "1        44                    none    rent                    1   \n",
       "2        39                    none    rent                    1   \n",
       "3        41                    bank    rent                    1   \n",
       "4        45                    none    rent                    1   \n",
       "\n",
       "                                          job_status num_people_liable_for  \\\n",
       "0  management/self-employed/highly qualified empl...                0 to 2   \n",
       "1                          skilled employee/official                0 to 2   \n",
       "2                               unskilled - resident                0 to 2   \n",
       "3                               unskilled - resident             3 or more   \n",
       "4                               unskilled - resident                0 to 2   \n",
       "\n",
       "  telephone foreign_worker credit_risk age_groups  \n",
       "0      True          False           0          1  \n",
       "1      True          False           0          1  \n",
       "2     False          False           0          1  \n",
       "3     False          False           1          1  \n",
       "4      True          False           0          1  \n",
       "\n",
       "[5 rows x 24 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "training_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>sex</th>\n",
       "      <th>personal_status</th>\n",
       "      <th>checking_account_status</th>\n",
       "      <th>credit_duration_months</th>\n",
       "      <th>credit_history</th>\n",
       "      <th>credit_purpose</th>\n",
       "      <th>credit_amount</th>\n",
       "      <th>savings</th>\n",
       "      <th>employed_since_years</th>\n",
       "      <th>...</th>\n",
       "      <th>property</th>\n",
       "      <th>age_years</th>\n",
       "      <th>other_installment_plans</th>\n",
       "      <th>housing</th>\n",
       "      <th>num_existing_credits</th>\n",
       "      <th>job_status</th>\n",
       "      <th>num_people_liable_for</th>\n",
       "      <th>telephone</th>\n",
       "      <th>foreign_worker</th>\n",
       "      <th>age_groups</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>963</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>24</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>2397</td>\n",
       "      <td>100 &lt;= ... &lt; 500 DM</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>building society savings agreement/life insurance</td>\n",
       "      <td>35</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>611</td>\n",
       "      <td>female/male</td>\n",
       "      <td>female (non-single) or male (single)</td>\n",
       "      <td>0 &lt;= ... &lt; 200 DM</td>\n",
       "      <td>10</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>others</td>\n",
       "      <td>1240</td>\n",
       "      <td>... &lt; 100 DM</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>real estate</td>\n",
       "      <td>48</td>\n",
       "      <td>none</td>\n",
       "      <td>own</td>\n",
       "      <td>1</td>\n",
       "      <td>unskilled - resident</td>\n",
       "      <td>3 or more</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>106</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>18</td>\n",
       "      <td>critical account/other credits existing (not a...</td>\n",
       "      <td>others</td>\n",
       "      <td>6458</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>real estate</td>\n",
       "      <td>39</td>\n",
       "      <td>bank</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>management/self-employed/highly qualified empl...</td>\n",
       "      <td>3 or more</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>891</td>\n",
       "      <td>male</td>\n",
       "      <td>married/widowed</td>\n",
       "      <td>... &gt;= 200 DM</td>\n",
       "      <td>15</td>\n",
       "      <td>all credits at this bank paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>1829</td>\n",
       "      <td>unknown/no savings account</td>\n",
       "      <td>.. &gt;= 7 years</td>\n",
       "      <td>...</td>\n",
       "      <td>building society savings agreement/life insurance</td>\n",
       "      <td>46</td>\n",
       "      <td>none</td>\n",
       "      <td>rent</td>\n",
       "      <td>2-3</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>342</td>\n",
       "      <td>female</td>\n",
       "      <td>single</td>\n",
       "      <td>... &lt; 0 DM</td>\n",
       "      <td>18</td>\n",
       "      <td>no credits taken/all credits paid back duly</td>\n",
       "      <td>furniture/equipment</td>\n",
       "      <td>3213</td>\n",
       "      <td>100 &lt;= ... &lt; 500 DM</td>\n",
       "      <td>... &lt; 1 year</td>\n",
       "      <td>...</td>\n",
       "      <td>unknown/no property</td>\n",
       "      <td>25</td>\n",
       "      <td>none</td>\n",
       "      <td>for free</td>\n",
       "      <td>1</td>\n",
       "      <td>skilled employee/official</td>\n",
       "      <td>0 to 2</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    ID          sex                       personal_status  \\\n",
       "0  963         male                       married/widowed   \n",
       "1  611  female/male  female (non-single) or male (single)   \n",
       "2  106         male                       married/widowed   \n",
       "3  891         male                       married/widowed   \n",
       "4  342       female                                single   \n",
       "\n",
       "  checking_account_status  credit_duration_months  \\\n",
       "0           ... >= 200 DM                      24   \n",
       "1       0 <= ... < 200 DM                      10   \n",
       "2           ... >= 200 DM                      18   \n",
       "3           ... >= 200 DM                      15   \n",
       "4              ... < 0 DM                      18   \n",
       "\n",
       "                                      credit_history       credit_purpose  \\\n",
       "0        no credits taken/all credits paid back duly  furniture/equipment   \n",
       "1        no credits taken/all credits paid back duly               others   \n",
       "2  critical account/other credits existing (not a...               others   \n",
       "3            all credits at this bank paid back duly  furniture/equipment   \n",
       "4        no credits taken/all credits paid back duly  furniture/equipment   \n",
       "\n",
       "   credit_amount                     savings employed_since_years  ...  \\\n",
       "0           2397         100 <= ... < 500 DM        .. >= 7 years  ...   \n",
       "1           1240                ... < 100 DM        .. >= 7 years  ...   \n",
       "2           6458  unknown/no savings account        .. >= 7 years  ...   \n",
       "3           1829  unknown/no savings account        .. >= 7 years  ...   \n",
       "4           3213         100 <= ... < 500 DM         ... < 1 year  ...   \n",
       "\n",
       "                                            property age_years  \\\n",
       "0  building society savings agreement/life insurance        35   \n",
       "1                                        real estate        48   \n",
       "2                                        real estate        39   \n",
       "3  building society savings agreement/life insurance        46   \n",
       "4                                unknown/no property        25   \n",
       "\n",
       "  other_installment_plans   housing  num_existing_credits  \\\n",
       "0                    bank      rent                   2-3   \n",
       "1                    none       own                     1   \n",
       "2                    bank      rent                   2-3   \n",
       "3                    none      rent                   2-3   \n",
       "4                    none  for free                     1   \n",
       "\n",
       "                                          job_status num_people_liable_for  \\\n",
       "0                          skilled employee/official                0 to 2   \n",
       "1                               unskilled - resident             3 or more   \n",
       "2  management/self-employed/highly qualified empl...             3 or more   \n",
       "3                          skilled employee/official                0 to 2   \n",
       "4                          skilled employee/official                0 to 2   \n",
       "\n",
       "  telephone foreign_worker age_groups  \n",
       "0      True          False          1  \n",
       "1     False          False          1  \n",
       "2      True          False          1  \n",
       "3      True          False          1  \n",
       "4     False          False          1  \n",
       "\n",
       "[5 rows x 23 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Implement more EDA here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 800 entries, 0 to 799\n",
      "Data columns (total 24 columns):\n",
      " #   Column                    Non-Null Count  Dtype \n",
      "---  ------                    --------------  ----- \n",
      " 0   ID                        800 non-null    int64 \n",
      " 1   sex                       800 non-null    object\n",
      " 2   personal_status           800 non-null    object\n",
      " 3   checking_account_status   800 non-null    object\n",
      " 4   credit_duration_months    800 non-null    int64 \n",
      " 5   credit_history            800 non-null    object\n",
      " 6   credit_purpose            800 non-null    object\n",
      " 7   credit_amount             800 non-null    int64 \n",
      " 8   savings                   800 non-null    object\n",
      " 9   employed_since_years      800 non-null    object\n",
      " 10  installment_rate          800 non-null    object\n",
      " 11  other_debtors_guarantors  800 non-null    object\n",
      " 12  present_residence_since   800 non-null    object\n",
      " 13  property                  800 non-null    object\n",
      " 14  age_years                 800 non-null    int64 \n",
      " 15  other_installment_plans   800 non-null    object\n",
      " 16  housing                   800 non-null    object\n",
      " 17  num_existing_credits      800 non-null    object\n",
      " 18  job_status                800 non-null    object\n",
      " 19  num_people_liable_for     800 non-null    object\n",
      " 20  telephone                 800 non-null    bool  \n",
      " 21  foreign_worker            800 non-null    bool  \n",
      " 22  credit_risk               800 non-null    int64 \n",
      " 23  age_groups                800 non-null    int64 \n",
      "dtypes: bool(2), int64(6), object(16)\n",
      "memory usage: 139.2+ KB\n"
     ]
    }
   ],
   "source": [
    "training_data.info()"
   ]
  },
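  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the model should be fair across age groups, one further check worth considering is how the label is distributed within each group. The snippet below is a minimal sketch on a made-up toy frame (`toy_df` and all of its values are illustrative assumptions, not the project data); the same `groupby` call could be applied to `training_data`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for training_data (values are made up for illustration)\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        'age_groups': [0, 0, 0, 1, 1, 1, 1, 1],\n",
    "        'credit_risk': [1, 1, 0, 0, 1, 0, 0, 0],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Row count and share of rows labeled 1, per age group\n",
    "summary = toy_df.groupby('age_groups')['credit_risk'].agg(['count', 'mean'])\n",
    "print(summary)\n",
    "```\n",
    "\n",
    "If the `mean` column differs strongly between the groups, the raw labels already correlate with age, which is worth keeping in mind when evaluating the trained model."
   ]
  },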
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 <a name=\"22\">Select features to build the model</a> \n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "For a quick start, we recommend using only a few of the numerical and categorical features. However, feel free to explore other fields. In this case, we do not need to cast our features to numerical or object types. Be mindful of some of the feature names: they suggest numerical values, but upon inspection it becomes clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).\n"
   ]
  },
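  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why a name like `employed_since_years` can be misleading, it may help to look at the stored values directly. The following is a minimal sketch on a toy frame: `toy_df`, `categorical_candidates` and `numerical_candidates` are hypothetical names, the category strings are copied from the preview above, and the pairing of rows is made up.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for two columns of training_data\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        'employed_since_years': [\n",
    "            '1 <= ... < 4 years',\n",
    "            '.. >= 7 years',\n",
    "            '.. >= 7 years',\n",
    "            '... < 1 year',\n",
    "        ],\n",
    "        'credit_amount': [4712, 1163, 426, 719],\n",
    "    }\n",
    ")\n",
    "\n",
    "# The binned column holds text labels, not numbers\n",
    "print(toy_df['employed_since_years'].value_counts())\n",
    "\n",
    "# Split columns by dtype as a first guess at feature types\n",
    "numerical_candidates = toy_df.select_dtypes(include='number').columns.tolist()\n",
    "categorical_candidates = toy_df.select_dtypes(exclude='number').columns.tolist()\n",
    "print('numerical:', numerical_candidates)\n",
    "print('categorical:', categorical_candidates)\n",
    "```\n",
    "\n",
    "The dtype is only a first guess: `age_groups` is stored as an integer in this dataset, yet it is used as a categorical feature in this notebook."
   ]
  },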
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Grab model features/inputs and target/output\n",
    "categorical_features = [\"job_status\", \"employed_since_years\", \"savings\", \"age_groups\"]\n",
    "\n",
    "numerical_features = [\"credit_amount\", \"credit_duration_months\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Separate features and the model target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Model features:  ['job_status', 'employed_since_years', 'savings', 'age_groups', 'credit_amount', 'credit_duration_months']\n",
      "Model target:  credit_risk\n"
     ]
    }
   ],
   "source": [
    "model_target = \"credit_risk\"\n",
    "model_features = categorical_features + numerical_features\n",
    "\n",
    "print(\"Model features: \", model_features)\n",
    "print(\"Model target: \", model_target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 <a name=\"23\">Train - Validation Datasets</a>\n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. \n",
    "\n",
    "To produce a validation set for evaluating model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. The validation data you create here will be used later in Section 3 to tune your classifier."
   ]
  },
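  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of such a split, here is a minimal sketch on a toy frame. The names `toy_df`, `train_part` and `val_part`, the 80/20 ratio, the `random_state` value and the use of `stratify` are all assumptions chosen for the example, not settings prescribed by this project.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Toy stand-in for training_data: 20 rows with a balanced binary label\n",
    "toy_df = pd.DataFrame(\n",
    "    {\n",
    "        'credit_amount': range(100, 2100, 100),\n",
    "        'credit_risk': [0, 1] * 10,\n",
    "    }\n",
    ")\n",
    "\n",
    "# Hold out 20% of the rows for validation; stratify keeps the label ratio\n",
    "train_part, val_part = train_test_split(\n",
    "    toy_df,\n",
    "    test_size=0.2,\n",
    "    shuffle=True,\n",
    "    random_state=23,\n",
    "    stratify=toy_df['credit_risk'],\n",
    ")\n",
    "\n",
    "print('Train shape:', train_part.shape)\n",
    "print('Validation shape:', val_part.shape)\n",
    "```\n",
    "\n",
    "Fixing `random_state` makes the split reproducible across runs, and stratifying on the label can help keep the class balance similar in both subsets when the dataset is small."
   ]
  },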
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Implement here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<seaborn.axisgrid.FacetGrid at 0x7fbeed0e5ca0>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjkAAAHpCAYAAABgPS3fAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuNCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8QVMy6AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAt3klEQVR4nO3de1xU9b7/8fcMVw0kuaioHW8kooBAnodCFCe3u21lp9CHXTZlmaWVhpZ53V6gCKiskI5H9y51a9HWivR08dTZnVMd3VFayTYvlYhlylYGNG+YXGZ+f3ScXxOmgugavryej4ePh7PWmjWfxUPg5aw1MzaXy+USAACAYexWDwAAAHAhEDkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMBKRAwAAjORr9QDeZP/+/eK9EQEAF0pkZKTVI7QpPJMDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMJKlkfP8888rOjra48/w4cPd60+ePKns7GwNHjxYiYmJeuihh1RVVeWxj4qKCo0fP14DBw5UcnKynnzySdXX11/sQwEAAF7G8nc8vvzyy7V8+XL3bR8fH/ffc3Nz9dFHH6mgoEDBwcF6/PHHNWnSJK1atUqS1NDQoAkTJig8PFyrVq1SZWWlZsyYIT8/Pz3yyCMX/VgAAID3sDxyfHx8FBER0Wj50aNHVVxcrAULFig5OVnST9Fz/fXXq7S0VAkJCdqwYYPKysq0fPlyhYeHKyYmRpMnT9aCBQs0adIk+fv7N2kWm83WIscEAACsZ3nkfPfdd0pNTVVAQIASEhI0depUde3aVVu3blVdXZ1SUlLc2/bp00ddu3Z1R05paan69u2r8PBw9zapqanKyspSWVmZ+vfv36RZOnfu3GLHBQAArGVp5MTHxysvL0+9evWSw+HQokWLlJGRobfeektVVVXy8/NThw4dPO4TFhYmh8MhSaqqqvIIHEnu26e2aYoDBw7wAZ0AgAumS5cuVo/QplgaOWlpae6/9+vXTwMHDtQ111yj//zP/1RgYOBFn8flchE5AAAYwqteQt6hQwf17NlTe/bsUXh4uOrq6nTkyBGPbaqrq93X8ISHhzd6tdWp26e7zgcAALQdXhU5x48f1/fff6+IiAjFxsbKz89PJSUl7vXl5eWqqKhQQkKCJCkhIUHffPONqqur3dt8/PHHCgoKUlRU1MUeHwAAeBFLT1c9+eSTuuaaa9S1a1dVVlbq+eefl91u14gRIxQcHKxRo0YpPz9fISEhCgoKUk5OjhITE92Rk5qaqqioKE2fPl3Tpk2Tw+FQQUGBMjIymvzKKgAAYBZLI2f//v165JFH9MMPPyg0NFRXXHGFXn31VYWGhkqSZs+eLbvdrszMTNXW1io1NVXz589339/Hx0dLlixRVlaWbr31VrVr107p6enKzMy06pAAAICXsLm40tZt//79XHgMWMhut8tu96qz6K2W0+mU0+m0egz8QmRkpNUjtCmWv08OAEg/BU5YeIR8fYicllDf4FR1lYPQQZtG5ADwCna7Xb4+ds15Zb12Vx62epxWrVenEOX8/irZ7XYiB20akQPAq+yuPKyv9h20egwABuB5YQAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcA
ABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRvCZy/vSnPyk6OlpPPPGEe9nJkyeVnZ2twYMHKzExUQ899JCqqqo87ldRUaHx48dr4MCBSk5O1pNPPqn6+vqLPT4AAPAyXhE5W7Zs0apVqxQdHe2xPDc3Vx988IEKCgr00ksvqbKyUpMmTXKvb2ho0IQJE1RXV6dVq1YpPz9fa9asUWFh4cU+BAAA4GUsj5zjx49r2rRpysnJUUhIiHv50aNHVVxcrJkzZyo5OVmxsbHKzc3V5s2bVVpaKknasGGDysrK9PTTTysmJkZpaWmaPHmyioqKVFtba9ERAQAAb+Br9QCPPfaY0tLSlJKSosWLF7uXb926VXV1dUpJSXEv69Onj7p27arS0lIlJCSotLRUffv2VXh4uHub1NRUZWVlqaysTP3792/SLDab7fwPCECz8P3X8mw2G19XtGmWRs4777yj7du36/XXX2+0rqqqSn5+furQoYPH8rCwMDkcDvc2Pw8cSe7bp7Zpis6dOzf5PgDgrX758xFoayyLnH/84x964okntGzZMgUEBFg1hocDBw7I5XJZPQbQJvn6+vJLuYVVVVXxQgwv06VLF6tHaFMsi5xt27apurpaI0eOdC9raGjQpk2bVFRUpKVLl6qurk5HjhzxeDanurpaERERkn76X8qWLVs89nvq1VentmkKl8tF5AAW4Xuv5fEzDW2dZZEzZMgQvfXWWx7LZs2apd69e+u+++5TZGSk/Pz8VFJSot/97neSpPLyclVUVCghIUGSlJCQoCVLlqi6ulphYWGSpI8//lhBQUGKioq6qMcDAAC8i2WRExQUpL59+3osa9++vS699FL38lGjRik/P18hISEKCgpSTk6OEhMT3ZGTmpqqqKgoTZ8+XdOmTZPD4VBBQYEyMjLk7+9/sQ8JAAB4EctfXXUms2fPlt1uV2Zmpmpra5Wamqr58+e71/v4+GjJkiXKysrSrbfeqnbt2ik9PV2ZmZkWTg0AALyBzcUJW7f9+/dz/hqwiK+vryIiIpRR8La+2nfQ6nFatX7dQlU0ZYQcDgcXHnuZyMhIq0doUyx/M0AAAIALgcgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJEsj55VXXtGNN96opKQkJSUl6dZbb9VHH33kXn/y5EllZ2dr8ODBSkxM1EMPPaSqqiqPfVRUVGj8+PEaOHCgkpOT9eSTT6q+vv5iHwoAAPAylkZOly5d9Oijj+qNN95QcXGxhgwZookTJ2rnzp2SpNzcXH3wwQcqKCjQSy+9pMrKSk2aNMl9/4aGBk2YMEF1dXVatWqV8vPztWbNGhUWFlp1SAAAwEtY
GjlDhw5VWlqaevbsqV69eunhhx9W+/btVVpaqqNHj6q4uFgzZ85UcnKyYmNjlZubq82bN6u0tFSStGHDBpWVlenpp59WTEyM0tLSNHnyZBUVFam2ttbKQwMAABbztXqAUxoaGvTuu++qpqZGiYmJ2rp1q+rq6pSSkuLepk+fPuratatKS0uVkJCg0tJS9e3bV+Hh4e5tUlNTlZWVpbKyMvXv379JM9hsthY7HgBNw/dfy7PZbHxd0aZZHjlff/21brvtNp08eVLt27fXokWLFBUVpR07dsjPz08dOnTw2D4sLEwOh0OSVFVV5RE4kty3T23TFJ07d27mUQCA9/nlz0egrbE8cnr16qW1a9fq6NGjeu+99zRjxgy9/PLLlsxy4MABuVwuSx4baOt8fX35pdzCqqqqeCGGl+nSpYvVI7QplkeOv7+/evToIUmKjY3Vl19+qZUrV+q6665TXV2djhw54vFsTnV1tSIiIiT99L+ULVu2eOzv1KuvTm3TFC6Xi8gBLML3XsvjZxraOq97nxyn06na2lrFxsbKz89PJSUl7nXl5eWqqKhQQkKCJCkhIUHffPONqqur3dt8/PHHCgoKUlRU1MUeHQAAeBFLn8l55plndPXVVysyMlLHjx/X22+/rY0bN2rp0qUKDg7WqFGjlJ+fr5CQEAUFBSknJ0eJiYnuyElNTVVUVJSmT5+uadOmyeFwqKCgQBkZGfL397fy0AAAgMUsjZzq6mrNmDFDlZWVCg4OVnR0tJYuXaorr7xSkjR79mzZ7XZlZmaqtrZWqampmj9/vvv+Pj4+WrJkibKysnTrrbeqXbt2Sk9PV2ZmplWHBAAAvITNxQlbt/3793P+GrCIr6+vIiIilFHwtr7ad9DqcVq1ft1CVTRlhBwOBxcee5nIyEirR2hTvO6aHAAAgJZA5AAAACMROQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADBSsyJnzJgxOnLkSKPlx44d05gxY857KAAAgPPVrMjZuHGj6urqGi0/efKkPv/88/MeCgAA4Hz5NmXjr776yv33srIyORwO922n06n169erc+fOLTcdAABAMzUpcm6++WbZbDbZbDbdddddjdYHBgZqzpw5LTYcAABAczUpcv77v/9bLpdLw4YN02uvvabQ0FD3Oj8/P4WFhcnHx6fFhwQAAGiqJkVOt27dJHmetgIAAPBGTYqcn/v222/16aefqrq6Wk6n02PdpEmTznswAACA89GsyHn11VeVlZWljh07Kjw8XDabzb3OZrMROQAAwHLNipzFixdrypQpGj9+fEvPAwAA0CKa9T45hw8f1nXXXdfSswAAALSYZkXO8OHDtWHDhpaeBQAAoMU0K3J69OihhQsXaubMmVq2bJlWrlzp8QcAAHi3vXv3Kjo6Wjt27JAkffrpp4qOjj7txza15OO01LbnolnX5KxevVrt27fXxo0btXHjRo91NpuNz68CAKCVSUxM1IYNGxQcHCxJeuONN5Sbm6vPPvvsvPYbGRmpDRs2qGPHji0xZpM0K3L+53/+p6XnAAAAzVBXVyc/P7/z3o+/v78iIiJaYKL/r7a29oLs91w163QVAAC4cJxOp1544QX99re/VWxsrP7lX/5Fixcvdp/OWbdune644w7FxcXprbfekiS99tpruu666xQXF6fhw4erqKjIY59btmzRzTffrLi4OI0cObLRKaGfn6769NNPNWvWLB09elTR0dGKjo7W888/f9a5hw4dqkWLFmn69OlKSkrSvHnzGp2COnz4sKZOnaohQ4YoPj5e1157rYqLi0+7v4aGBs2aNUvDhw9XRUVFk7+OzXomZ9asWWdcn5eX15zdAgAASc8884xee+01zZo1S1dccYUqKyu1e/du9/oFCxZo5syZiomJUUBAgN58800tXLhQ8+bNU0xMjHbs2KG5c+eqffv2Sk9P1/HjxzVh
wgSlpKTo6aef1t69e/XEE0/86uMnJiZq9uzZKiws1LvvvitJat++/TnNvmzZMk2cOPFX3zNv4cKF2rVrl1544QV17NhRe/bs0Y8//thou9raWj3yyCPat2+fXnnlFY+PkjpXzYqcX16UVF9fr507d+rIkSMaMmRIc3YJAAAkHTt2TCtXrtS8efOUnp4uSfqnf/onDRo0SHv37pUk3XXXXbr22mvd93n++ec1c+ZM97LLLrtMZWVlWr16tdLT0/X222/L6XQqNzdXAQEBuvzyy7V//35lZWWddgZ/f38FBwfLZrM1+VTTkCFDdM8997hvn5r5lIqKCsXExCguLk6S1L1790b7OH78uMaPH6/a2lqtXLnSfZ1QUzUrchYtWtRomdPpVFZWli677LJmDQIAAKTy8nLV1tae8UmD2NhY999ramq0Z88e/eEPf9DcuXPdy+vr691xsGvXLkVHRysgIMC9PjEx8QJM7znb6dx+++3KzMzU9u3bdeWVV2rYsGFKSkry2Gbq1Knq0qWLVqxYocDAwGbP0uzPrvolu92uu+++W2PGjNF9993XUrsFAKBN+XmI/JqfnzqqqamRJD3++OMaOHCgx3Z2+8W/9LZdu3ZnXJ+WlqYPPvhAH330kf72t7/p7rvvVkZGhmbMmOGxzZtvvqnNmzcrOTm52bO06NF///33qq+vb8ldAgDQpvTs2VOBgYH65JNPzmn78PBwderUSd9//7169Ojh8efU2ZU+ffro66+/1smTJ933Ky0tPeN+/fz81NDQ0OzjOJPQ0FClp6drwYIFmj17tlavXu2x/vbbb9fUqVP14IMPNnqrmqZo1jM5v7yw2OVyyeFw6MMPP3SfPwQAAE0XEBCg++67T08//bT8/PyUlJSkgwcPaufOnb/6rEZmZqZycnIUHBysq666SrW1tdq6dauOHDmisWPHasSIEXruuec0Z84cTZgwQfv27dOyZcvOOEe3bt1UU1OjkpISRUdHq127dmd9luZcLFy4UAMGDNDll1+u2tpaffjhh+rTp0+j7e688041NDRowoQJeuGFFzRo0KAmP1azImf79u0et+12u0JDQzVz5kyNGjWqObsEAAD/58EHH5SPj48KCwtVWVmpiIgI3Xbbbb+6/ejRoxUYGKilS5fqqaeeUvv27dW3b1/dddddkqRLLrlES5Ys0fz583XzzTcrKipKjz76qB566KFf3WdSUpJuu+02TZkyRT/88IMmTZp0xu3PlZ+fn5599lnt27dPgYGBuuKKK/Tss8+edtu7775bLpdL48eP14svvtjo2p2zsblcLtd5T2yI/fv3iy8HYA1fX19FREQoo+BtfbXvoNXjtGr9uoWqaMoIORwOLiHwMpGRkVaP0Kac14XHBw8eVHl5uSSpd+/ezXoNOwAAwIXQrMipqanR448/rv/4j/+Q0+mUJPn4+Oimm27S3LlzW+ScHQAA8B6fffbZGV89vXnz5os4zblpVuTk5+dr06ZNWrx4sa644gpJ0ueff66cnBzl5+crOzu7RYcEAADWio2N1dq1a60eo0maFTnvvfeeCgsLNXjwYPeytLQ0BQQEaMqUKUQOAACGCQwMVI8ePaweo0ma9T45P/74o8LDwxstDwsLO+3nTwAAAFxszYqchIQEFRYWeryp0I8//qh/+7d/U0JCQkvNBgAA0GzNOl01e/Zs3Xvvvbr66qvVr18/SdJXX30lf3//s765EAAAwMXQrMiJjo7Wf/3Xf+mtt95yv4R8xIgRuvHGG8/rg7QAAABaSrMi549//KPCwsJ0yy23eCx//fXXdfDgQY0fP75FhgMAAGiuZl2Ts3r1avXu3bvR8ssvv1yrVq0676EAAMDZNfzfe9WZ+njnq1nP5DgcDkVERDRaHhoaKofDcd5DAQCAs/Ox2zXnlfXaXXn4gj9Wr04hyvn9Vc26b1FRkZYuXSqHw6F+/fpp7ty5io+Pb+EJG2tW5ERGRuqLL75wf4T7KZ9//rk6
derUIoMBAICz21152Ks/723dunXKy8tTdna2Bg4cqBUrVmjcuHF69913FRYWdkEfu1mnq0aPHq3c3FwVFxdr37592rdvn15//XXl5eU1uk4HAAC0XcuXL9ctt9yiUaNGKSoqStnZ2QoMDFRxcfEFf+xmPZNz77336ocfflB2drbq6uokSQEBAbr33ns1YcKEFh0QAAC0TrW1tdq2bZtHG9jtdqWkpFyUz7pqVuTYbDZNmzZNDz74oHbt2qXAwED17NlT/v7+LT0fAABopQ4dOqSGhoZGp6XCwsLcb0FzITUrck655JJLLsqFQwAAAE3VrGtyAAAAzqZjx47y8fFRdXW1x/Lq6urTfgZmSyNyAADABeHv768BAwaopKTEvczpdKqkpESJiYkX/PHP63QVAACwVq9OIV79OGPHjtWMGTMUGxur+Ph4rVixQidOnNDIkSNbeMLGiBwAAFqpBqez2W/Q19zH87E37STQ9ddfr4MHD6qwsFAOh0MxMTF68cUXL8rpKiIHAIBWqqnBYdXj3XHHHbrjjjtaeJqz45ocAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwCAVsrlbDD68c4XH+sAAEArZbP7qOqNmaqrKr/gj+UX3lvhI/ObdJ9NmzZp6dKl2rp1qxwOhxYtWqRhw4ZdoAkbI3IAAGjF6qrKVbd/h9VjnFZNTY2io6M1atQoTZo06aI/PpEDAAAuiLS0NKWlpVn2+FyTAwAAjETkAAAAIxE5AADASJZGzh//+EeNGjVKiYmJSk5O1oMPPqjycs8rxE+ePKns7GwNHjxYiYmJeuihh1RVVeWxTUVFhcaPH6+BAwcqOTlZTz75pOrr6y/moQAAAC9jaeRs3LhRGRkZevXVV7V8+XLV19dr3LhxqqmpcW+Tm5urDz74QAUFBXrppZdUWVnpcYV2Q0ODJkyYoLq6Oq1atUr5+flas2aNCgsLrTgkAADgJSx9ddXSpUs9bufn5ys5OVnbtm3TP//zP+vo0aMqLi7WggULlJycLOmn6Ln++utVWlqqhIQEbdiwQWVlZVq+fLnCw8MVExOjyZMna8GCBZo0aZL8/f2tODQAAC4Kv/DeXvs4x48f1549e9y39+7dqx07digkJERdu3ZtyfFOy6teQn706FFJUkhIiCRp69atqqurU0pKinubPn36qGvXru7IKS0tVd++fRUeHu7eJjU1VVlZWSorK1P//v3P+fFtNlsLHQmApuL7r+XZbDa+roZzORua/AZ95/t4NrvPOW+/detWjRkzxn07Ly9PkpSenq78/As/t9dEjtPpVG5urpKSktS3b19JUlVVlfz8/NShQwePbcPCwuRwONzb/DxwJLlvn9rmXHXu3Lm54wOA1/nlz0aYpynBYcXjDR48WF9//fUFmubsvCZysrOztXPnTr3yyiuWzXDgwAG5XC7LHh9oy3x9ffml3MKqqqp4EYaX6dKli9UjtCleETmPPfaYPvzwQ7388sse/wDCw8NVV1enI0eOeDybU11drYiICPc2W7Zs8djfqVdfndrmXLlcLiIHsAjfey2Pn2lo6yx9dZXL5dJjjz2mv/71r1qxYoUuu+wyj/WxsbHy8/NTSUmJe1l5ebkqKiqUkJAgSUpISNA333yj6upq9zYff/yxgoKCFBUVdVGOAwAAeB9Ln8nJzs7W22+/rX//93/XJZdc4r6GJjg4WIGBgQoODtaoUaOUn5+vkJAQBQUFKScnR4mJie7ISU1NVVRUlKZPn65p06bJ4XCooKBAGRkZvLIKAIA2zNLI+ctf/iJJuvPOOz2W5+XlaeTIkZKk2bNny263KzMzU7W1tUpNTdX8+fPd2/r4+GjJkiXKysrSrbfeqnbt2ik9PV2ZmZkX70AAAIDXsbk4Yeu2f/9+zl8DFvH19VVERIQyCt7WV/sOWj1Oq9avW6iKpozQoUOHuPD4PDmdTjmdzhbb
X2RkZIvtC2fnFRceAwBaTlhwoFzOBnXs2NHqUVo9Z0O9HFXVLRo6uHiIHAAwTHCgv2x2H1W9MVN1VeVnvwNOyy+8t8JH5stutxM5rRSRAwCGqqsqV93+HVaPAVjG0peQAwAAXChEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIlkbOpk2bdP/99ys1NVXR0dF6//33Pda7XC4tXLhQqampio+P1913361vv/3WY5sffvhBU6dOVVJSkgYNGqTZs2fr+PHjF/EoAACAN7I0cmpqahQdHa358+efdv0LL7ygl156SVlZWXr11VfVrl07jRs3TidPnnRv8+ijj6qsrEzLly/XkiVL9Nlnn2nevHkX6xAAAICXsjRy0tLS9PDDD+u3v/1to3Uul0srV67UAw88oGHDhqlfv3566qmnVFlZ6X7GZ9euXVq/fr1ycnI0cOBADRo0SHPmzNE777yjAwcOXOzDAQAAXsTX6gF+zd69e+VwOJSSkuJeFhwcrIEDB2rz5s264YYbtHnzZnXo0EFxcXHubVJSUmS327Vly5bTxtOZ2Gy2FpsfQNPw/QdvZbPZ+PfZSnlt5DgcDklSWFiYx/KwsDBVVVVJkqqqqhQaGuqx3tfXVyEhIe77N0Xnzp2bOS0AwFTh4eFWj4Bm8trIscKBAwfkcrmsHgNok3x9ffllAq9UVVWl+vr6FtlXly5dWmQ/ODdeGzkRERGSpOrqanXq1Mm9vLq6Wv369ZP0U10fPHjQ43719fU6fPiw+/5N4XK5iBzAInzvwVvxu6H18tr3yenevbsiIiJUUlLiXnbs2DH9/e9/V2JioiQpMTFRR44c0datW93bfPLJJ3I6nYqPj7/oMwMAAO9h6TM5x48f1549e9y39+7dqx07digkJERdu3bVmDFjtHjxYvXo0UPdu3fXwoUL1alTJw0bNkyS1KdPH1111VWaO3eusrOzVVdXp8cff1w33HAD19cAANDGWRo5W7du1ZgxY9y38/LyJEnp6enKz8/XfffdpxMnTmjevHk6cuSIrrjiCr344osKCAhw32fBggV6/PHHddddd8lut+vaa6/VnDlzLvqxAAAA72Jp5AwePFhff/31r6632WyaPHmyJk+e/KvbXHrppXrmmWcuxHgAAKAV89prcgAAAM4HkQMAAIxE5AAAACMROQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAAACMROQAAwEhEDgAAMJKv1QOYym63y26nIVuC0+mU0+m0egwAQCtD5FwAdrtdYeER8vUhclpCfYNT1VUOQgcA0CREzgVgt9vl62PXnFfWa3flYavHadV6dQpRzu+vkt1uJ3IAAE1C5FxAuysP66t9B60eAwCANonzKQAAwEhEDgAAMBKRAwAAjETkAAAAIxE5AADASEQOAAAwEpEDAACMROQAAAAjETkAAMBIRA4AADASkQMAAIxE5AAA
ACMROQAAwEh8CjlaBV9f/qmeL6fTKafTafUYAHDR8JsDXi0sOFAuZ4M6duxo9SitnrOhXo6qakIHQJtB5MCrBQf6y2b3UdUbM1VXVW71OK2WX3hvhY/Ml91uJ3IAtBlEDlqFuqpy1e3fYfUYAIBWhAuPAQCAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiBwAAGInIAQAARiJyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCRjIqeoqEhDhw5VXFycRo8erS1btlg9EgAAsJARkbNu3Trl5eVp4sSJWrNmjfr166dx48apurra6tEAAIBFjIic5cuX65ZbbtGoUaMUFRWl7OxsBQYGqri42OrRAACARXytHuB81dbWatu2bZowYYJ7md1uV0pKijZv3tykfdntdrlcrvOeyW7/qR37dQ1VO/9W/yW2VI9OHSRJ/l1iZPNrZ/E0rZdfWE9JP/3bPPXv09vwfdNy+L5pGa3h+wZnZnO1xG91Cx04cEBXX321Vq1apcTERPfyp556Sps2bdJrr71m4XQAAMAqpCkAADBSq4+cjh07ysfHp9FFxtXV1QoPD7doKgAAYLVWHzn+/v4aMGCASkpK3MucTqdKSko8Tl8BAIC2xYir+8aOHasZM2YoNjZW8fHxWrFihU6cOKGRI0daPRoAALCIEZFz/fXX6+DBgyosLJTD4VBMTIxefPFFTlcBANCGtfpXVwEAAJxOq78mBwAA4HSIHAAAYCQiBwAAGInIAQAARiJy4LWKioo0dOhQxcXFafTo0dqyZYvVIwFebdOmTbr//vuVmpqq6Ohovf/++1aPBFiKyIFXWrdunfLy8jRx4kStWbNG/fr107hx4xq9szWA/6+mpkbR0dGaP3++1aMAXoGXkMMrjR49WnFxcZo3b56kn97FOi0tTXfeeafGjx9v8XSA94uOjtaiRYs0bNgwq0cBLMMzOfA6tbW12rZtm1JSUtzL7Ha7UlJStHnzZgsnAwC0JkQOvM6hQ4fU0NCgsLAwj+VhYWGqqqqyaCoAQGtD5AAAACMROfA6HTt2lI+PT6OLjKurq/k8MgDAOSNy4HX8/f01YMAAlZSUuJc5nU6VlJQoMTHRwskAAK2JEZ9CDvOMHTtWM2bMUGxsrOLj47VixQqdOHFCI0eOtHo0wGsdP35ce/bscd/eu3evduzYoZCQEHXt2tXCyQBr8BJyeK2XX35ZS5culcPhUExMjObMmaOBAwdaPRbgtT799FONGTOm0fL09HTl5+dbMBFgLSIHAAAYiWtyAACAkYgcAABgJCIHAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJGIHAAAYCQiB0CLq62ttXoEACByAG/3v//7v7r99ts1aNAgDR48WBMmTPD4pOkvvvhCN910k+Li4jRy5Ei9//77io6O1o4dO9zbfPPNN7r33nuVmJiolJQUTZs2TQcPHjynxz927JimTp2qhIQEpaam6s9//rPuvPNOPfHEE+5thg4dqkWLFmn69OlKSkrSvHnzJEnvvfeebrjhBsXGxmro0KFatmyZx76jo6P1/vvveywbNGiQ3njjDUk/fYp2dHS03nnnHd12222Ki4vTiBEjtHHjRvf2hw8f1tSpUzVkyBDFx8fr2muvVXFx8Tl+dQGYjMgBvNyJEyc0duxYFRcX689//rNsNpsmTpwop9OpY8eO6YEHHlDfvn21Zs0aTZ48WU8//bTH/Y8cOaK77rpL/fv31+uvv64XX3xR1dXVmjJlyjk9fn5+vjZv3qzFixdr2bJl+uyzz7Rt27ZG2y1btkz9+vXT
2rVr9eCDD2rr1q2aMmWKrr/+er311luaNGmSFi5c6A6Ypnjqqac0duxYrV27VgkJCbr//vt16NAhSdLChQu1a9cuvfDCC1q3bp2ysrLUsWPHJj8GAPP4Wj0AgDP73e9+53E7NzdXycnJKisr0+effy5JysnJUUBAgKKiolRZWak5c+a4t3/55ZfVv39/PfLIIx77SEtL0+7du9WrV69ffexjx45p7dq1WrBggZKTkyVJeXl5uuqqqxptO2TIEN1zzz3u21OnTlVycrImTpwoSerVq5fKysq0dOlSjRw5sklfg4yMDPfXISsrS+vXr9frr7+u++67TxUVFYqJiVFcXJwkqXv37k3aNwBzETmAl/v2229VWFiov//97zp06JBcLpck6R//+Id2796t6OhoBQQEuLc/9cv+lK+++kqffvqpEhMTG+17z549Z4ycvXv3qq6uTvHx8e5lwcHBp71PbGysx+3y8nL95je/8ViWlJSklStXqqGhQT4+Pmc4ak8/n93X11exsbEqLy+XJN1+++3KzMzU9u3bdeWVV2rYsGFKSko6530DMBeRA3i5+++/X926dVNOTo46deokp9OpESNGqK6u7pzuX1NTo2uuuUaPPvpoo3UREREtNme7du2afB+bzeaOtlPq6+ubtI+0tDR98MEH+uijj/S3v/1Nd999tzIyMjRjxowmzwPALFyTA3ixQ4cOaffu3XrggQeUnJysPn366PDhw+71vXr10jfffOPxaqYvv/zSYx8DBgzQzp071a1bN/Xo0cPjT/v27c/4+N27d5efn5/HPo8ePapvv/32rLP37t1bX3zxhceyL774Qj179nQ/ixMaGqrKykr3+m+//VYnTpxotK/S0lL33+vr67Vt2zb17t3bvSw0NFTp6elasGCBZs+erdWrV591PgDmI3IALxYSEqJLL71Uq1ev1nfffaeSkhLl5+e71994441yuVyaO3eudu3apfXr17tfwWSz2SRJv//973X48GE98sgj2rJli/bs2aP169dr1qxZamhoOOPjBwUF6eabb9ZTTz2lTz75RDt37tQf/vAH2Ww29/5/zT333KOSkhItWrRIu3fv1po1a1RUVORx3c6QIUNUVFSk7du368svv9T8+fPl5+fXaF+vvPKK/vrXv2rXrl167LHHdPjwYY0aNUrSTxcev//++/ruu++0c+dOffjhh+rTp8+5fYEBGI3IAbyY3W7Xc889p23btmnEiBHKy8vT9OnT3euDgoK0ePFi7dixQzfddJOee+4594W+/v7+kqTOnTvrL3/5i5xOp8aNG6cbb7xRubm5Cg4Olt1+9h8BM2fOdL+iaezYsUpKSlKfPn08rgM6nQEDBqigoEDr1q3TjTfeqMLCQmVmZnpcdDxjxgxFRkYqIyNDjz76qO655x4FBgY22tfUqVP1pz/9STfddJM+//xzLV68WKGhoZIkPz8/Pfvss/rXf/1X3XHHHbLb7Xr22WfP/sUFYDyb65cnxAG0am+++aZmz56tzz777LTBcL5qamp09dVXa8aMGRo9enSL7//n9u7dq9/85jdau3atYmJiLuhjATAPFx4DrdzatWvVvXt3de7cWV9//bUWLFig4cOHt1jgbN++XeXl5YqPj9fRo0e1aNEiSWr0yikA8DZEDtDKORwOFRYWyuFwKCIiQsOHD9fDDz98TvetqKjQDTfc8Kvr33nnHUk/vdHf7t275efnpwEDBqioqMh9uggAvBWnq4A2rL6+Xvv27fvV9d26dZOvL/8XAtA6ETkAAMBIvLoKAAAYicgBAABGInIAAICRiBwAAGAkIgcAABiJyAEAAEYicgAAgJH+H12EaWW3KAEkAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 582.861x500 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.catplot(x=\"age_groups\", hue=\"credit_risk\", kind=\"count\", data=training_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the age group with members younger than 25 yrs is at a disadvantage: almost as many applications get rejected as approved, whereas for the group with members $\\geq$ 25 yrs the ratio is almost 3:1 (i.e., about three times as many applications are approved as rejected)."
   ]
  },
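  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The comparison read off the plot can also be expressed as a per-group approval rate. The snippet below is a minimal sketch on a small made-up DataFrame: the `toy` frame, its counts, and the 0/1 coding of `age_groups` are illustrative assumptions, and it assumes `credit_risk` is coded as 1 for an approved application and 0 for a rejected one. The same `groupby` call should carry over to `training_data` if those assumptions hold.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data (not the real dataset)\n",
    "# Assumed coding: credit_risk 1 = approved, 0 = rejected\n",
    "toy = pd.DataFrame(\n",
    "    {\n",
    "        \"age_groups\": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],\n",
    "        \"credit_risk\": [1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Share of approved applications within each age group\n",
    "approval_rate = toy.groupby(\"age_groups\")[\"credit_risk\"].mean()\n",
    "print(approval_rate)\n",
    "```\n",
    "\n",
    "A large gap between the two rates is a numeric counterpart of the imbalance visible in the count plot."
   ]
  },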
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train - Test - Validation datasets shapes:  (720, 24) (200, 23) (80, 24)\n"
     ]
    }
   ],
   "source": [
    "# We only need to split between train and val (test is already separate)\n",
    "train_data, val_data = train_test_split(\n",
    "    training_data, test_size=0.1, shuffle=True, random_state=23\n",
    ")\n",
    "\n",
    "# Print the shapes of the train, test, and validation datasets\n",
    "print(\n",
    "    \"Train - Test - Validation datasets shapes: \",\n",
    "    train_data.shape,\n",
    "    test_data.shape,\n",
    "    val_data.shape,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 <a name=\"24\">Data processing with Pipeline</a>\n",
    "(<a href=\"#2\">Go to Data Processing</a>)\n",
    "\n",
    "Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values and scale the numerical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the imputed and scaled dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Implement here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "### STEP 1 ###\n",
    "##############\n",
    "\n",
    "# Preprocess the numerical features\n",
    "numerical_processor = Pipeline(\n",
    "    [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n",
    ")\n",
    "# Preprocess the categorical features\n",
    "categorical_processor = Pipeline(\n",
    "    [\n",
    "        (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n",
    "        (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\", drop=\"if_binary\")),\n",
    "    ]\n",
    ")\n",
    "\n",
    "### STEP 2 ###\n",
    "##############\n",
    "\n",
    "# Combine all data preprocessors from above\n",
    "data_processor = ColumnTransformer(\n",
    "    [\n",
    "        (\"numerical_processing\", numerical_processor, numerical_features),\n",
    "        (\"categorical_processing\", categorical_processor, categorical_features),\n",
    "    ]\n",
    ")\n",
    "\n",
    "### STEP 3 ###\n",
    "##############\n",
    "\n",
    "# Chain all desired data transformers, with an estimator at the end\n",
    "# The step names given here let you set/access parameters later, e.g. for hyperparameter tuning\n",
    "pipeline = Pipeline(\n",
    "    [\n",
    "        (\"data_processing\", data_processor),\n",
    "        (\"lg\", LogisticRegression(solver=\"lbfgs\", penalty=\"none\")),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. <a name=\"3\">Train (and Tune) a Classifier</a> (Implement)\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies and different scaling methods."
   ]
  },
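  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible way to try several imputation strategies and scalers is a grid search over the named pipeline steps. The snippet below is a minimal sketch, not the solution used in this notebook: it runs on made-up numerical data, and `toy_pipeline`, `param_grid`, and `search` are hypothetical names introduced only for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n",
    "\n",
    "# Hypothetical toy data with a few missing values (not the real dataset)\n",
    "rng = np.random.default_rng(0)\n",
    "X = rng.normal(size=(60, 2))\n",
    "y = (X[:, 0] + X[:, 1] > 0).astype(int)\n",
    "X[::10, 0] = np.nan\n",
    "\n",
    "toy_pipeline = Pipeline(\n",
    "    [\n",
    "        (\"num_imputer\", SimpleImputer()),\n",
    "        (\"num_scaler\", MinMaxScaler()),\n",
    "        (\"lg\", LogisticRegression()),\n",
    "    ]\n",
    ")\n",
    "\n",
    "# Step names joined with \"__\" address a parameter of that step;\n",
    "# using the bare step name swaps the whole transformer\n",
    "param_grid = {\n",
    "    \"num_imputer__strategy\": [\"mean\", \"median\"],\n",
    "    \"num_scaler\": [MinMaxScaler(), StandardScaler()],\n",
    "}\n",
    "\n",
    "search = GridSearchCV(toy_pipeline, param_grid, cv=3, scoring=\"accuracy\")\n",
    "search.fit(X, y)\n",
    "print(search.best_params_)\n",
    "```\n",
    "\n",
    "For the nested pipeline built in section 2.4, the same convention would give longer keys such as `data_processing__numerical_processing__num_imputer__strategy`."
   ]
  },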
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Implement here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 
div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;data_processing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_imputer&#x27;,\n",
       "                                                                   SimpleImputer()),\n",
       "                                                                  (&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;credit_amount&#x27;,\n",
       "                                                   &#x27;credit_duration_months&#x27;]),\n",
       "                                                 (&#x27;categorical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;cat_imputer&#x27;,\n",
       "                                                                   SimpleImputer(fill_value=&#x27;missing&#x27;,\n",
       "                                                                                 strategy=&#x27;constant&#x27;)),\n",
       "                                                                  (&#x27;cat_encoder&#x27;,\n",
       "                                                                   OneHotEncoder(drop=&#x27;if_binary&#x27;,\n",
       "                                                                                 handle_unknown=&#x27;ignore&#x27;))]),\n",
       "                                                  [&#x27;job_status&#x27;,\n",
       "                                                   &#x27;employed_since_years&#x27;,\n",
       "                                                   &#x27;savings&#x27;,\n",
       "                                                   &#x27;age_groups&#x27;])])),\n",
       "                (&#x27;lg&#x27;, LogisticRegression(penalty=&#x27;none&#x27;))])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" ><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[(&#x27;data_processing&#x27;,\n",
       "                 ColumnTransformer(transformers=[(&#x27;numerical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;num_imputer&#x27;,\n",
       "                                                                   SimpleImputer()),\n",
       "                                                                  (&#x27;num_scaler&#x27;,\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  [&#x27;credit_amount&#x27;,\n",
       "                                                   &#x27;credit_duration_months&#x27;]),\n",
       "                                                 (&#x27;categorical_processing&#x27;,\n",
       "                                                  Pipeline(steps=[(&#x27;cat_imputer&#x27;,\n",
       "                                                                   SimpleImputer(fill_value=&#x27;missing&#x27;,\n",
       "                                                                                 strategy=&#x27;constant&#x27;)),\n",
       "                                                                  (&#x27;cat_encoder&#x27;,\n",
       "                                                                   OneHotEncoder(drop=&#x27;if_binary&#x27;,\n",
       "                                                                                 handle_unknown=&#x27;ignore&#x27;))]),\n",
       "                                                  [&#x27;job_status&#x27;,\n",
       "                                                   &#x27;employed_since_years&#x27;,\n",
       "                                                   &#x27;savings&#x27;,\n",
       "                                                   &#x27;age_groups&#x27;])])),\n",
       "                (&#x27;lg&#x27;, LogisticRegression(penalty=&#x27;none&#x27;))])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">data_processing: ColumnTransformer</label><div class=\"sk-toggleable__content\"><pre>ColumnTransformer(transformers=[(&#x27;numerical_processing&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;num_imputer&#x27;,\n",
       "                                                  SimpleImputer()),\n",
       "                                                 (&#x27;num_scaler&#x27;,\n",
       "                                                  MinMaxScaler())]),\n",
       "                                 [&#x27;credit_amount&#x27;, &#x27;credit_duration_months&#x27;]),\n",
       "                                (&#x27;categorical_processing&#x27;,\n",
       "                                 Pipeline(steps=[(&#x27;cat_imputer&#x27;,\n",
       "                                                  SimpleImputer(fill_value=&#x27;missing&#x27;,\n",
       "                                                                strategy=&#x27;constant&#x27;)),\n",
       "                                                 (&#x27;cat_encoder&#x27;,\n",
       "                                                  OneHotEncoder(drop=&#x27;if_binary&#x27;,\n",
       "                                                                handle_unknown=&#x27;ignore&#x27;))]),\n",
       "                                 [&#x27;job_status&#x27;, &#x27;employed_since_years&#x27;,\n",
       "                                  &#x27;savings&#x27;, &#x27;age_groups&#x27;])])</pre></div></div></div><div class=\"sk-parallel\"><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">numerical_processing</label><div class=\"sk-toggleable__content\"><pre>[&#x27;credit_amount&#x27;, &#x27;credit_duration_months&#x27;]</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SimpleImputer</label><div class=\"sk-toggleable__content\"><pre>SimpleImputer()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-5\" type=\"checkbox\" ><label for=\"sk-estimator-id-5\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">MinMaxScaler</label><div class=\"sk-toggleable__content\"><pre>MinMaxScaler()</pre></div></div></div></div></div></div></div></div><div class=\"sk-parallel-item\"><div class=\"sk-item\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-6\" type=\"checkbox\" ><label for=\"sk-estimator-id-6\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">categorical_processing</label><div class=\"sk-toggleable__content\"><pre>[&#x27;job_status&#x27;, &#x27;employed_since_years&#x27;, &#x27;savings&#x27;, &#x27;age_groups&#x27;]</pre></div></div></div><div class=\"sk-serial\"><div 
class=\"sk-item\"><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-7\" type=\"checkbox\" ><label for=\"sk-estimator-id-7\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">SimpleImputer</label><div class=\"sk-toggleable__content\"><pre>SimpleImputer(fill_value=&#x27;missing&#x27;, strategy=&#x27;constant&#x27;)</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-8\" type=\"checkbox\" ><label for=\"sk-estimator-id-8\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">OneHotEncoder</label><div class=\"sk-toggleable__content\"><pre>OneHotEncoder(drop=&#x27;if_binary&#x27;, handle_unknown=&#x27;ignore&#x27;)</pre></div></div></div></div></div></div></div></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-9\" type=\"checkbox\" ><label for=\"sk-estimator-id-9\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LogisticRegression</label><div class=\"sk-toggleable__content\"><pre>LogisticRegression(penalty=&#x27;none&#x27;)</pre></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('data_processing',\n",
       "                 ColumnTransformer(transformers=[('numerical_processing',\n",
       "                                                  Pipeline(steps=[('num_imputer',\n",
       "                                                                   SimpleImputer()),\n",
       "                                                                  ('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['credit_amount',\n",
       "                                                   'credit_duration_months']),\n",
       "                                                 ('categorical_processing',\n",
       "                                                  Pipeline(steps=[('cat_imputer',\n",
       "                                                                   SimpleImputer(fill_value='missing',\n",
       "                                                                                 strategy='constant')),\n",
       "                                                                  ('cat_encoder',\n",
       "                                                                   OneHotEncoder(drop='if_binary',\n",
       "                                                                                 handle_unknown='ignore'))]),\n",
       "                                                  ['job_status',\n",
       "                                                   'employed_since_years',\n",
       "                                                   'savings',\n",
       "                                                   'age_groups'])])),\n",
       "                ('lg', LogisticRegression(penalty='none'))])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get train data to train the classifier\n",
    "X_train = train_data[model_features]\n",
    "y_train = train_data[model_target]\n",
    "\n",
    "# Fit the classifier to the train data\n",
    "# Train data going through the Pipeline is imputed (with means from the train data),\n",
    "#   scaled (with the min/max from the train data),\n",
    "#   and finally used to fit the model\n",
    "pipeline.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. <a name=\"4\">Make Predictions on the Test Dataset</a> (Implement)\n",
    "(<a href=\"#0\">Go to top</a>)\n",
    "\n",
    "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates for DI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Implement here\n",
    "\n",
    "# Get test data to test the classifier\n",
    "# ! test data should come from german_credit_test.csv !\n",
    "# ...\n",
    "\n",
    "# Use the trained model to make predictions on the test dataset\n",
    "# test_predictions = ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get test data to validate the classifier\n",
    "X_test = test_data[model_features]\n",
    "\n",
    "# Use the fitted model to make predictions on the test dataset\n",
    "# Test data going through the Pipeline is imputed (with means from the train data),\n",
    "#   scaled (with the min/max from the train data),\n",
    "#   and finally used to make predictions\n",
    "test_predictions = pipeline.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. <a name=\"5\">Evaluate Results</a> (Given)\n",
    "(<a href=\"#0\">Go to top</a>)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n",
    "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n",
    "result_df[\"credit_risk_pred\"] = test_predictions\n",
    "\n",
    "result_df.to_csv(\"../../data/final_project/project_day1_result.csv\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Final Evaluation on Test Data - Disparate Impact\n",
    "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)."
   ]
  },
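  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of what the metric measures (the notation here is ours, not taken from the linked documentation): let $\\hat{y}$ denote the predicted label, where $\\hat{y} = 0$ is the favorable outcome (not a credit risk), and let $a$ denote the value of `age_groups`, where $a = 0$ marks applicants younger than 25 and $a = 1$ marks applicants aged 25 or older. Disparate impact is then the ratio of favorable-prediction rates between the two groups:\n",
    "\n",
    "$$\\mathrm{DI} = \\frac{P(\\hat{y} = 0 \\mid a = 0)}{P(\\hat{y} = 0 \\mid a = 1)}$$\n",
    "\n",
    "A value of 1 means both groups receive favorable predictions at the same rate; values below 1 mean the younger group is predicted favorably less often, and values above 1 mean the opposite."
   ]
  },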
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n",
    "    \"\"\"\n",
    "    Function to calculate Disparate Impact metric using the results from this notebook.\n",
    "    \"\"\"\n",
    "    try:\n",
    "        # Merge predictions with original test data to model per group\n",
    "        di_df = pred_df.merge(test_data, on=\"ID\")\n",
    "        # Count for group with members less than 25y old\n",
    "        pos_outcomes_less25 = di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts()[\n",
    "            0\n",
    "        ]  # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
    "        total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n",
    "        # Count for group with members greater equal 25y old\n",
    "        pos_outcomes_geq25 = di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts()[\n",
    "            0\n",
    "        ]  # value_counts()[0] takes the count of the '0 credit risk' == 'not credit risk'\n",
    "        total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n",
    "        # Check if correct number of gorups\n",
    "        if total_geq25 == 0:\n",
    "            print(\"There is only one group present in the data.\")\n",
    "        elif total_less25 == 0:\n",
    "            print(\"There is only one group present in the data.\")\n",
    "        else:\n",
    "            disparate_impact = (pos_outcomes_less25 / total_less25) / (\n",
    "                pos_outcomes_geq25 / total_geq25\n",
    "            )\n",
    "            return disparate_impact\n",
    "    except:\n",
    "        print(\"Wrong inputs provided.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9384288747346072"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "calculate_di(test_data, result_df, \"credit_risk_pred\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While this might look good, keep in mind that `age_groups` was used to train the model; depending on the domain, it might not be permissible to use this feature."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Final Evaluation on Test Data - Accuracy & F1 Score\n",
    "In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy."
   ]
  },
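  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a short sketch of the F1 definition, assuming the default binary setting of `f1_score`, in which label 1 (credit risk) is treated as the positive class. With $TP$, $FP$, and $FN$ denoting true positives, false positives, and false negatives for that class:\n",
    "\n",
    "$$\\mathrm{precision} = \\frac{TP}{TP + FP}, \\qquad \\mathrm{recall} = \\frac{TP}{TP + FN}, \\qquad F_1 = \\frac{2 \\cdot \\mathrm{precision} \\cdot \\mathrm{recall}}{\\mathrm{precision} + \\mathrm{recall}}$$\n",
    "\n",
    "Unlike accuracy, $F_1$ ignores true negatives, so when the positive class is rare a model can score well on accuracy while scoring poorly on $F_1$."
   ]
  },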
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.735"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "accuracy_score(\n",
    "    pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
    "        \"credit_risk\"\n",
    "    ],\n",
    "    result_df[\"credit_risk_pred\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3116883116883117"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "f1_score(\n",
    "    pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n",
    "        \"credit_risk\"\n",
    "    ],\n",
    "    result_df[\"credit_risk_pred\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the end of the notebook."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p39",
   "language": "python",
   "name": "conda_pytorch_p39"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}