{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Responsible AI - Final Project Solution\n", "\n", "Build a fair [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n", "\n", "### Final Project Problem: Loan Approval\n", "\n", "__Problem Definition:__\n", "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan will be repaid (that is, whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n", "\n", "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be acceptable for age to play a significant role in the model (loans should be approved or denied regardless of an individual's age).\n", "\n", "\n", "``` \n", "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n", "```\n", "\n", "1. Read the datasets (Given) \n", "2. Data Processing (Implement)\n", " * Exploratory Data Analysis\n", " * Select features to build the model (Suggested)\n", " * Train - Validation - Test Datasets\n", " * Feature transformation\n", "3. Train a Classifier on the Training Dataset (Implement)\n", "4. Make Predictions on the Test Dataset (Implement)\n", "5. 
Evaluate Results (Given)\n", "\n", "\n", "__Datasets and Files:__\n", "\n", "\n", "- ```german_credit_training.csv```: Training data with loan applicant features such as credit history, dependents, savings, account status, and age group (among others). The label is __credit_risk__.\n", "\n", "- ```german_credit_test.csv```: Test data with the same features as above but without the label. Predictions are made on this data to emulate a production environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook assumes the SageMaker kernel `conda_pytorch_p39`. In addition, the libraries listed in `requirements.txt` need to be installed:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install --no-deps -U -q -r ../../requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%capture\n", "\n", "# Reshaping/basic libraries\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Plotting libraries\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline\n", "import seaborn as sns\n", "\n", "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n", "\n", "# ML libraries\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# Operational libraries\n", "import sys\n", "\n", "sys.path.append(\"..\")\n", "sys.path.insert(1, \"..\")\n", "\n", "# Fairness libraries\n", "from folktables.acs import *\n", "from folktables.folktables import *\n", "from folktables.load_acs import *\n", "from fairlearn.reductions import EqualizedOdds\n", "from 
fairlearn.postprocessing import ThresholdOptimizer\n", "from fairlearn.metrics import MetricFrame, selection_rate\n", "\n", "# Jupyter(lab) libraries\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Read the datasets (Given)\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we read the __training__ and __test__ datasets into dataframes using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of the training dataset is: (800, 24)\n", "The shape of the test dataset is: (200, 23)\n" ] } ], "source": [ "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n", "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n", "\n", "print(\"The shape of the training dataset is:\", training_data.shape)\n", "print(\"The shape of the test dataset is:\", test_data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data Processing (Implement)\n", "(Go to top) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Exploratory Data Analysis\n", "(Go to Data Processing)\n", "\n", "We look at the number of rows and columns, and at some simple statistics of the datasets." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
|  | ID | sex | personal_status | checking_account_status | credit_duration_months | credit_history | credit_purpose | credit_amount | savings | employed_since_years | ... | age_years | other_installment_plans | housing | num_existing_credits | job_status | num_people_liable_for | telephone | foreign_worker | credit_risk | age_groups |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 564 | male | married/widowed | ... < 0 DM | 24 | existing credits paid back duly till now | retraining | 4712 | .. >= 1000 DM | 1 <= ... < 4 years | ... | 37 | bank | rent | 2-3 | management/self-employed/highly qualified empl... | 0 to 2 | True | False | 0 | 1 |
| 1 | 484 | male | married/widowed | ... >= 200 DM | 12 | all credits at this bank paid back duly | others | 1163 | 100 <= ... < 500 DM | 1 <= ... < 4 years | ... | 44 | none | rent | 1 | skilled employee/official | 0 to 2 | True | False | 0 | 1 |
| 2 | 26 | female | single | ... >= 200 DM | 6 | delay in paying off in the past | furniture/equipment | 426 | unknown/no savings account | .. >= 7 years | ... | 39 | none | rent | 1 | unskilled - resident | 0 to 2 | False | False | 0 | 1 |
| 3 | 443 | male | married/widowed | ... >= 200 DM | 12 | no credits taken/all credits paid back duly | repairs | 719 | unknown/no savings account | .. >= 7 years | ... | 41 | bank | rent | 1 | unskilled - resident | 3 or more | False | False | 1 | 1 |
| 4 | 211 | female/male | female (non-single) or male (single) | ... >= 200 DM | 36 | no credits taken/all credits paid back duly | furniture/equipment | 3835 | .. >= 1000 DM | .. >= 7 years | ... | 45 | none | rent | 1 | unskilled - resident | 0 to 2 | True | False | 0 | 1 |

5 rows × 24 columns
|  | ID | sex | personal_status | checking_account_status | credit_duration_months | credit_history | credit_purpose | credit_amount | savings | employed_since_years | ... | property | age_years | other_installment_plans | housing | num_existing_credits | job_status | num_people_liable_for | telephone | foreign_worker | age_groups |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 963 | male | married/widowed | ... >= 200 DM | 24 | no credits taken/all credits paid back duly | furniture/equipment | 2397 | 100 <= ... < 500 DM | .. >= 7 years | ... | building society savings agreement/life insurance | 35 | bank | rent | 2-3 | skilled employee/official | 0 to 2 | True | False | 1 |
| 1 | 611 | female/male | female (non-single) or male (single) | 0 <= ... < 200 DM | 10 | no credits taken/all credits paid back duly | others | 1240 | ... < 100 DM | .. >= 7 years | ... | real estate | 48 | none | own | 1 | unskilled - resident | 3 or more | False | False | 1 |
| 2 | 106 | male | married/widowed | ... >= 200 DM | 18 | critical account/other credits existing (not a... | others | 6458 | unknown/no savings account | .. >= 7 years | ... | real estate | 39 | bank | rent | 2-3 | management/self-employed/highly qualified empl... | 3 or more | True | False | 1 |
| 3 | 891 | male | married/widowed | ... >= 200 DM | 15 | all credits at this bank paid back duly | furniture/equipment | 1829 | unknown/no savings account | .. >= 7 years | ... | building society savings agreement/life insurance | 46 | none | rent | 2-3 | skilled employee/official | 0 to 2 | True | False | 1 |
| 4 | 342 | female | single | ... < 0 DM | 18 | no credits taken/all credits paid back duly | furniture/equipment | 3213 | 100 <= ... < 500 DM | ... < 1 year | ... | unknown/no property | 25 | none | for free | 1 | skilled employee/official | 0 to 2 | False | False | 1 |

5 rows × 23 columns
Pipeline(steps=[('data_processing',
                 ColumnTransformer(transformers=[('numerical_processing',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer()),
                                                                  ('num_scaler',
                                                                   MinMaxScaler())]),
                                                  ['credit_duration_months',
                                                   'credit_amount']),
                                                 ('categorical_processing',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('cat_encoder',
                                                                   OneHot...))]),
                                                  ['sex', 'personal_status',
                                                   'checking_account_status',
                                                   'credit_history',
                                                   'credit_purpose', 'savings',
                                                   'employed_since_years',
                                                   'installment_rate',
                                                   'other_debtors_guarantors',
                                                   'present_residence_since',
                                                   'property',
                                                   'other_installment_plans',
                                                   'housing',
                                                   'num_existing_credits',
                                                   'job_status',
                                                   'num_people_liable_for'])])),
                ('lg', LogisticRegression(random_state=0))])

ColumnTransformer(transformers=[('numerical_processing',
                                 Pipeline(steps=[('num_imputer',
                                                  SimpleImputer()),
                                                 ('num_scaler',
                                                  MinMaxScaler())]),
                                 ['credit_duration_months', 'credit_amount']),
                                ('categorical_processing',
                                 Pipeline(steps=[('cat_imputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('cat_encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['sex', 'personal_status',
                                  'checking_account_status', 'credit_history',
                                  'credit_purpose', 'savings',
                                  'employed_since_years', 'installment_rate',
                                  'other_debtors_guarantors',
                                  'present_residence_since', 'property',
                                  'other_installment_plans', 'housing',
                                  'num_existing_credits', 'job_status',
                                  'num_people_liable_for'])])
ThresholdOptimizer(constraints='equalized_odds',
                   estimator=LogisticRegression(random_state=0))