{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Responsible AI - Final Project Solution\n", "\n", "Build a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) that predicts the __credit_risk__ field (whether someone is a credit risk or not) of the [German Credit Dataset](https://archive.ics.uci.edu/ml/datasets/South+German+Credit+%28UPDATE%29).\n", "\n", "### Final Project Problem: Loan Approval\n", "\n", "__Problem Definition:__\n", "Given a set of features about an individual (e.g. age, past credit history, immigration status, ...), predict whether a loan is repaid or not (i.e., whether the customer is a credit risk). We impose the additional constraint that the model should be fair with respect to different age groups ($\\geq$ 25 yrs and $<$ 25 yrs).\n", "\n", "In the banking industry, there are certain regulations regarding the use of sensitive features (e.g., age, ethnicity, marital status, ...). According to those regulations, it would not be okay if age played a significant role in the model (loans should be approved/denied regardless of an individual's age).\n", "\n", "For example, certain laws declare it unlawful for creditors to discriminate against any applicant on the basis of age (or other sensitive attributes). For more details, have a look at this paper:\n", "\n", "``` \n", "F. Kamiran and T. Calders, \"Data Preprocessing Techniques for Classification without Discrimination,\" Knowledge and Information Systems, 2012\n", "```\n", "\n", "__Table of contents__\n", "\n", "1. Read the datasets (Given) \n", "2. Data Processing (Implement)\n", "    * Exploratory Data Analysis\n", "    * Select features to build the model (Suggested)\n", "    * Train - Validation - Test Datasets\n", "    * Data Processing with Pipeline\n", "3. Train (and Tune) a Classifier on the Training Dataset (Implement)\n", "4. 
Make Predictions on the Test Dataset (Implement)\n", "5. Evaluate Results (Given)\n", "\n", "\n", "__Datasets and Files__\n", "\n", "\n", "- ```german_credit_training.csv```: Training data with loan applicant features: credit history, dependents, savings, account status, age group (and more). The label is __credit_risk__.\n", "\n", "- ```german_credit_test.csv```: Test data with the same features as above but without the label. This is the data to make predictions for, emulating a production environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook assumes an installation of the SageMaker kernel `conda_pytorch_p39`. In addition, the libraries listed in requirements.txt need to be installed:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "!pip install --no-deps -U -q -r ../../requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Reshaping/basic libraries\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Plotting libraries\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n", "\n", "# ML libraries\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, accuracy_score, f1_score\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "# Operational libraries\n", "import sys\n", "\n", "sys.path.append(\"..\")\n", "\n", "# Jupyter(lab) libraries\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Read the datasets (Given)\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We read the __training__ and __test__ datasets into dataframes using [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). This library allows us to read and manipulate our data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The shape of the training dataset is: (800, 24)\n", "The shape of the test dataset is: (200, 23)\n" ] } ], "source": [ "training_data = pd.read_csv(\"../../data/final_project/german_credit_training.csv\")\n", "test_data = pd.read_csv(\"../../data/final_project/german_credit_test.csv\")\n", "\n", "print(\"The shape of the training dataset is:\", training_data.shape)\n", "print(\"The shape of the test dataset is:\", test_data.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data Processing (Implement)\n", "(Go to top) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Exploratory Data Analysis\n", "(Go to Data Processing)\n", "\n", "We look at the number of rows and columns, and at some simple statistics of the dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>ID</th><th>sex</th><th>personal_status</th><th>checking_account_status</th><th>credit_duration_months</th><th>credit_history</th><th>credit_purpose</th><th>credit_amount</th><th>savings</th><th>employed_since_years</th><th>...</th><th>age_years</th><th>other_installment_plans</th><th>housing</th><th>num_existing_credits</th><th>job_status</th><th>num_people_liable_for</th><th>telephone</th><th>foreign_worker</th><th>credit_risk</th><th>age_groups</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>564</td><td>male</td><td>married/widowed</td><td>... &lt; 0 DM</td><td>24</td><td>existing credits paid back duly till now</td><td>retraining</td><td>4712</td><td>.. &gt;= 1000 DM</td><td>1 &lt;= ... &lt; 4 years</td><td>...</td><td>37</td><td>bank</td><td>rent</td><td>2-3</td><td>management/self-employed/highly qualified empl...</td><td>0 to 2</td><td>True</td><td>False</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>484</td><td>male</td><td>married/widowed</td><td>... &gt;= 200 DM</td><td>12</td><td>all credits at this bank paid back duly</td><td>others</td><td>1163</td><td>100 &lt;= ... &lt; 500 DM</td><td>1 &lt;= ... &lt; 4 years</td><td>...</td><td>44</td><td>none</td><td>rent</td><td>1</td><td>skilled employee/official</td><td>0 to 2</td><td>True</td><td>False</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>26</td><td>female</td><td>single</td><td>... &gt;= 200 DM</td><td>6</td><td>delay in paying off in the past</td><td>furniture/equipment</td><td>426</td><td>unknown/no savings account</td><td>.. &gt;= 7 years</td><td>...</td><td>39</td><td>none</td><td>rent</td><td>1</td><td>unskilled - resident</td><td>0 to 2</td><td>False</td><td>False</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>443</td><td>male</td><td>married/widowed</td><td>... &gt;= 200 DM</td><td>12</td><td>no credits taken/all credits paid back duly</td><td>repairs</td><td>719</td><td>unknown/no savings account</td><td>.. &gt;= 7 years</td><td>...</td><td>41</td><td>bank</td><td>rent</td><td>1</td><td>unskilled - resident</td><td>3 or more</td><td>False</td><td>False</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>211</td><td>female/male</td><td>female (non-single) or male (single)</td><td>... &gt;= 200 DM</td><td>36</td><td>no credits taken/all credits paid back duly</td><td>furniture/equipment</td><td>3835</td><td>.. &gt;= 1000 DM</td><td>.. &gt;= 7 years</td><td>...</td><td>45</td><td>none</td><td>rent</td><td>1</td><td>unskilled - resident</td><td>0 to 2</td><td>True</td><td>False</td><td>0</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 24 columns</p>\n",
"</div>
" ], "text/plain": [ " ID sex personal_status \\\n", "0 564 male married/widowed \n", "1 484 male married/widowed \n", "2 26 female single \n", "3 443 male married/widowed \n", "4 211 female/male female (non-single) or male (single) \n", "\n", " checking_account_status credit_duration_months \\\n", "0 ... < 0 DM 24 \n", "1 ... >= 200 DM 12 \n", "2 ... >= 200 DM 6 \n", "3 ... >= 200 DM 12 \n", "4 ... >= 200 DM 36 \n", "\n", " credit_history credit_purpose \\\n", "0 existing credits paid back duly till now retraining \n", "1 all credits at this bank paid back duly others \n", "2 delay in paying off in the past furniture/equipment \n", "3 no credits taken/all credits paid back duly repairs \n", "4 no credits taken/all credits paid back duly furniture/equipment \n", "\n", " credit_amount savings employed_since_years ... \\\n", "0 4712 .. >= 1000 DM 1 <= ... < 4 years ... \n", "1 1163 100 <= ... < 500 DM 1 <= ... < 4 years ... \n", "2 426 unknown/no savings account .. >= 7 years ... \n", "3 719 unknown/no savings account .. >= 7 years ... \n", "4 3835 .. >= 1000 DM .. >= 7 years ... \n", "\n", " age_years other_installment_plans housing num_existing_credits \\\n", "0 37 bank rent 2-3 \n", "1 44 none rent 1 \n", "2 39 none rent 1 \n", "3 41 bank rent 1 \n", "4 45 none rent 1 \n", "\n", " job_status num_people_liable_for \\\n", "0 management/self-employed/highly qualified empl... 0 to 2 \n", "1 skilled employee/official 0 to 2 \n", "2 unskilled - resident 0 to 2 \n", "3 unskilled - resident 3 or more \n", "4 unskilled - resident 0 to 2 \n", "\n", " telephone foreign_worker credit_risk age_groups \n", "0 True False 0 1 \n", "1 True False 0 1 \n", "2 False False 0 1 \n", "3 False False 1 1 \n", "4 True False 0 1 \n", "\n", "[5 rows x 24 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "training_data.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>ID</th><th>sex</th><th>personal_status</th><th>checking_account_status</th><th>credit_duration_months</th><th>credit_history</th><th>credit_purpose</th><th>credit_amount</th><th>savings</th><th>employed_since_years</th><th>...</th><th>property</th><th>age_years</th><th>other_installment_plans</th><th>housing</th><th>num_existing_credits</th><th>job_status</th><th>num_people_liable_for</th><th>telephone</th><th>foreign_worker</th><th>age_groups</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>963</td><td>male</td><td>married/widowed</td><td>... &gt;= 200 DM</td><td>24</td><td>no credits taken/all credits paid back duly</td><td>furniture/equipment</td><td>2397</td><td>100 &lt;= ... &lt; 500 DM</td><td>.. &gt;= 7 years</td><td>...</td><td>building society savings agreement/life insurance</td><td>35</td><td>bank</td><td>rent</td><td>2-3</td><td>skilled employee/official</td><td>0 to 2</td><td>True</td><td>False</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>611</td><td>female/male</td><td>female (non-single) or male (single)</td><td>0 &lt;= ... &lt; 200 DM</td><td>10</td><td>no credits taken/all credits paid back duly</td><td>others</td><td>1240</td><td>... &lt; 100 DM</td><td>.. &gt;= 7 years</td><td>...</td><td>real estate</td><td>48</td><td>none</td><td>own</td><td>1</td><td>unskilled - resident</td><td>3 or more</td><td>False</td><td>False</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>106</td><td>male</td><td>married/widowed</td><td>... &gt;= 200 DM</td><td>18</td><td>critical account/other credits existing (not a...</td><td>others</td><td>6458</td><td>unknown/no savings account</td><td>.. &gt;= 7 years</td><td>...</td><td>real estate</td><td>39</td><td>bank</td><td>rent</td><td>2-3</td><td>management/self-employed/highly qualified empl...</td><td>3 or more</td><td>True</td><td>False</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>891</td><td>male</td><td>married/widowed</td><td>... &gt;= 200 DM</td><td>15</td><td>all credits at this bank paid back duly</td><td>furniture/equipment</td><td>1829</td><td>unknown/no savings account</td><td>.. &gt;= 7 years</td><td>...</td><td>building society savings agreement/life insurance</td><td>46</td><td>none</td><td>rent</td><td>2-3</td><td>skilled employee/official</td><td>0 to 2</td><td>True</td><td>False</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>342</td><td>female</td><td>single</td><td>... &lt; 0 DM</td><td>18</td><td>no credits taken/all credits paid back duly</td><td>furniture/equipment</td><td>3213</td><td>100 &lt;= ... &lt; 500 DM</td><td>... &lt; 1 year</td><td>...</td><td>unknown/no property</td><td>25</td><td>none</td><td>for free</td><td>1</td><td>skilled employee/official</td><td>0 to 2</td><td>False</td><td>False</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 23 columns</p>\n",
"</div>
" ], "text/plain": [ " ID sex personal_status \\\n", "0 963 male married/widowed \n", "1 611 female/male female (non-single) or male (single) \n", "2 106 male married/widowed \n", "3 891 male married/widowed \n", "4 342 female single \n", "\n", " checking_account_status credit_duration_months \\\n", "0 ... >= 200 DM 24 \n", "1 0 <= ... < 200 DM 10 \n", "2 ... >= 200 DM 18 \n", "3 ... >= 200 DM 15 \n", "4 ... < 0 DM 18 \n", "\n", " credit_history credit_purpose \\\n", "0 no credits taken/all credits paid back duly furniture/equipment \n", "1 no credits taken/all credits paid back duly others \n", "2 critical account/other credits existing (not a... others \n", "3 all credits at this bank paid back duly furniture/equipment \n", "4 no credits taken/all credits paid back duly furniture/equipment \n", "\n", " credit_amount savings employed_since_years ... \\\n", "0 2397 100 <= ... < 500 DM .. >= 7 years ... \n", "1 1240 ... < 100 DM .. >= 7 years ... \n", "2 6458 unknown/no savings account .. >= 7 years ... \n", "3 1829 unknown/no savings account .. >= 7 years ... \n", "4 3213 100 <= ... < 500 DM ... < 1 year ... \n", "\n", " property age_years \\\n", "0 building society savings agreement/life insurance 35 \n", "1 real estate 48 \n", "2 real estate 39 \n", "3 building society savings agreement/life insurance 46 \n", "4 unknown/no property 25 \n", "\n", " other_installment_plans housing num_existing_credits \\\n", "0 bank rent 2-3 \n", "1 none own 1 \n", "2 bank rent 2-3 \n", "3 none rent 2-3 \n", "4 none for free 1 \n", "\n", " job_status num_people_liable_for \\\n", "0 skilled employee/official 0 to 2 \n", "1 unskilled - resident 3 or more \n", "2 management/self-employed/highly qualified empl... 
3 or more \n", "3 skilled employee/official 0 to 2 \n", "4 skilled employee/official 0 to 2 \n", "\n", " telephone foreign_worker age_groups \n", "0 True False 1 \n", "1 False False 1 \n", "2 True False 1 \n", "3 True False 1 \n", "4 False False 1 \n", "\n", "[5 rows x 23 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_data.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Implement more EDA here" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 800 entries, 0 to 799\n", "Data columns (total 24 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 ID 800 non-null int64 \n", " 1 sex 800 non-null object\n", " 2 personal_status 800 non-null object\n", " 3 checking_account_status 800 non-null object\n", " 4 credit_duration_months 800 non-null int64 \n", " 5 credit_history 800 non-null object\n", " 6 credit_purpose 800 non-null object\n", " 7 credit_amount 800 non-null int64 \n", " 8 savings 800 non-null object\n", " 9 employed_since_years 800 non-null object\n", " 10 installment_rate 800 non-null object\n", " 11 other_debtors_guarantors 800 non-null object\n", " 12 present_residence_since 800 non-null object\n", " 13 property 800 non-null object\n", " 14 age_years 800 non-null int64 \n", " 15 other_installment_plans 800 non-null object\n", " 16 housing 800 non-null object\n", " 17 num_existing_credits 800 non-null object\n", " 18 job_status 800 non-null object\n", " 19 num_people_liable_for 800 non-null object\n", " 20 telephone 800 non-null bool \n", " 21 foreign_worker 800 non-null bool \n", " 22 credit_risk 800 non-null int64 \n", " 23 age_groups 800 non-null int64 \n", "dtypes: bool(2), int64(6), object(16)\n", "memory usage: 139.2+ KB\n" ] } ], "source": [ "training_data.info()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### 2.2 Select features to build the model \n", "(Go to Data Processing)\n", "\n", "For a quick start, we recommend using only a few of the numerical and categorical features. However, feel free to explore other fields. In this case, we do not need to cast our features to numerical/object types. Be mindful of some of the feature names: they suggest numerical values, but on inspection it becomes clear that they are actually categorical (e.g. `employed_since_years` has been binned into groups).\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Grab model features/inputs and target/output\n", "categorical_features = [\"job_status\", \"employed_since_years\", \"savings\", \"age_groups\"]\n", "\n", "numerical_features = [\"credit_amount\", \"credit_duration_months\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Separate features and the model target." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model features: ['job_status', 'employed_since_years', 'savings', 'age_groups', 'credit_amount', 'credit_duration_months']\n", "Model target: credit_risk\n" ] } ], "source": [ "model_target = \"credit_risk\"\n", "model_features = categorical_features + numerical_features\n", "\n", "print(\"Model features: \", model_features)\n", "print(\"Model target: \", model_target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Train - Validation Datasets\n", "(Go to Data Processing)\n", "\n", "We already have training and test datasets, but no validation dataset (which you need to create). Furthermore, the test dataset is missing the labels - the goal of the project is to predict these labels. 
\n", "\n", "To produce a validation set to evaluate model performance, split the training dataset into train and validation subsets using sklearn's [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. The validation data you create here is used later in section 3 to tune your classifier." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Implement here" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sns.catplot(x=\"age_groups\", hue=\"credit_risk\", kind=\"count\", data=training_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the age group with members younger than 25 yrs is at a disadvantage: almost as many of its applications get rejected as approved, whereas for the group with members $\\geq$ 25 yrs the ratio is almost 3:1 (i.e., three times as many applications approved as rejected)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train - Test - Validation datasets shapes: (720, 24) (200, 23) (80, 24)\n" ] } ], "source": [ "# We only need to split between train and val (test is already separate)\n", "train_data, val_data = train_test_split(\n", "    training_data, test_size=0.1, shuffle=True, random_state=23\n", ")\n", "\n", "# Print the shapes of the Train - Test - Validation datasets\n", "print(\n", "    \"Train - Test - Validation datasets shapes: \",\n", "    train_data.shape,\n", "    test_data.shape,\n", "    val_data.shape,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Data processing with Pipeline\n", "(Go to Data Processing)\n", "\n", "Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) to impute the missing values, scale the numerical features, one-hot encode the categorical features, and finally train a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) on the processed dataset. 
\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Implement here" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "### STEP 1 ###\n", "##############\n", "\n", "# Preprocess the numerical features\n", "numerical_processor = Pipeline(\n", "    [(\"num_imputer\", SimpleImputer(strategy=\"mean\")), (\"num_scaler\", MinMaxScaler())]\n", ")\n", "# Preprocess the categorical features\n", "categorical_processor = Pipeline(\n", "    [\n", "        (\"cat_imputer\", SimpleImputer(strategy=\"constant\", fill_value=\"missing\")),\n", "        (\"cat_encoder\", OneHotEncoder(handle_unknown=\"ignore\", drop=\"if_binary\")),\n", "    ]\n", ")\n", "\n", "### STEP 2 ###\n", "##############\n", "\n", "# Combine all data preprocessors from above\n", "data_processor = ColumnTransformer(\n", "    [\n", "        (\"numerical_processing\", numerical_processor, numerical_features),\n", "        (\"categorical_processing\", categorical_processor, categorical_features),\n", "    ]\n", ")\n", "\n", "### STEP 3 ###\n", "##############\n", "\n", "# Pipeline with all desired data transformers, along with an estimator at the end\n", "# Later you can set/reach the parameters using the names assigned here - for hyperparameter tuning, for example\n", "# Note: scikit-learn >= 1.2 expects penalty=None instead of the string \"none\"\n", "pipeline = Pipeline(\n", "    [\n", "        (\"data_processing\", data_processor),\n", "        (\"lg\", LogisticRegression(solver=\"lbfgs\", penalty=\"none\")),\n", "    ]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Train (and Tune) a Classifier (Implement)\n", "(Go to top)\n", "\n", "Train (and tune) the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) pipeline. For tuning, you can try different imputation strategies and different scaling methods."
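, "\n", "As a rough illustration of such tuning, the sketch below compares imputation strategies and scalers by their F1 score on a held-out validation split. It runs on synthetic stand-in data (not the German Credit data), and the names `toy`, `build_pipeline`, `val_scores` and `best_config` are hypothetical helpers introduced only for this example.\n", "\n", "
```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical values, not the German Credit data)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'credit_amount': rng.integers(250, 15000, n).astype(float),
    'credit_duration_months': rng.integers(4, 72, n).astype(float),
    'savings': rng.choice(['low', 'mid', 'high'], n),
})
# Made-up label that loosely depends on the credit duration
toy['credit_risk'] = (toy['credit_duration_months'] + rng.normal(0, 20, n) > 45).astype(int)
# Knock out a few values so that the imputer has something to do
toy.loc[rng.choice(n, 20, replace=False), 'credit_amount'] = np.nan

num_cols = ['credit_amount', 'credit_duration_months']
cat_cols = ['savings']
X_tr, X_val, y_tr, y_val = train_test_split(
    toy[num_cols + cat_cols], toy['credit_risk'], test_size=0.25, random_state=23
)


def build_pipeline(strategy, scaler):
    # Same three-step structure as in section 2.4, with the tunable parts exposed
    numeric = Pipeline([('imputer', SimpleImputer(strategy=strategy)), ('scaler', scaler)])
    processor = ColumnTransformer([
        ('num', numeric, num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ])
    return Pipeline([('prep', processor), ('clf', LogisticRegression(max_iter=1000))])


# Fit every configuration on the train split and score it on the validation split
val_scores = {}
for strategy in ['mean', 'median']:
    for scaler_name, scaler_cls in [('minmax', MinMaxScaler), ('standard', StandardScaler)]:
        model = build_pipeline(strategy, scaler_cls())
        model.fit(X_tr, y_tr)
        val_scores[(strategy, scaler_name)] = f1_score(y_val, model.predict(X_val))

best_config = max(val_scores, key=val_scores.get)
print(val_scores)
print('Best configuration:', best_config)
```
", "\n", "With the project data, the same loop could use `train_data` and `val_data` from section 2.3 in place of the synthetic split."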
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Implement here" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('data_processing',\n",
       "                 ColumnTransformer(transformers=[('numerical_processing',\n",
       "                                                  Pipeline(steps=[('num_imputer',\n",
       "                                                                   SimpleImputer()),\n",
       "                                                                  ('num_scaler',\n",
       "                                                                   MinMaxScaler())]),\n",
       "                                                  ['credit_amount',\n",
       "                                                   'credit_duration_months']),\n",
       "                                                 ('categorical_processing',\n",
       "                                                  Pipeline(steps=[('cat_imputer',\n",
       "                                                                   SimpleImputer(fill_value='missing',\n",
       "                                                                                 strategy='constant')),\n",
       "                                                                  ('cat_encoder',\n",
       "                                                                   OneHotEncoder(drop='if_binary',\n",
       "                                                                                 handle_unknown='ignore'))]),\n",
       "                                                  ['job_status',\n",
       "                                                   'employed_since_years',\n",
       "                                                   'savings',\n",
       "                                                   'age_groups'])])),\n",
       "                ('lg', LogisticRegression(penalty='none'))])
" ], "text/plain": [ "Pipeline(steps=[('data_processing',\n", " ColumnTransformer(transformers=[('numerical_processing',\n", " Pipeline(steps=[('num_imputer',\n", " SimpleImputer()),\n", " ('num_scaler',\n", " MinMaxScaler())]),\n", " ['credit_amount',\n", " 'credit_duration_months']),\n", " ('categorical_processing',\n", " Pipeline(steps=[('cat_imputer',\n", " SimpleImputer(fill_value='missing',\n", " strategy='constant')),\n", " ('cat_encoder',\n", " OneHotEncoder(drop='if_binary',\n", " handle_unknown='ignore'))]),\n", " ['job_status',\n", " 'employed_since_years',\n", " 'savings',\n", " 'age_groups'])])),\n", " ('lg', LogisticRegression(penalty='none'))])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get train data to train the classifier\n", "X_train = train_data[model_features]\n", "y_train = train_data[model_target]\n", "\n", "# Fit the classifier to the train data\n", "# Train data going through the Pipeline is imputed (with means from the train data),\n", "# scaled (with the min/max from the train data),\n", "# and finally used to fit the model\n", "pipeline.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Make Predictions on the Test Dataset (Implement)\n", "(Go to top)\n", "\n", "Use the trained classifier to predict the labels on the test set. Below you will find a code snippet that evaluates for DI." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Implement here\n", "\n", "# Get test data to test the classifier\n", "# ! test data should come from german_credit_test.csv !\n", "# ...\n", "\n", "# Use the trained model to make predictions on the test dataset\n", "# test_predictions = ..." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Get test data to validate the classifier\n", "X_test = test_data[model_features]\n", "\n", "# Use the fitted model to make predictions on the test dataset\n", "# Test data going through the Pipeline is imputed (with means from the train data),\n", "# scaled (with the min/max from the train data),\n", "# and finally used to make predictions\n", "test_predictions = pipeline.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Evaluate Results (Given)\n", "(Go to top)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "result_df = pd.DataFrame(columns=[\"ID\", \"credit_risk_pred\"])\n", "result_df[\"ID\"] = test_data[\"ID\"].tolist()\n", "result_df[\"credit_risk_pred\"] = test_predictions\n", "\n", "result_df.to_csv(\"../../data/final_project/project_day1_result.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final Evaluation on Test Data - Disparate Impact\n", "To evaluate the fairness of the model predictions, we will calculate the disparate impact (DI) metric. For more details about DI you can have a look [here](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-post-training-bias-metric-di.html)." 
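, "\n", "As a small worked example of the ratio behind DI, the sketch below uses made-up predictions (the `toy` frame and its values are hypothetical, not project results). It assumes the encoding used in this notebook: `age_groups == 0` is the group younger than 25 yrs, `age_groups == 1` is the group of 25 yrs and older, and a prediction of `0` means 'not a credit risk' (the favorable outcome).\n", "\n", "
```python
import pandas as pd

# Made-up predictions for illustration; 0 means 'not a credit risk' (favorable)
toy = pd.DataFrame({
    'age_groups': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'credit_risk_pred': [0, 0, 1, 1, 0, 0, 0, 0, 0, 1],
})

# Share of favorable predictions (label 0) within each age group
favorable_rate = (toy['credit_risk_pred'] == 0).groupby(toy['age_groups']).mean()

# DI = favorable rate of the under-25 group / favorable rate of the 25+ group
toy_di = favorable_rate.loc[0] / favorable_rate.loc[1]
print(round(toy_di, 3))  # (2/4) / (5/6) = 0.6
```
", "\n", "A value close to 1 means both groups receive favorable predictions at similar rates; values well below 1 indicate that the younger group is favored less often."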
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "def calculate_di(test_data, pred_df, pred_col=\"credit_risk_pred\"):\n", "    \"\"\"\n", "    Function to calculate Disparate Impact metric using the results from this notebook.\n", "    \"\"\"\n", "    try:\n", "        # Merge predictions with original test data to evaluate per group\n", "        di_df = pred_df.merge(test_data, on=\"ID\")\n", "        # Count for group with members less than 25y old\n", "        # .get(0, 0) takes the count of the '0 credit risk' == 'not credit risk' (0 if absent)\n", "        pos_outcomes_less25 = (\n", "            di_df[di_df[\"age_groups\"] == 0][pred_col].value_counts().get(0, 0)\n", "        )\n", "        total_less25 = len(di_df[di_df[\"age_groups\"] == 0])\n", "        # Count for group with members greater equal 25y old\n", "        pos_outcomes_geq25 = (\n", "            di_df[di_df[\"age_groups\"] == 1][pred_col].value_counts().get(0, 0)\n", "        )\n", "        total_geq25 = len(di_df[di_df[\"age_groups\"] == 1])\n", "        # Check if correct number of groups\n", "        if total_geq25 == 0:\n", "            print(\"There is only one group present in the data.\")\n", "        elif total_less25 == 0:\n", "            print(\"There is only one group present in the data.\")\n", "        else:\n", "            disparate_impact = (pos_outcomes_less25 / total_less25) / (\n", "                pos_outcomes_geq25 / total_geq25\n", "            )\n", "            return disparate_impact\n", "    except Exception:\n", "        print(\"Wrong inputs provided.\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9384288747346072" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "calculate_di(test_data, result_df, \"credit_risk_pred\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this might look good, keep in mind that `age_groups` was used to train the model; depending on the domain, it might not be permissible to use this feature."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final Evaluation on Test Data - Accuracy & F1 Score\n", "In addition to fairness evaluation, we also need to check the general model performance. During the EDA stage we learned that the target distribution is skewed so we will use F1 score in addition to accuracy." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.735" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "accuracy_score(\n", " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n", " \"credit_risk\"\n", " ],\n", " result_df[\"credit_risk_pred\"],\n", ")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.3116883116883117" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "f1_score(\n", " pd.read_csv(\"../../data/final_project/german_credit_test_labels.csv\")[\n", " \"credit_risk\"\n", " ],\n", " result_df[\"credit_risk_pred\"],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the end of the notebook." ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 4 }