{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Responsible AI - Logistic Regression \n",
"\n",
"\n",
"This notebook shows how to build a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model to predict whether an individuals' income is $\\leq$ 50k or not using US census data.\n",
"\n",
"__Dataset:__ \n",
"You will download a dataset for this exercise using [folktables](https://github.com/zykls/folktables). Folktables provides an API to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files which are managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html).\n",
"\n",
"\n",
"__ML Problem:__ \n",
"Ultimately, the goal will be to predict whether an individual's income is above \\\\$50,000. We will filter the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least \\\\$100. The threshold of \\\\$50,000 was chosen so that this dataset can serve as a comparable substitute to the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). The income threshold can be changed easily to define new prediction tasks.\n",
"\n",
"\n",
"1. Read the dataset\n",
"2. Data Processing\n",
" * Exploratory Data Analysis\n",
" * Select features to build the model\n",
" * Train - Validation - Test Datasets\n",
" * Data processing with Pipeline and ColumnTransformer\n",
"3. Train (and Tune) a Classifier\n",
"4. Test the Classifier\n",
"5. Accuracy Difference and DPPL\n",
"\n",
"\n",
"Before building the logistic regression model, let's have a quick look at how a linear regression can be turned into a classifier.\n",
"\n",
"Let's assume we want to use a feature (e.g. Class of Worker) to build a model that can predict the income class. A first step could be to plot the feature vs the model target. As class of worker is a categorical feature we introduce a little jitter to see the data points more easily.\n",
"\n",
"\n",
"\n",
"Now that we have a plot, we can fit a linear regression through the data points.\n",
"\n",
"
\n",
"\n",
"As we notice, the linear regression line extends beyond the target values. This is not ideal as we can end up making predictions that are outside the model target range (0 or 1); a linear regression can predict values in the range ($-\\inf$, $+\\inf$). To squish the linear regression values into a range of 0 to 1, we need to use a helper function. For this, we can use the sigmoid function:\n",
"\n",
"
\n", " | AGEP | \n", "COW | \n", "SCHL | \n", "MAR | \n", "OCCP | \n", "POBP | \n", "RELP | \n", "WKHP | \n", "SEX | \n", "RAC1P | \n", "PWGTP | \n", "GCL | \n", ">50k | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "30.0 | \n", "6.0 | \n", "14.0 | \n", "1.0 | \n", "9610.0 | \n", "6.0 | \n", "16.0 | \n", "40.0 | \n", "1.0 | \n", "8.0 | \n", "32.0 | \n", "2.0 | \n", "0.0 | \n", "
27 | \n", "23.0 | \n", "2.0 | \n", "21.0 | \n", "5.0 | \n", "2545.0 | \n", "207.0 | \n", "17.0 | \n", "20.0 | \n", "2.0 | \n", "6.0 | \n", "35.0 | \n", "NaN | \n", "0.0 | \n", "
33 | \n", "18.0 | \n", "1.0 | \n", "16.0 | \n", "5.0 | \n", "9610.0 | \n", "6.0 | \n", "17.0 | \n", "8.0 | \n", "2.0 | \n", "6.0 | \n", "33.0 | \n", "NaN | \n", "0.0 | \n", "
46 | \n", "40.0 | \n", "1.0 | \n", "15.0 | \n", "3.0 | \n", "4140.0 | \n", "303.0 | \n", "16.0 | \n", "22.0 | \n", "1.0 | \n", "8.0 | \n", "38.0 | \n", "2.0 | \n", "0.0 | \n", "
49 | \n", "18.0 | \n", "1.0 | \n", "18.0 | \n", "5.0 | \n", "725.0 | \n", "6.0 | \n", "17.0 | \n", "12.0 | \n", "2.0 | \n", "6.0 | \n", "60.0 | \n", "NaN | \n", "0.0 | \n", "
Pipeline(steps=[('data_processing',
                 ColumnTransformer(transformers=[('numerical_processing',
                                                  Pipeline(steps=[('num_imputer',
                                                                   SimpleImputer()),
                                                                  ('num_scaler',
                                                                   MinMaxScaler())]),
                                                  ['AGEP', 'WKHP', 'PWGTP']),
                                                 ('categorical_processing',
                                                  Pipeline(steps=[('cat_imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('cat_encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['COW', 'SCHL', 'MAR', 'RELP',
                                                   'SEX', 'RAC1P'])])),
                ('lg', LogisticRegression(penalty='none'))])