![MLU Logo](../../data/MLU_Logo.png)

# <a name="0">Responsible AI - Data Processing</a>

This notebook shows basic data processing steps required to get data ready for model ingestion.

__Dataset:__ 
You will download a dataset for this exercise using [folktables](https://github.com/zykls/folktables). Folktables provides an API to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files which are managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html).

__ML Problem:__ 
Ultimately, the goal will be to predict whether an individual's income is above \\$50,000. We will filter the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least \\$100. The threshold of \\$50,000 was chosen so that this dataset can serve as a comparable substitute to the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). The income threshold can be changed easily to define new prediction tasks.

__Table of contents__
1. <a href="#1">Loading Data</a>
2. <a href="#2">Data Prep: Basics</a>
3. <a href="#3">Data Prep: Missing Values</a>
4. <a href="#4">Data Prep: Renaming Columns</a>
5. <a href="#5">Data Prep: Encoding Categoricals</a>
6. <a href="#6">Data Prep: Scaling Numericals</a>

This notebook assumes an installation of the SageMaker kernel `conda_pytorch_p39`. In addition, the libraries listed in `requirements.txt` need to be installed:

In [1]:
!pip install --no-deps -U -q -r ../../requirements.txt

In [2]:
# Reshaping/basic libraries
import numpy as np
import pandas as pd

# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid", {"axes.facecolor": ".9"})

# Operational libraries
import sys

sys.path.append("..")

# ML libraries
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Fairness libraries
from folktables.acs import *
from folktables.folktables import *
from folktables.load_acs import *

# Jupyter(lab) libraries
import warnings

warnings.filterwarnings("ignore")

## 1. <a name="1">Loading Data</a>
(<a href="#0">Go to top</a>)

To read in the dataset, we will be using [folktables](https://github.com/zykls/folktables) which provides access to the US Census dataset. Folktables contains predefined prediction tasks but also allows the user to specify the problem type.

The US Census dataset distinguishes between households and individuals. To obtain data on individuals, we use `ACSDataSource` with `survey="person"`. The feature names for the US Census data follow the same distinction and use `P` for `person` and `H` for `household`; for example, `AGEP` refers to the age of an individual.

In [3]:
income_features = [
    "AGEP",  # age individual
    "COW",  # class of worker
    "SCHL",  # educational attainment
    "MAR",  # marital status
    "OCCP",  # occupation
    "POBP",  # place of birth
    "RELP",  # relationship
    "WKHP",  # hours worked per week past 12 months
    "SEX",  # sex
    "RAC1P",  # recorded detailed race code
    "PWGTP",  # persons weight
    "GCL",  # grandparents living with grandchildren
    "SCH",  # school enrollment
]

# Define the prediction problem and features
ACSIncome = folktables.BasicProblem(
    features=income_features,
    target="PINCP",  # total persons income
    target_transform=lambda x: x > 50000,
    group="RAC1P",
    preprocess=adult_filter,  # applies the following conditions; ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
    postprocess=lambda x: x,  # applies post processing, e.g. fill all NAs
)

# Initialize year, duration ("1-Year" or "5-Year") and granularity (household or person)
data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
# Specify region (here: California) and load data
ca_data = data_source.get_data(states=["CA"], download=True)
# Apply transformation as per problem statement above
ca_features, ca_labels, ca_group = ACSIncome.df_to_numpy(ca_data)

Downloading data for 2018 1-Year person survey for CA...
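
To make the `preprocess` and `target_transform` arguments more concrete, here is a minimal sketch on a hand-made toy dataframe. It follows the filter described in the problem statement (age above 16, at least 1 working hour per week, income of at least \\$100) and then derives the binary label. The toy rows and the helper name `toy_adult_filter` are assumptions for illustration; this is not the folktables implementation of `adult_filter`.

```python
import pandas as pd

# Hypothetical toy data using the ACS PUMS column names from above
toy_people = pd.DataFrame(
    {
        "AGEP": [15, 30, 45, 62],  # age
        "WKHP": [10, 40, 0, 35],  # usual hours worked per week
        "PINCP": [500, 72000, 30000, 50],  # total person's income
    }
)


def toy_adult_filter(data):
    """Illustrative filter following the problem statement (not folktables' code)."""
    keep = (data["AGEP"] > 16) & (data["WKHP"] >= 1) & (data["PINCP"] >= 100)
    return data[keep]


filtered = toy_adult_filter(toy_people)

# Same transform as `target_transform` in the problem definition
toy_labels = filtered["PINCP"] > 50000

print(filtered)
print(toy_labels.tolist())
```

Only the second toy row passes all three conditions, and its income is above the threshold, so it receives a positive label. Changing the constant in the last comparison is all it takes to define a different prediction task.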


## 2. <a name="2">Data Prep: Basics</a>
(<a href="#0">Go to top</a>)

We want to go through the basic steps of data prep: converting all categorical features into dummy features (0/1 encoding) and scaling numerical values. Scaling matters because many ML techniques rely on distance measures, and features with large value ranges can dominate those distances. Before you start the encoding and scaling, you should have a look at the main characteristics of the dataset first.

In [4]:
# Convert numpy array to dataframe
df = pd.DataFrame(
    np.concatenate((ca_features, ca_labels.reshape(-1, 1)), axis=1),
    columns=income_features + [">50k"],
)

# Print the first five rows
# NaN means missing data
df.head()

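|   | AGEP | COW | SCHL | MAR | OCCP | POBP | RELP | WKHP | SEX | RAC1P | PWGTP | GCL | SCH | >50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.0 | 6.0 | 14.0 | 1.0 | 9610.0 | 6.0 | 16.0 | 40.0 | 1.0 | 8.0 | 32.0 | 2.0 | 1.0 | 0.0 |
| 1 | 21.0 | 4.0 | 16.0 | 5.0 | 1970.0 | 6.0 | 17.0 | 20.0 | 1.0 | 1.0 | 52.0 | NaN | 2.0 | 0.0 |
| 2 | 65.0 | 2.0 | 22.0 | 5.0 | 2040.0 | 6.0 | 17.0 | 8.0 | 1.0 | 1.0 | 33.0 | 2.0 | 1.0 | 0.0 |
| 3 | 33.0 | 1.0 | 14.0 | 3.0 | 9610.0 | 36.0 | 16.0 | 40.0 | 1.0 | 1.0 | 53.0 | 2.0 | 1.0 | 0.0 |
| 4 | 18.0 | 2.0 | 19.0 | 5.0 | 1021.0 | 6.0 | 17.0 | 18.0 | 2.0 | 1.0 | 106.0 | NaN | 3.0 | 0.0 |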


Let's cast the categorical and numerical features accordingly (see EDA for additional explanation).

In [5]:
categorical_features = [
    "COW",
    "SCHL",
    "MAR",
    "OCCP",
    "POBP",
    "RELP",
    "SEX",
    "RAC1P",
    "GCL",
    "SCH",
]

numerical_features = ["AGEP", "WKHP", "PWGTP"]

In [6]:
# We cast categorical features to `object`
df[categorical_features] = df[categorical_features].astype("object")

# We cast numerical features to `int`
df[numerical_features] = df[numerical_features].astype("int")

Looks good, so we can now separate model features from model target to explore them separately. 

#### Model Target & Model Features

In [7]:
model_target = ">50k"
model_features = categorical_features + numerical_features

In [8]:
# Double check that the target is not accidentally part of the features
model_target in model_features

False

All good here. We made sure that the target is not in the feature list. If the check above had returned `True`, we would need to remove the target by calling `model_features.remove(model_target)`.

Let's have a look at missing values next.

## 3. <a name="3">Data Prep: Missing Values</a>
(<a href="#0">Go to top</a>)

The quickest way to check for missing values is to use `.isna().sum()`. This provides a count of missing values per column. We can also infer the number of missing values from `.info()`, as it provides a count of non-null values per column.

In [9]:
# Show missing values
df.isna().sum()

AGEP         0
COW          0
SCHL         0
MAR          0
OCCP         0
POBP         0
RELP         0
WKHP         0
SEX          0
RAC1P        0
PWGTP        0
GCL      46273
SCH          0
>50k         0
dtype: int64
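
As a small illustration of how these checks relate to each other, the sketch below uses a hypothetical three-row dataframe (not the ACS data). It shows that the missing count from `.isna().sum()` and the non-null count reported by `.count()` (the same figure `.info()` prints) add up to the number of rows, and that `.isna().mean()` gives the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries in GCL
toy_missing = pd.DataFrame({"AGEP": [30, 21, 65], "GCL": [2.0, np.nan, np.nan]})

missing_counts = toy_missing.isna().sum()  # missing values per column
non_null_counts = toy_missing.count()  # non-null values per column
missing_share = toy_missing.isna().mean()  # fraction of missing values per column

print(missing_counts)
print(non_null_counts)
print(missing_share)

# .info() prints the non-null count and dtype of every column
toy_missing.info()
```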

To fill missing values we will use scikit-learn's `SimpleImputer`. `SimpleImputer` is a scikit-learn transformer, which means we first need to fit it and can then apply the transformation to our data. We start by initializing the transformer:

In [10]:
# Depending on the data type we need different imputation strategies!

# If we have missing values in a numerical column, we can fill them with the column mean
imputer_numerical = SimpleImputer(strategy="mean")

# If we have missing values in a categorical column, we can fill them with the constant "missing"
imputer_categorical = SimpleImputer(strategy="constant", fill_value="missing")

Once the transformers have been initialized, we can fit them and apply them to data.

In [11]:
imputer_numerical.fit(df[numerical_features])
imputer_categorical.fit(df[categorical_features])

The `.fit()` method learns the transformation (e.g. the mean per column or the most frequent value). Now that the transformation is learned, we can apply it. Be careful when the dataset has been split into training, validation and test subsets: the transformation must be learned on the training set only and can then be applied to all other subsets.

In [12]:
df_num = imputer_numerical.transform(df[numerical_features])
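# Cast all categoricals to string before imputing. Note that `astype(str)` turns NaN
# into the string "nan", so the imputer no longer sees these entries as missing and
# they are kept as their own "nan" category instead of being replaced by "missing".
df_cat = imputer_categorical.transform(df[categorical_features].astype(str))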

df = pd.concat(
    [
        pd.DataFrame(df_num, columns=numerical_features),
        pd.DataFrame(df_cat, columns=categorical_features),
    ],
    axis=1,
).copy(deep=True)

# Show missing values
df.isna().sum()

AGEP     0
WKHP     0
PWGTP    0
COW      0
SCHL     0
MAR      0
OCCP     0
POBP     0
RELP     0
SEX      0
RAC1P    0
GCL      0
SCH      0
dtype: int64
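
The point about fitting on the training set only can be sketched as follows. The toy column, the split ratio and the variable names are assumptions for illustration; the notebook itself does not split the data at this stage.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy column with two missing values
toy_hours = pd.DataFrame(
    {"WKHP": [40.0, np.nan, 20.0, 35.0, np.nan, 45.0, 10.0, 30.0]}
)

toy_train, toy_test = train_test_split(toy_hours, test_size=0.25, random_state=0)

# Learn the column mean from the training rows only
toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(toy_train)

# Apply the same learned mean to both subsets
toy_train_imputed = toy_imputer.transform(toy_train)
toy_test_imputed = toy_imputer.transform(toy_test)

# The learned statistic equals the training mean (NaN entries are ignored)
print(toy_imputer.statistics_, toy_train["WKHP"].mean())
```

Fitting on the full dataset instead would let information from the test rows leak into the fill value.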

Let's take a quick detour and rename the columns to make them easier to understand.

## 4. <a name="4">Data Prep: Renaming Columns</a>
(<a href="#0">Go to top</a>)

When looking at the dataframe, we notice that the column headers are not self-explanatory. This can make debugging and communicating results confusing. We should therefore consider renaming the column headers. We can do this with `.rename()`. To perform the renaming, we need to create a mapping between the old name and the new name we want to use.

In [13]:
# Create column name mapping
name_mapping = {
    "AGEP": "age_individual",
    "COW": "class_of_worker",
    "SCHL": "educational_attainment",
    "MAR": "marital_status",
    "OCCP": "occupation",
    "POBP": "place_of_birth",
    "RELP": "relationship",
    "WKHP": "hours_worked_weekly_past_year",
    "SEX": "sex",
    "RAC1P": "race_code",
    "PWGTP": "persons_weight",
    "GCL": "grand_parents_living_with_grandchildren",
    "SCH": "school_enrollment",
}

# Rename the columns
df.rename(name_mapping, axis=1, inplace=True)

# Make sure to update the lists that contain the categorical and numerical features
categorical_features = [
    name_mapping[k] for k in name_mapping.keys() if k in categorical_features
]
numerical_features = [
    name_mapping[k] for k in name_mapping.keys() if k in numerical_features
]

Now that we have dealt with the missing values and renamed the columns, we can convert the categorical columns to one-hot encoded versions (dummies).

## 5. <a name="5">Data Prep: Encoding Categoricals</a>
(<a href="#0">Go to top</a>)

One-hot encoding only works if there are no NAs left in the dataframe, which is why we had to deal with the missing values first. Once again, we will use a transformer from scikit-learn, `OneHotEncoder`.

In [14]:
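# Initialize OneHotEncoder
# Note: scikit-learn >= 1.2 renames the `sparse` argument to `sparse_output`
ohe = OneHotEncoder(handle_unknown="ignore", sparse=False)

# Fit and transform in one step
df_cat_ohe = ohe.fit_transform(df[categorical_features])

# Create dataframe
# Note: scikit-learn >= 1.0 provides `get_feature_names_out`, which replaces `get_feature_names`
df_cat_new = pd.DataFrame(
    df_cat_ohe, columns=ohe.get_feature_names(categorical_features)
)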

df_cat_new.head()

Unnamed: 0,class_of_worker_1.0,class_of_worker_2.0,class_of_worker_3.0,class_of_worker_4.0,class_of_worker_5.0,class_of_worker_6.0,class_of_worker_7.0,class_of_worker_8.0,educational_attainment_1.0,educational_attainment_10.0,...,race_code_6.0,race_code_7.0,race_code_8.0,race_code_9.0,grand_parents_living_with_grandchildren_1.0,grand_parents_living_with_grandchildren_2.0,grand_parents_living_with_grandchildren_nan,school_enrollment_1.0,school_enrollment_2.0,school_enrollment_3.0
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
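
The effect of `handle_unknown="ignore"` is easiest to see on a tiny example. The sketch below uses hypothetical category values and variable names; it fits the encoder on three categories and then transforms data that contains a category the encoder has never seen. It relies on the encoder's default sparse output and converts it with `.toarray()`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories seen during fitting
seen = pd.DataFrame({"marital_status": ["1", "3", "5"]})
# New data: "4" was not present when the encoder was fitted
unseen = pd.DataFrame({"marital_status": ["3", "4"]})

toy_ohe = OneHotEncoder(handle_unknown="ignore")
toy_ohe.fit(seen)

# The default output is a sparse matrix; convert it to a dense array for display
encoded = toy_ohe.transform(unseen).toarray()

print(toy_ohe.categories_)
print(encoded)
```

The known category `"3"` is encoded as `[0, 1, 0]`, while the unknown category `"4"` becomes a row of zeros instead of raising an error. This is convenient when encoding validation or test data, but it also means unseen categories are silently mapped to "none of the known values".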


## 6. <a name="6">Data Prep: Scaling Numericals</a>
(<a href="#0">Go to top</a>)

Generally in ML we want all our numerical features to be on the same scale. This prevents features from being treated as more important purely because of the magnitude of their values. Scaling also helps algorithms that use distance measures to evaluate similarity. We can use `MinMaxScaler` or `StandardScaler` for scaling numerical features.
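
To illustrate why scale matters for distance measures, consider the following minimal sketch. The three points and their values are made up for illustration: each point is an (age, person weight) pair, and the min-max scaling is done by hand with NumPy.

```python
import numpy as np

# Hypothetical individuals described by (age in years, person weight)
p_a = np.array([25.0, 1200.0])
p_b = np.array([65.0, 1210.0])  # very different age, similar weight
p_c = np.array([26.0, 1500.0])  # similar age, very different weight


def euclidean(x, y):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((x - y) ** 2)))


# On the raw values, the weight column dominates the distance
d_ab_raw = euclidean(p_a, p_b)
d_ac_raw = euclidean(p_a, p_c)

# Min-max scale each column to [0, 1] using the three points
points = np.vstack([p_a, p_b, p_c])
points_scaled = (points - points.min(axis=0)) / (
    points.max(axis=0) - points.min(axis=0)
)

d_ab_scaled = euclidean(points_scaled[0], points_scaled[1])
d_ac_scaled = euclidean(points_scaled[0], points_scaled[2])

print(d_ab_raw, d_ac_raw)
print(d_ab_scaled, d_ac_scaled)
```

Before scaling, the 40-year age gap between the first two points barely registers next to the weight difference of the third point. After scaling, both differences contribute on a comparable footing.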

In [15]:
# Initialize MinMaxScaler
mms = MinMaxScaler()

# Fit and transform in one step
df_num_mms = mms.fit_transform(df[numerical_features])

# Create dataframe
df_num_new = pd.DataFrame(df_num_mms, columns=numerical_features)

df_num_new.head()

|   | age_individual | hours_worked_weekly_past_year | persons_weight |
|---|---|---|---|
| 0 | 0.168831 | 0.397959 | 0.019411 |
| 1 | 0.051948 | 0.193878 | 0.031935 |
| 2 | 0.623377 | 0.071429 | 0.020038 |
| 3 | 0.207792 | 0.397959 | 0.032561 |
| 4 | 0.012987 | 0.173469 | 0.065748 |
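
For reference, `MinMaxScaler` with default settings maps each column to the range [0, 1] via `(x - min) / (max - min)`, while `StandardScaler` subtracts the column mean and divides by the column standard deviation. The sketch below checks both on a hypothetical single-column array; the values and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical ages as a single-column feature matrix
ages = np.array([[18.0], [30.0], [65.0], [94.0]])

# Min-max scaling computed by hand
ages_manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# The same result from scikit-learn
ages_minmax = MinMaxScaler().fit_transform(ages)

# Standardization: zero mean and unit variance per column
ages_standard = StandardScaler().fit_transform(ages)

print(ages_manual.ravel())
print(ages_minmax.ravel())
print(ages_standard.ravel())
```

Min-max scaling keeps all values within fixed bounds but is sensitive to extreme values, since a single outlier determines the minimum or maximum. Standardization has no fixed bounds but is often preferred when features are roughly bell-shaped.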


This is the end of this notebook.