{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![MLU Logo](../../data/MLU_Logo.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Responsible AI - Data Processing\n", "\n", "This notebook shows basic data processing steps required to get data ready for model ingestion.\n", "\n", "__Dataset:__ \n", "You will download a dataset for this exercise using [folktables](https://github.com/zykls/folktables). Folktables provides an API to download data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) files which are managed by the US Census Bureau. The data itself is governed by the terms of use provided by the Census Bureau. For more information, see the [Terms of Service](https://www.census.gov/data/developers/about/terms-of-service.html).\n", "\n", "__ML Problem:__ \n", "Ultimately, the goal will be to predict whether an individual's income is above \\\\$50,000. We will filter the ACS PUMS data sample to only include individuals above the age of 16, who reported usual working hours of at least 1 hour per week in the past year, and an income of at least \\\\$100. The threshold of \\\\$50,000 was chosen so that this dataset can serve as a comparable substitute to the [UCI Adult dataset](https://archive.ics.uci.edu/ml/datasets/adult). The income threshold can be changed easily to define new prediction tasks.\n", "\n", "__Table of contents__\n", "1. Loading Data\n", "2. Data Prep: Basics\n", "3. Data Prep: Missing Values\n", "4. Data Prep: Renaming Columns\n", "5. Data Prep: Encoding Categoricals\n", "5. Data Prep: Scaling Numericals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook assumes an installation of the SageMaker kernel `conda_pytorch_p39`. In addition, libraries from a requirements.txt need to be installed:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "!pip install --no-deps -U -q -r ../../requirements.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [], "source": [ "# Reshaping/basic libraries\n", "import pandas as pd\n", "\n", "# Plotting libraries\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "sns.set_style(\"darkgrid\", {\"axes.facecolor\": \".9\"})\n", "\n", "# Operational libraries\n", "import sys\n", "\n", "sys.path.append(\"..\")\n", "\n", "# ML libraries\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder, MinMaxScaler\n", "\n", "# Fairness libraries\n", "from folktables.acs import *\n", "from folktables.folktables import *\n", "from folktables.load_acs import *\n", "\n", "# Jupyter(lab) libraries\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 1. Loading Data\n", "(Go to top)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To read in the dataset, we will be using [folktables](https://github.com/zykls/folktables) which provides access to the US Census dataset. Folktables contains predefined prediction tasks but also allows the user to specify the problem type." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The US Census dataset distinguishes between household and individuals. To obtain data on individuals, we use `ACSDataSource` with `survey=person`. 
"The feature names for the US Census data follow the same distinction and use `P` for `person` and `H` for `household`, e.g., `AGEP` refers to the age of an individual." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data for 2018 1-Year person survey for CA...\n" ] } ], "source": [ "income_features = [\n", "    \"AGEP\",  # age of individual\n", "    \"COW\",  # class of worker\n", "    \"SCHL\",  # educational attainment\n", "    \"MAR\",  # marital status\n", "    \"OCCP\",  # occupation\n", "    \"POBP\",  # place of birth\n", "    \"RELP\",  # relationship\n", "    \"WKHP\",  # hours worked per week past 12 months\n", "    \"SEX\",  # sex\n", "    \"RAC1P\",  # recoded detailed race code\n", "    \"PWGTP\",  # person's weight\n", "    \"GCL\",  # grandparents living with grandchildren\n", "    \"SCH\",  # school enrollment\n", "]\n", "\n", "# Define the prediction problem and features\n", "ACSIncome = folktables.BasicProblem(\n", "    features=income_features,\n", "    target=\"PINCP\",  # total person's income\n", "    target_transform=lambda x: x > 50000,\n", "    group=\"RAC1P\",\n", "    preprocess=adult_filter,  # applies the following conditions: ((AAGE > 16) && (AGI > 100) && (AFNLWGT > 1) && (HRSWK > 0))\n", "    postprocess=lambda x: x,  # applies post-processing, e.g., filling all NAs\n", ")\n", "\n", "# Initialize year, duration (\"1-Year\" or \"5-Year\") and granularity (household or person)\n", "data_source = ACSDataSource(survey_year=\"2018\", horizon=\"1-Year\", survey=\"person\")\n", "# Specify region (here: California) and load data\n", "ca_data = data_source.get_data(states=[\"CA\"], download=True)\n", "# Apply transformation as per problem statement above\n", "ca_features, ca_labels, ca_group = ACSIncome.df_to_numpy(ca_data)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 2. Data Prep: Basics\n", "(Go to top)\n", "\n", "We now go through basic data prep steps: converting all categorical features into dummy features (0/1 encoding) and scaling numerical values. Scaling is important because many ML techniques rely on distance measures, and features on larger scales can dominate those measures. Before starting the encoding and scaling, you should have a look at the main characteristics of the dataset first." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "data": {
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AGEPCOWSCHLMAROCCPPOBPRELPWKHPSEXRAC1PPWGTPGCLSCH>50k
030.06.014.01.09610.06.016.040.01.08.032.02.01.00.0
121.04.016.05.01970.06.017.020.01.01.052.0NaN2.00.0
265.02.022.05.02040.06.017.08.01.01.033.02.01.00.0
333.01.014.03.09610.036.016.040.01.01.053.02.01.00.0
418.02.019.05.01021.06.017.018.02.01.0106.0NaN3.00.0
\n", "
" ], "text/plain": [ " AGEP COW SCHL MAR OCCP POBP RELP WKHP SEX RAC1P PWGTP GCL \\\n", "0 30.0 6.0 14.0 1.0 9610.0 6.0 16.0 40.0 1.0 8.0 32.0 2.0 \n", "1 21.0 4.0 16.0 5.0 1970.0 6.0 17.0 20.0 1.0 1.0 52.0 NaN \n", "2 65.0 2.0 22.0 5.0 2040.0 6.0 17.0 8.0 1.0 1.0 33.0 2.0 \n", "3 33.0 1.0 14.0 3.0 9610.0 36.0 16.0 40.0 1.0 1.0 53.0 2.0 \n", "4 18.0 2.0 19.0 5.0 1021.0 6.0 17.0 18.0 2.0 1.0 106.0 NaN \n", "\n", " SCH >50k \n", "0 1.0 0.0 \n", "1 2.0 0.0 \n", "2 1.0 0.0 \n", "3 1.0 0.0 \n", "4 3.0 0.0 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert numpy array to dataframe\n", "df = pd.DataFrame(\n", " np.concatenate((ca_features, ca_labels.reshape(-1, 1)), axis=1),\n", " columns=income_features + [\">50k\"],\n", ")\n", "\n", "# Print the first five rows\n", "# NaN means missing data\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's cast the categorical and numerical features accordingly (see EDA for additional explanation)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [] }, "outputs": [], "source": [ "categorical_features = [\n", " \"COW\",\n", " \"SCHL\",\n", " \"MAR\",\n", " \"OCCP\",\n", " \"POBP\",\n", " \"RELP\",\n", " \"SEX\",\n", " \"RAC1P\",\n", " \"GCL\",\n", " \"SCH\",\n", "]\n", "\n", "numerical_features = [\"AGEP\", \"WKHP\", \"PWGTP\"]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [] }, "outputs": [], "source": [ "# We cast categorical features to `category`\n", "df[categorical_features] = df[categorical_features].astype(\"object\")\n", "\n", "# We cast numerical features to `int`\n", "df[numerical_features] = df[numerical_features].astype(\"int\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good, so we can now separate model features from model target to explore them separately. \n", "\n", "#### Model Target & Model Features" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "model_target = \">50k\"\n", "model_features = categorical_features + numerical_features" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Double check that that target is not accidentally part of the features\n", "model_target in model_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All good here. We made sure that the target is not in the feature list. If we find the above statement showing `True` we need to remove the target by calling `model_features.remove(model_target)`.\n", "\n", "Let's have a look at missing values next." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 3. Data Prep: Missing Values\n", "(Go to top)\n", "\n", "The quickest way to check for missing values is to use `.isna().sum()`. This will provide a count of how many missing values we have. In fact, we can also see the count of missing values with `.info()` as it provided a count of non-null values." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AGEP 0\n", "COW 0\n", "SCHL 0\n", "MAR 0\n", "OCCP 0\n", "POBP 0\n", "RELP 0\n", "WKHP 0\n", "SEX 0\n", "RAC1P 0\n", "PWGTP 0\n", "GCL 46273\n", "SCH 0\n", ">50k 0\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show missing values\n", "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To fill missing values we will use Sklearns `SimpleImputer`. `SimpleImputer` is a Sklearn transformer which means we first need to fit it and then we can apply the transformation to our data. We start by initializing the transformer:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Depending on the data type we need different imputation strategies!\n", "\n", "# If we have missing values in a numerical column, we can backfill with the mean\n", "imputer_numerical = SimpleImputer(strategy=\"mean\")\n", "\n", "# If we have missing values in a categorical column, we can backfill with \"missing\"\n", "imputer_categorical = SimpleImputer(strategy=\"constant\", fill_value=\"missing\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the transformers have been initialized, we can fit them and apply them to data." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "SimpleImputer(fill_value='missing', strategy='constant')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imputer_numerical.fit(df[numerical_features])\n", "imputer_categorical.fit(df[categorical_features])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `.fit()` method learns the transformation (i.e. learns the mean per column, finds most frequent value, ...). Now that the transformation is learned, we can apply it. Careful when doing this on a dataset that was split to create a train, test and validation subset. The transformation needs to be learned on the training set and can then be applied to all other subsets." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AGEP 0\n", "WKHP 0\n", "PWGTP 0\n", "COW 0\n", "SCHL 0\n", "MAR 0\n", "OCCP 0\n", "POBP 0\n", "RELP 0\n", "SEX 0\n", "RAC1P 0\n", "GCL 0\n", "SCH 0\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_num = imputer_numerical.transform(df[numerical_features])\n", "df_cat = imputer_categorical.transform(\n", " df[categorical_features].astype(str)\n", ") # make sure to cast all other categoricals as string\n", "\n", "df = pd.concat(\n", " [\n", " pd.DataFrame(df_num, columns=numerical_features),\n", " pd.DataFrame(df_cat, columns=categorical_features),\n", " ],\n", " axis=1,\n", ").copy(deep=True)\n", "\n", "# Show missing values\n", "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a quick detour and rename the columns to make them easier to understand." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 4. Data Prep: Renaming Columns\n", "(Go to top)\n", "\n", "When looking at the dataframe, we notice that the column headers are not self-explanatory. This will make debugging and communicating results potentially confusing. We should therefore consider to rename the column headers. We can do this with `.rename()`. To perform the renaming, we need to create a mapping between the old name and the new name we want to use." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Create column name mapping\n", "name_mapping = {\n", " \"AGEP\": \"age_individual\",\n", " \"COW\": \"class_of_worker\",\n", " \"SCHL\": \"educational_attainment\",\n", " \"MAR\": \"marital_status\",\n", " \"OCCP\": \"occupation\",\n", " \"POBP\": \"place_of_birth\",\n", " \"RELP\": \"relationship\",\n", " \"WKHP\": \"hours_worked_weekly_past_year\",\n", " \"SEX\": \"sex\",\n", " \"RAC1P\": \"race_code\",\n", " \"PWGTP\": \"persons_weight\",\n", " \"GCL\": \"grand_parents_living_with_grandchildren\",\n", " \"SCH\": \"school_enrollment\",\n", "}\n", "\n", "# Rename the columns\n", "df.rename(name_mapping, axis=1, inplace=True)\n", "\n", "# Make sure to update the lists that contain the categorical and numerical features\n", "categorical_features = [\n", " name_mapping[k] for k in name_mapping.keys() if k in categorical_features\n", "]\n", "numerical_features = [\n", " name_mapping[k] for k in name_mapping.keys() if k in numerical_features\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have dealt with the missing values and renamed the columns, we can convert the categorical columns to one-hot encoded versions (dummies)." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 5. 
"(Go to top)\n", "\n", "One-hot encoding only works if there are no NAs left in the dataframe, which is why we dealt with the missing values first. Once again, we will use a transformer from scikit-learn, `OneHotEncoder`." ] },
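{ "cell_type": "markdown", "metadata": {}, "source": [ "To see what one-hot encoding produces, here is a toy sketch using pandas' `get_dummies` (the `toy` dataframe is made up purely for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A made-up categorical column\n", "toy = pd.DataFrame({\"color\": [\"red\", \"green\", \"red\"]})\n", "\n", "# Each category becomes its own 0/1 indicator column\n", "pd.get_dummies(toy, columns=[\"color\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`OneHotEncoder` applies the same idea to all categorical features at once and, unlike `get_dummies`, can be fitted once and then reused on new data:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": {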
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
class_of_worker_1.0class_of_worker_2.0class_of_worker_3.0class_of_worker_4.0class_of_worker_5.0class_of_worker_6.0class_of_worker_7.0class_of_worker_8.0educational_attainment_1.0educational_attainment_10.0...race_code_6.0race_code_7.0race_code_8.0race_code_9.0grand_parents_living_with_grandchildren_1.0grand_parents_living_with_grandchildren_2.0grand_parents_living_with_grandchildren_nanschool_enrollment_1.0school_enrollment_2.0school_enrollment_3.0
00.00.00.00.00.01.00.00.00.00.0...0.00.01.00.00.01.00.01.00.00.0
10.00.00.01.00.00.00.00.00.00.0...0.00.00.00.00.00.01.00.01.00.0
20.01.00.00.00.00.00.00.00.00.0...0.00.00.00.00.01.00.01.00.00.0
31.00.00.00.00.00.00.00.00.00.0...0.00.00.00.00.01.00.01.00.00.0
40.01.00.00.00.00.00.00.00.00.0...0.00.00.00.00.00.01.00.00.01.0
\n", "

5 rows × 820 columns

\n", "
" ], "text/plain": [ " class_of_worker_1.0 class_of_worker_2.0 class_of_worker_3.0 \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 1.0 0.0 \n", "3 1.0 0.0 0.0 \n", "4 0.0 1.0 0.0 \n", "\n", " class_of_worker_4.0 class_of_worker_5.0 class_of_worker_6.0 \\\n", "0 0.0 0.0 1.0 \n", "1 1.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " class_of_worker_7.0 class_of_worker_8.0 educational_attainment_1.0 \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " educational_attainment_10.0 ... race_code_6.0 race_code_7.0 \\\n", "0 0.0 ... 0.0 0.0 \n", "1 0.0 ... 0.0 0.0 \n", "2 0.0 ... 0.0 0.0 \n", "3 0.0 ... 0.0 0.0 \n", "4 0.0 ... 0.0 0.0 \n", "\n", " race_code_8.0 race_code_9.0 grand_parents_living_with_grandchildren_1.0 \\\n", "0 1.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " grand_parents_living_with_grandchildren_2.0 \\\n", "0 1.0 \n", "1 0.0 \n", "2 1.0 \n", "3 1.0 \n", "4 0.0 \n", "\n", " grand_parents_living_with_grandchildren_nan school_enrollment_1.0 \\\n", "0 0.0 1.0 \n", "1 1.0 0.0 \n", "2 0.0 1.0 \n", "3 0.0 1.0 \n", "4 1.0 0.0 \n", "\n", " school_enrollment_2.0 school_enrollment_3.0 \n", "0 0.0 0.0 \n", "1 1.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 1.0 \n", "\n", "[5 rows x 820 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Initialize OneHotEncoder\n", "ohe = OneHotEncoder(handle_unknown=\"ignore\", sparse=False)\n", "\n", "# Fit and transform in one step\n", "df_cat_ohe = ohe.fit_transform(df[categorical_features])\n", "\n", "# Create dataframe\n", "df_cat_new = pd.DataFrame(\n", " df_cat_ohe, columns=ohe.get_feature_names(categorical_features)\n", ")\n", "\n", "df_cat_new.head()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## 6. Data Prep: Scaling Numericals\n", "(Go to top)\n", "\n", "Generally in ML we want all our numerical features to be on the same scale. This avoids that certain features are seen as more important based on their values alone. Scaling also helps for algorithms that use distance measures to evaluate similarity. We can use `MinMaxScaler` or `StandardScaler` for scaling numerical features." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
age_individualhours_worked_weekly_past_yearpersons_weight
00.1688310.3979590.019411
10.0519480.1938780.031935
20.6233770.0714290.020038
30.2077920.3979590.032561
40.0129870.1734690.065748
\n", "
" ], "text/plain": [ " age_individual hours_worked_weekly_past_year persons_weight\n", "0 0.168831 0.397959 0.019411\n", "1 0.051948 0.193878 0.031935\n", "2 0.623377 0.071429 0.020038\n", "3 0.207792 0.397959 0.032561\n", "4 0.012987 0.173469 0.065748" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Initialize MinMaxScaler\n", "mms = MinMaxScaler()\n", "\n", "# Fit and transform in one step\n", "df_num_mms = mms.fit_transform(df[numerical_features])\n", "\n", "# Create dataframe\n", "df_num_new = pd.DataFrame(df_num_mms, columns=numerical_features)\n", "\n", "df_num_new.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the end of this notebook." ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p39", "language": "python", "name": "conda_pytorch_p39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 4 }