### Scikit-Learn Data Processing and Model Evaluation


This notebook shows how you can:

- run a processing job that uses a Scikit-Learn script to clean and pre-process the input data, perform feature engineering, and split the data into train and test sets.
- run a training job on the pre-processed training data to train a model.
- run a processing job on the pre-processed test data to evaluate the trained model's performance.
- use your own custom container to run processing jobs with your own Python libraries and dependencies.

The dataset used is the [Census-Income KDD Dataset](https://archive.ics.uci.edu/ml/datasets/Census-Income+%28KDD%29). We will select features from this dataset, clean the data, transform it into features that our training algorithm can use to train a binary classification model, and split the data into train and test sets.
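To make the clean, transform, and split steps concrete, here is a minimal sketch of what such a Scikit-Learn script might do. The toy DataFrame, its column names (`age`, `education`, `income`), and the 80/20 split are assumptions chosen for illustration; they are not the real dataset schema or the notebook's actual script.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the census records.
rng = np.random.default_rng(0)
n_rows = 200
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n_rows),
    "education": rng.choice(["HS", "Bachelors", "Masters"], size=n_rows),
    # Imbalanced binary label: 20 positives, 180 negatives.
    "income": np.array([1] * 20 + [0] * 180),
})

# Clean: drop rows with missing values (none in this toy data).
df = df.dropna()

# Split first, stratifying on the label so both sets keep the class ratio.
X = df.drop(columns="income")
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Feature engineering: scale numeric columns, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education"]),
])

# Fit on the training set only, then apply the same transform to the test set.
train_features = preprocess.fit_transform(X_train)
test_features = preprocess.transform(X_test)

print(train_features.shape, test_features.shape)
```

Fitting the transformer on the training split alone keeps test-set statistics from leaking into the features the model is trained on.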

The task is to predict whether each row, representing a census respondent, has an income greater than `$50K` or less than `$50K`. The dataset is heavily class imbalanced, with most records labeled as earning less than `$50K`. After training a logistic regression model, we will evaluate it against a hold-out test dataset and save the classification evaluation metrics: precision, recall, and F1 score for each label, plus accuracy and ROC AUC for the model.
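As a rough sketch of the evaluation step described above, the metrics can be collected into a single dictionary and serialized as JSON. The synthetic data from `make_classification`, the 90/10 class weights, and the variable names are assumptions for illustration only; the notebook's own evaluation script may differ.

```python
import json

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary data (roughly 90% negative, 10% positive).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

predictions = model.predict(X_test)
# ROC AUC is computed from the predicted probability of the positive class.
positive_scores = model.predict_proba(X_test)[:, 1]

# Per-label precision, recall, and F1, keyed by the label as a string.
report = classification_report(
    y_test, predictions, output_dict=True, zero_division=0
)
# Model-level metrics.
report["accuracy"] = float(accuracy_score(y_test, predictions))
report["roc_auc"] = float(roc_auc_score(y_test, positive_scores))

# Serialize the metrics; a real job would write this string to a file.
metrics_json = json.dumps(report, indent=2)
print(metrics_json)
```

With data this imbalanced, accuracy alone can look high even when the minority class is poorly predicted, which is why the per-label precision, recall, and F1 are saved alongside it.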