# Performing K-fold Cross-validation with Amazon Machine Learning

This document and the associated sample scripts show how to use Amazon Machine Learning's (Amazon ML's) `DataRearrangement` parameter in the `CreateDataSource*` APIs to perform k-fold cross-validation for a binary classification model, using the AWS SDK for Python (Boto).

This example shows how to perform a 4-fold cross-validation using Amazon ML. Because k=4 in this example, the script generates four disjoint splitting ranges (folds) of equal length. Amazon ML takes one range and uses it to create an evaluation datasource, and then uses the data outside of the range to create a training datasource. This process is repeated three more times, once for each of the remaining folds, resulting in four models and four evaluations. Amazon ML then averages the quality metrics from the four evaluations to produce a single metric that represents the aggregate model quality. In this example, the single metric is the mean of the area under the curve (AUC) scores of the four evaluations.

The single metric generated by cross-validating the model reflects how well the model succeeded in generalizing the patterns it found in the data (also known as the generalization performance). Cross-validating model evaluations helps you select model settings (also known as hyperparameters) without overfitting to the evaluation data, that is, without failing to generalize patterns.

For example, say that we are deciding which value to use for the `sgd.l2RegularizationAmount` setting of a binary classification model, `1e-4` or `1e-5`. To decide, run the sample scripts twice: once with `sgd.l2RegularizationAmount = 1e-4` for all four folds, and once with `sgd.l2RegularizationAmount = 1e-5` for all four folds. After you have the single metric from each of the two cross-validations, compare the two metrics to choose which `sgd.l2RegularizationAmount` value to use. The model with the higher metric did a better job of generalizing the patterns in the data.

For information about cross-validation, see [Cross-validation](http://docs.aws.amazon.com/machine-learning/latest/dg/cross-validation.html) in the *Amazon Machine Learning Developer Guide*.
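To make the fold geometry concrete, here is a minimal illustrative sketch (not the sample's actual code; the `build_rearrangements` helper and the `"RANDOMSEED"` placeholder are ours) of how the k disjoint splitting ranges and the paired evaluation/training `DataRearrangement` JSON strings can be generated:

```python
import json

def build_rearrangements(k, random_seed="RANDOMSEED"):
    """Return one (evaluation_spec, training_spec) pair of DataRearrangement
    JSON strings per fold, covering k disjoint, equal-width percentage ranges."""
    pairs = []
    for fold in range(k):
        begin = fold * 100 // k
        end = (fold + 1) * 100 // k
        def spec(complement):
            return json.dumps({
                "splitting": {
                    "percentBegin": begin,
                    "percentEnd": end,
                    "complement": complement,
                    "strategy": "random",
                    "strategyParams": {"randomSeed": random_seed},
                }
            })
        # complement=False keeps the records inside the fold (evaluation data);
        # complement=True keeps everything outside it (training data).
        pairs.append((spec(False), spec(True)))
    return pairs

# For k=4 this yields the ranges [0, 25], [25, 50], [50, 75], and [75, 100].
for eval_spec, train_spec in build_rearrangements(4):
    print(eval_spec)
```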
## Setting Up

### Install Python

These sample scripts are written in Python 3. Although we have tested them successfully on Python 2, we recommend using Python 3. To see which version of Python you have, run this command in your CLI:

```sh
python --version
```

Learn more about how to download the latest version of Python at [Python.org](https://www.python.org/downloads/).

### Pull Dependent Libraries

The sample scripts run in an isolated Python environment. You create the environment with the `virtualenv` tool, and then install the dependent package (`boto`) with `pip` in the `./local-python` directory. The `setup.sh` script (in `machine-learning-samples/k-fold-cross-validation/setup.sh`) creates the environment and installs the package.

If you are a Python 3 developer, run:

```sh
source setup.sh
```

Note that Python 3 includes the `virtualenv` and `pip` tools. If you are a Python 2 developer and do not already have the `virtualenv` and `pip` tools in your local Python environment, you will need to install them before running the sample scripts. For example, if you are using Linux with `apt-get`, install them with the following commands:

```sh
sudo apt-get update
sudo apt-get install python-pip python-virtualenv
```

Users of other operating systems and package managers can learn more about installing `pip` [here](http://pip.readthedocs.org/en/stable/installing/), and about installing `virtualenv` [here](http://virtualenv.readthedocs.org/en/latest/installation.html). After you've installed the `virtualenv` and `pip` tools, run:

```sh
source setup.sh
```

Setup is complete. To exit the virtual environment, type `deactivate`. To clean up the dependent libraries after exiting, remove the `./local-python` directory.

### Configure AWS Credentials

Your AWS credentials must be stored in a `~/.boto` or `~/.aws/credentials` file. Your credentials file should look like this:

```ini
[Credentials]
aws_access_key_id = YOURACCESSKEY
aws_secret_access_key = YOURSECRETKEY
```

To learn more about configuring your AWS credentials with Boto, go to [Getting Started with Boto](http://boto.readthedocs.org/en/latest/getting_started.html#getting-started-with-boto).

### Add Sample Scripts

Get the samples by cloning this repository:

```sh
git clone https://github.com/awslabs/machine-learning-samples.git
```

After you have tried the sample scripts, you can integrate `machine-learning-samples/k-fold-cross-validation/build_folds.py` into your Python application by calling the `build_folds` function in the `build_folds.py` module.

## Demo

The sample code includes two scripts, `build_folds.py` and `collect_perf.py`, in the `machine-learning-samples/k-fold-cross-validation` directory. The first script (`build_folds.py`) uses the `DataRearrangement` parameter of the `CreateDataSourceFromS3`, `CreateDataSourceFromRedshift`, or `CreateDataSourceFromRDS` APIs to create the training and evaluation datasources for the cross-validation models, and then trains and evaluates ML models using these datasources. All datasources, ML models, and evaluations are based on the sample data used in the [Amazon ML tutorial](http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html). The second script (`collect_perf.py`) averages the quality metrics of the resulting evaluations to produce a single metric that you can use to compare models.

`build_folds.py` takes a resource name prefix and the number of folds as arguments, and generates the `DataRearrangement` parameters for each fold. It then uses the generated `DataRearrangement` parameters to create the datasources for training and evaluating the models. After it has created the datasources, it uses the training datasources to train the models, and the evaluation datasources to evaluate them. For each fold, `build_folds.py` creates two datasources, one model, and one evaluation.

For example, let's say that we want to perform a 4-fold cross-validation. We would run `build_folds.py` with the following command:

```sh
python build_folds.py --name 4-fold-cv-demo 4
```

In this example, the optional `--name 4-fold-cv-demo` argument defines the prefix that Amazon ML adds to the names of all of the entities created by `build_folds.py` (the datasources, models, and evaluations). The required `4` argument specifies the number of folds that Amazon ML creates for the cross-validation process. Replace these values with your own when you execute `build_folds.py`.

When `build_folds.py` executes, it displays the IDs of the objects that it creates. For the datasources, it also displays the `DataRearrangement` parameter that it used to create each datasource.
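For orientation before looking at that output, here is a minimal sketch of what one fold's API calls could look like. It uses boto3's `machinelearning` client rather than the older boto library that the sample scripts use, and all IDs, S3 paths, and the schema location below are hypothetical; substitute your own:

```python
import json

import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

splitting = {
    "percentBegin": 25,  # the second fold of four: the [25, 50] range
    "percentEnd": 50,
    "strategy": "random",
    "strategyParams": {"randomSeed": "RANDOMSEED"},
}
train_spec = json.dumps({"splitting": dict(splitting, complement=True)})
eval_spec = json.dumps({"splitting": dict(splitting, complement=False)})

data_locations = {
    "DataLocationS3": "s3://your-bucket/banking.csv",
    "DataSchemaLocationS3": "s3://your-bucket/banking.csv.schema",
}

# Training datasource: the ~75% of records *outside* the fold's range.
ml.create_data_source_from_s3(
    DataSourceId="4-fold-cv-demo-train-ds-2",
    DataSpec=dict(data_locations, DataRearrangement=train_spec),
    ComputeStatistics=True,  # statistics are required for training
)

# Evaluation datasource: the ~25% of records *inside* the range.
ml.create_data_source_from_s3(
    DataSourceId="4-fold-cv-demo-eval-ds-2",
    DataSpec=dict(data_locations, DataRearrangement=eval_spec),
    ComputeStatistics=False,
)

# One binary model trained on the fold's training datasource ...
ml.create_ml_model(
    MLModelId="4-fold-cv-demo-model-2",
    MLModelType="BINARY",
    Parameters={"sgd.l2RegularizationAmount": "1e-4"},  # the setting under test
    TrainingDataSourceId="4-fold-cv-demo-train-ds-2",
)

# ... and one evaluation of that model on the held-out fold.
ml.create_evaluation(
    EvaluationId="4-fold-cv-demo-eval-2",
    MLModelId="4-fold-cv-demo-model-2",
    EvaluationDataSourceId="4-fold-cv-demo-eval-ds-2",
)
```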
Here is an example of the `DataRearrangement` parameter from one of the four folds:

```json
{
    "splitting": {
        "complement": true,
        "percentBegin": 25,
        "percentEnd": 50,
        "strategy": "random",
        "strategyParams": {
            "randomSeed": "RANDOMSEED"
        }
    }
}
```

The `DataRearrangement` parameter has the following fields:

* `complement` – A boolean flag. To use the data within the given range to create a datasource, set `complement` to `false`. To use the data outside of the given range, set `complement` to `true`. For example, suppose that the given range is [25, 50] and `complement` is `false`. Amazon ML selects only the records within the 25-50 percent range, so the datasource contains roughly 25% of the input data's records. In contrast, if the range is the same ([25, 50]) but `complement` is `true`, Amazon ML selects the records outside of the range, and the datasource contains roughly 75% of the input data's records.
* `percentBegin`, `percentEnd` – The beginning and end of the range of the input data that is used to create a datasource. Valid values are [0, 100].
* `strategy` – When `percentBegin` and `percentEnd` are specified, this field determines how records are selected for inclusion in the specified range. The `sequential` strategy splits the data sequentially, based on record order, while the `random` strategy selects records in a pseudo-random order. If your data is already shuffled, choose `sequential`. If your data has not been shuffled, we recommend the `random` splitting strategy, to make the distribution of data consistent between the training and evaluation datasources. For more information about splitting strategies, see [Splitting Your Data](http://docs.aws.amazon.com/machine-learning/latest/dg/splitting-types.html) in the *Amazon Machine Learning Developer Guide*.
* `randomSeed` – (Optional) The seed value used by the `random` strategy. The default is an empty string. In this sample code, the seed value is defined in the `config.py` module (in `machine-learning-samples/k-fold-cross-validation/config.py`). You can replace this value with your own string in the `config.py` file before you execute the script. For more information about random seeds, see [Splitting Your Data](http://docs.aws.amazon.com/machine-learning/latest/dg/splitting-types.html) in the *Amazon Machine Learning Developer Guide*.

After `build_folds.py` finishes creating the evaluations, use `collect_perf.py` to collect the AUC scores from the four evaluations and average them to produce a single AUC metric. Non-binary models use a different type of evaluation score, such as macro-averaged F1 or RMSE, but this script handles only binary models.

For example, let's say that `build_folds.py` created the following four evaluations: `4-fold-cv-demo-eval-1`, `4-fold-cv-demo-eval-2`, `4-fold-cv-demo-eval-3`, and `4-fold-cv-demo-eval-4`. To run `collect_perf.py`, we would use the following command:

```sh
python collect_perf.py 4-fold-cv-demo-eval-1 4-fold-cv-demo-eval-2 4-fold-cv-demo-eval-3 4-fold-cv-demo-eval-4
```

`collect_perf.py` takes one argument per fold, so for a 4-fold cross-validation, you would execute it with four arguments. Replace these values with your own when you execute `collect_perf.py`.
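Conceptually, the collection step boils down to reading each evaluation's `BinaryAUC` property and averaging. Here is a minimal sketch, again using boto3's `machinelearning` client rather than the sample's own code, with the hypothetical evaluation IDs from above:

```python
import statistics
import time

import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

evaluation_ids = [
    "4-fold-cv-demo-eval-1",
    "4-fold-cv-demo-eval-2",
    "4-fold-cv-demo-eval-3",
    "4-fold-cv-demo-eval-4",
]

aucs = []
for ev_id in evaluation_ids:
    # Wait for the evaluation to finish before reading its metrics.
    while True:
        ev = ml.get_evaluation(EvaluationId=ev_id)
        if ev["Status"] == "COMPLETED":
            break
        if ev["Status"] == "FAILED":
            raise RuntimeError("evaluation {} failed".format(ev_id))
        time.sleep(30)
    # PerformanceMetrics property values are returned as strings.
    aucs.append(float(ev["PerformanceMetrics"]["Properties"]["BinaryAUC"]))

print("mean AUC:   ", statistics.mean(aucs))
print("variance:   ", statistics.variance(aucs))
print("sorted AUCs:", sorted(aucs))
```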
In addition to the single metric, `collect_perf.py` displays the variance between the models and a sorted list of all of the AUC scores that it collected, so that you can see the distribution of AUC scores. If the variance is high enough to affect your ability to compare one AUC metric with another, there are several things that you can do to try to reduce it:

1. If you're using a large number of folds, try reducing the number of folds. For example, instead of creating a 10-fold cross-validation, try creating a 5-fold cross-validation. Fewer folds means that models are trained and evaluated with larger datasources, which reduces the variability of the data from datasource to datasource, and therefore reduces the variation between the models that are trained and evaluated with those datasources.
2. If you're using the random splitting strategy, try changing the value of the `config.RANDOM_STRATEGY_RANDOM_SEED` parameter in the `config.py` file, to change the way data is selected for your datasources.
3. If you're using the sequential splitting strategy, try shuffling your data by using the random splitting strategy instead.

All resources created by these scripts are billed at the regular Amazon ML rates. For information about Amazon ML pricing, see [Amazon Machine Learning Pricing](https://aws.amazon.com/machine-learning/pricing/). For information about how to delete your resources, see [Clean Up](http://docs.aws.amazon.com/machine-learning/latest/dg/step-6-clean-up.html) in the tutorial in the *Amazon Machine Learning Developer Guide*.

For more information about how cross-validation works in Amazon ML, see [Cross-Validation](http://docs.aws.amazon.com/machine-learning/latest/dg/cross-validation.html) in the *Amazon Machine Learning Developer Guide*. For more information about the `DataRearrangement` parameter, see [Data Rearrangement](http://docs.aws.amazon.com/machine-learning/latest/dg/data-rearrangement.html) in the *Amazon Machine Learning Developer Guide*. To learn about performance metrics for Amazon ML, see [PerformanceMetrics](http://docs.aws.amazon.com/machine-learning/latest/APIReference/API_PerformanceMetrics.html) in the *Amazon Machine Learning API Reference*.
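If you prefer to delete the demo resources programmatically rather than following the Clean Up steps in the console, a minimal sketch follows. It assumes the hypothetical ID patterns used in the examples above; substitute the IDs that `build_folds.py` actually printed. The `delete_*` operations are standard Amazon ML API calls, shown here via boto3's `machinelearning` client:

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

k = 4
for fold in range(1, k + 1):
    # Delete in reverse order of creation: evaluation, model, datasources.
    ml.delete_evaluation(EvaluationId="4-fold-cv-demo-eval-{}".format(fold))
    ml.delete_ml_model(MLModelId="4-fold-cv-demo-model-{}".format(fold))
    ml.delete_data_source(DataSourceId="4-fold-cv-demo-train-ds-{}".format(fold))
    ml.delete_data_source(DataSourceId="4-fold-cv-demo-eval-ds-{}".format(fold))
```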