# Prerequisite

This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).

# Training a CatBoost regression model with data from DVC

This notebook guides you through an example that shows how to build a Docker container for SageMaker and use it for processing, training, and inference in conjunction with [DVC](https://dvc.org/).

By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.

### California Housing dataset

We use the California Housing dataset, available in [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html).

The California Housing dataset was originally published in:

Pace, R. Kelley, and Ronald Barry. "Sparse spatial auto-regressions." Statistics & Probability Letters 33.3 (1997): 291-297.

### DVC

DVC is built to make machine learning (ML) models shareable and reproducible.
It is designed to handle large files, data sets, machine learning models, and metrics as well as code.

## Part 1: Configure DVC for data versioning

Let us create a subdirectory named `sagemaker-dvc-sample` where we prepare the data.
Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).
The `dvc` configurations and files for data tracking will be versioned in this repository.
Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use whichever of these tools best fits your workflow.

One of the great advantages of using AWS CodeCommit in this context is its native integration with IAM for authentication, meaning we can use the SageMaker execution role to interact with the git server without worrying about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other git-based source control service such as GitHub, GitLab, or Bitbucket, keeping in mind that you will need to handle the credentials securely, for example, by using [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) to store credentials and retrieve them at run time in the notebook as well as in the processing and training jobs.

Setting the appropriate permissions on the SageMaker execution role also allows the SageMaker processing and training jobs to interact securely with AWS CodeCommit.
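As an illustration of what such permissions could look like, the sketch below builds a minimal IAM policy document that allows Git pull and push on a single CodeCommit repository. The account ID, the region, and the choice of actions are assumptions made for this example; review them against your own security requirements before attaching anything to the execution role.

```python
import json


def codecommit_policy(region: str, account: str, repo_name: str) -> dict:
    """Build an illustrative policy scoped to one CodeCommit repository."""
    repo_arn = f"arn:aws:codecommit:{region}:{account}:{repo_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Git-level actions only; repository management is not granted.
                "Action": ["codecommit:GitPull", "codecommit:GitPush"],
                "Resource": repo_arn,
            }
        ],
    }


# Placeholder account ID and region, used for illustration only.
policy = codecommit_policy("eu-west-1", "111122223333", "sagemaker-dvc-sample")
print(json.dumps(policy, indent=2))
```

The jobs also need access to the S3 location used as the DVC remote, which this sketch does not cover.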

In [None]:
%%sh

## Create the repository

repo_name="sagemaker-dvc-sample"

aws codecommit create-repository --repository-name ${repo_name} --repository-description "Sample repository to describe how to use dvc with sagemaker and codecommit"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to eu-west-1 if none defined)
region=$(python -c "import boto3;print(boto3.Session().region_name or '')")
region=${region:-eu-west-1}

## repo_name is already in the .gitignore of the root repo

mkdir -p ${repo_name}
cd ${repo_name}

# initialize a new repo in the subfolder
git init
# point the remote at the CodeCommit repository
git remote add origin https://git-codecommit."${region}".amazonaws.com/v1/repos/"${repo_name}"

# Configure git - change it according to your needs
git config --global user.email "sagemaker-studio-user@example.com"
git config --global user.name "SageMaker Studio User"

git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

# Initialize dvc
dvc init

git commit -m 'Add dvc configuration'

# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket
dvc remote add -d storage s3://sagemaker-"${region}"-"${account}"/DEMO-sagemaker-experiments-dvc
git commit .dvc/config -m "configure DVC S3 remote"

# set the DVC cache to S3
dvc remote add s3cache s3://sagemaker-"${region}"-"${account}"/DEMO-sagemaker-experiments-dvc/cache
dvc config cache.s3 s3cache

# disable sending anonymized data to dvc for troubleshooting
dvc config core.analytics false

git add .dvc/config
git commit -m 'update dvc config'

git push --set-upstream origin master #--force

## Part 2: Processing and Training with DVC and SageMaker

In this section, we explore two different approaches to our problem and show how to keep track of the two tests using SageMaker Experiments.

The high level conceptual architecture is depicted in the figure below.

<img src="./img/high-level-architecture.png">
<i>Fig. 1 High level architecture</i>


### Import libraries and initial setup

Let's start by importing the libraries and setting up the variables that will be useful as we go along in the notebook.

In [None]:
import boto3
import sagemaker
import time
from time import strftime

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)
sm_client = boto3.client("sagemaker")
region = boto_session.region_name
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
account = sagemaker_session.boto_session.client("sts").get_caller_identity()["Account"]

prefix = 'DEMO-sagemaker-experiments-dvc'

print(f"account: {account}")
print(f"bucket: {bucket}")
print(f"region: {region}")
print(f"role: {role}")

### Prepare raw data

We upload the raw data to S3 in the default bucket.

In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from pathlib import Path

databunch = fetch_california_housing()
dataset = np.concatenate((databunch["target"].reshape(-1, 1), databunch["data"]), axis=1)

print(f"Dataset shape = {dataset.shape}")
np.savetxt("dataset.csv", dataset, delimiter=",")

data_prefix_path = f"{prefix}/input/dataset.csv"
s3_data_path = f"s3://{bucket}/{data_prefix_path}"
print(f"Raw data location in S3: {s3_data_path}")

s3 = boto3.client("s3")
s3.upload_file("dataset.csv", bucket, data_prefix_path)

### Setup SageMaker Experiments

Amazon SageMaker Experiments has been built for data scientists who run different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.

Let’s start first with an overview of Amazon SageMaker Experiments features:

* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.
* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. SageMaker jobs can be assigned to a trial automatically by specifying the `experiment_config` argument, or you can call the tracking APIs manually.
* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.
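To make the automatic tracking option more concrete, here is a minimal sketch of the dictionary passed as `experiment_config` to a SageMaker job. The helper name `make_experiment_config` is hypothetical, and the optional `TrialComponentDisplayName` key is not used elsewhere in this notebook; it is included only to show how the trial component can be given a readable name.

```python
from typing import Optional


def make_experiment_config(
    experiment_name: str,
    trial_name: str,
    display_name: Optional[str] = None,
) -> dict:
    """Build the experiment_config mapping expected by SageMaker jobs."""
    config = {"ExperimentName": experiment_name, "TrialName": trial_name}
    if display_name is not None:
        # Optional: a human-readable name for the generated trial component.
        config["TrialComponentDisplayName"] = display_name
    return config


config = make_experiment_config(
    "DEMO-sagemaker-experiments-dvc", "dvc-trial-single-file", "Processing"
)
print(config)
```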

Now, in order to track this test in SageMaker, we need to create an experiment.

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

experiment_name = 'DEMO-sagemaker-experiments-dvc'

# create the experiment if it doesn't exist
try:
    my_experiment = Experiment.load(experiment_name=experiment_name)
    print("existing experiment loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_experiment = Experiment.create(
            experiment_name = experiment_name,
            description = "How to integrate DVC"
        )
        print("new experiment created")
    else:
        print(f"Unexpected error: {ex!r} ({type(ex).__name__})")
        print("Don't go forward!")
        raise

We need to also define trials within the experiment.
While it is possible to have any number of trials within an experiment, for our exercise we will create two trials, one for each processing strategy.
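Because the same load-or-create logic applies to the experiment and to each trial, it can be factored into a small helper. The sketch below is one possible shape for it; the name `load_or_create` is hypothetical, and the demo uses stand-in functions instead of the SageMaker SDK so that it runs anywhere.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_create(load: Callable[[], T], create: Callable[[], T]) -> T:
    """Return load(); call create() only if the resource was not found."""
    try:
        return load()
    except Exception as ex:
        if "ResourceNotFound" in str(ex):
            return create()
        # Anything else is unexpected and should stop the notebook.
        raise


def _missing():
    # Stand-in for a load call on a resource that does not exist yet.
    raise RuntimeError("ResourceNotFound: trial does not exist")


result = load_or_create(_missing, lambda: "new trial created")
print(result)
```

With the SDK, the two callables would wrap `Trial.load(trial_name=...)` and `Trial.create(experiment_name=..., trial_name=...)`.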

### Test 1: generate single files for training and validation

In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, processes it to create the triplet `train`, `validation` and `test`, and stores the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments.

In [None]:
first_trial_name = "dvc-trial-single-file"

try:
    my_first_trial = Trial.load(trial_name=first_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_first_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=first_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected error: {ex!r} ({type(ex).__name__})")
        print("Don't go forward!")
        raise

### Processing script: version data with DVC

The processing script expects the address of the git repository and the branch we want to <i>create</i> to store the `dvc` metadata passed via environmental variables.
The datasets themselves will be then stored in S3.
Environmental variables are automatically tracked in SageMaker Experiments in the automatically generated <i>TrialComponent</i>.
The <i>TrialComponent</i> generated by SageMaker can be loaded within the Processing Job and further enriched with any extra data, which then becomes available for visualization in the SageMaker Studio UI.
In our case, we will store the following data:
* `DVC_REPO_URL`
* `DVC_BRANCH`
* `USER`
* `data_commit_hash`
* `train_test_split_ratio`
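The actual logic lives in `source_dir/preprocessing-experiment.py`, which is not reproduced here. As a rough sketch of the idea, the values listed above could be gathered into a single dictionary before being attached to the trial component (for example with the `log_parameters` method of the `smexperiments` tracker). The helper name and the sample values below are hypothetical.

```python
from typing import Mapping


def collect_tracking_params(
    env: Mapping[str, str], commit_hash: str, split_ratio: float
) -> dict:
    """Gather the values we want to see next to the job in SageMaker Studio."""
    return {
        "DVC_REPO_URL": env.get("DVC_REPO_URL", ""),
        "DVC_BRANCH": env.get("DVC_BRANCH", ""),
        "USER": env.get("USER", ""),
        # Git commit that versions the DVC metadata of the generated data.
        "data_commit_hash": commit_hash,
        "train_test_split_ratio": split_ratio,
    }


# Stand-in for os.environ inside the processing container.
fake_env = {
    "DVC_REPO_URL": "codecommit::eu-west-1://sagemaker-dvc-sample",
    "DVC_BRANCH": "dvc-trial-single-file",
    "USER": "sagemaker",
}
params = collect_tracking_params(fake_env, commit_hash="0123abc", split_ratio=0.2)
print(params)
```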

In [None]:
!pygmentize 'source_dir/preprocessing-experiment.py'

### SageMaker Processing job

SageMaker Processing gives us the possibility to execute our processing script on container images managed by AWS that are optimized to run on the AWS infrastructure.
If our script requires additional dependencies, we can supply a `requirements.txt` file.
When the processing job starts, SageMaker will `pip`-install all the libraries we need (e.g., `dvc`-related libraries).

We now have all the ingredients to execute our SageMaker Processing Job:
* a processing script that accepts command-line arguments (e.g., `--train-test-split-ratio`) and two environment variables (`DVC_REPO_URL` and `DVC_BRANCH`)
* a `requirements.txt` file
* a git repository (in AWS CodeCommit)
* a SageMaker Experiment and a Trial
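To illustrate the first ingredient, the sketch below parses `--train-test-split-ratio` and uses it to produce a `train`, `validation`, and `test` split. The splitting strategy (holding out the same ratio twice) and the function names are assumptions made for this example; the real script may split the data differently.

```python
import argparse
import random


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    return parser.parse_args(argv)


def three_way_split(rows, ratio, seed=42):
    """Hold out `ratio` of the rows for test, then `ratio` of the rest for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * ratio)
    test, rest = shuffled[:n_test], shuffled[n_test:]
    n_val = int(len(rest) * ratio)
    validation, train = rest[:n_val], rest[n_val:]
    return train, validation, test


# Same argument list that is passed to the processing job.
args = parse_args(["--train-test-split-ratio", "0.2"])
train, validation, test = three_way_split(range(100), args.train_test_split_ratio)
print(len(train), len(validation), len(test))
```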

In [None]:
from sagemaker.processing import FrameworkProcessor, ProcessingInput
from sagemaker.sklearn.estimator import SKLearn

dvc_repo_url = "codecommit::{}://sagemaker-dvc-sample".format(region)
dvc_branch = my_first_trial.trial_name

script_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version='0.23-1',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    env={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker"
    },
    role=role
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name
}

Executing the processing job will take around 3-4 minutes.

In [None]:
%%time

script_processor.run(
    code='./source_dir/preprocessing-experiment.py',
    dependencies=['./source_dir/requirements.txt'],
    inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
    experiment_config=experiment_config,
    arguments=["--train-test-split-ratio", "0.2"]
)


### Create an estimator and fit the model

To use DVC integration, pass a `dvc_repo_url` and `dvc_branch` as environmental variables when you create the Estimator object.

We will train on the `dvc-trial-single-file` branch first.

When doing `dvc pull` in the training script, the following dataset structure will be generated:

```
dataset
    |-- train
    |   |-- california_train.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation.csv
```
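The training script itself (`source_dir/train.py`) is not shown in this notebook. As a hedged sketch of how it could turn the two environment variables into the `dvc pull` step, the function below builds the commands to clone the metadata branch and fetch the data. The function name and the target directory are hypothetical, and cloning a `codecommit::` URL presumes that the `git-remote-codecommit` helper is installed in the container.

```python
from typing import List, Mapping


def build_fetch_commands(env: Mapping[str, str], target_dir: str) -> List[List[str]]:
    """Return the git and dvc commands needed to materialize the dataset."""
    repo_url = env["DVC_REPO_URL"]  # set via the Estimator's `environment` argument
    branch = env["DVC_BRANCH"]
    return [
        # Shallow-clone only the branch that holds the DVC metadata.
        ["git", "clone", "--depth", "1", "--branch", branch, repo_url, target_dir],
        # Download the files referenced by the metadata from the DVC remote.
        ["dvc", "pull"],
    ]


# Hypothetical values, mirroring the ones used in this notebook.
example_env = {
    "DVC_REPO_URL": "codecommit::eu-west-1://sagemaker-dvc-sample",
    "DVC_BRANCH": "dvc-trial-single-file",
}
commands = build_fetch_commands(example_env, target_dir="/tmp/sagemaker-dvc-sample")
for cmd in commands:
    print(" ".join(cmd))
```

Each command could then be executed with `subprocess.check_call`, running `dvc pull` from inside the cloned directory.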

#### Metric definition

SageMaker forwards everything the training job writes to STDOUT to CloudWatch Logs. To capture the metrics we are interested in, we specify metric definitions, each of which pairs a metric name with a regular expression describing how the value appears in the logs.
SageMaker then applies these regular expressions to the CloudWatch logs of the training job to extract the metric values.

In our case, we are interested in the median absolute error (AE at the 50th percentile).
```
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]
```
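To see how this regular expression behaves, the sketch below computes a median absolute error on a few made-up values, formats a log line the way the metric definition expects, and extracts the number with Python's `re` module. The sample values are invented for illustration, and the matching here only approximates what SageMaker does on the CloudWatch side.

```python
import re
import statistics

# Same pattern as in the metric definition.
pattern = re.compile(r"AE-at-50th-percentile: ([0-9.]+).*$")

# Made-up targets and predictions.
y_true = [2.0, 1.5, 3.0, 0.5]
y_pred = [2.5, 1.0, 2.0, 0.75]

# Median of the absolute errors.
median_ae = statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))

# The line a training script would print to STDOUT.
log_line = f"AE-at-50th-percentile: {median_ae}"
print(log_line)

match = pattern.search(log_line)
captured = float(match.group(1))
print(captured)
```

Note that `[0-9.]+` only matches digits and dots, so a value printed in scientific notation (for example `1e-05`) would be captured only up to the `e`.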

In [None]:
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters = {
    "learning_rate": 1,
    "depth": 6,
}
estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    base_job_name='training-with-dvc-data',
    environment={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker"
    }
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name
}

In [None]:
%%time

estimator.fit(experiment_config=experiment_config)

In the logs above, you can see the following lines, which list the files pulled by `dvc`:

```
Running dvc pull command
A       train/california_train.csv
A       test/california_test.csv
A       validation/california_validation.csv
3 files added and 3 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']
Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation.csv']
```

### Test 2: generate multiple files for training and validation

In [None]:
second_trial_name = "dvc-trial-multi-files"

try:
    my_second_trial = Trial.load(trial_name=second_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_second_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=second_trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected error: {ex!r} ({type(ex).__name__})")
        print("Don't go forward!")
        raise

Unlike the first processing script, this one splits the original dataset into multiple files for training and validation and stores the `dvc` metadata in a different branch.

In [None]:
!pygmentize 'source_dir/preprocessing-experiment-multifiles.py'

In [None]:
from sagemaker.processing import FrameworkProcessor, ProcessingInput
from sagemaker.sklearn.estimator import SKLearn

dvc_branch = my_second_trial.trial_name

script_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version='0.23-1',
    instance_count=1,
    instance_type='ml.m5.xlarge',
    env={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker",
    },
    role=role
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_second_trial.trial_name
}

Executing the processing job will take around 5 minutes.

In [None]:
%%time

script_processor.run(
    code='./source_dir/preprocessing-experiment-multifiles.py',
    dependencies=['./source_dir/requirements.txt'],
    inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
    experiment_config=experiment_config,
    arguments=["--train-test-split-ratio", "0.1"]
)

We will now train on the `dvc-trial-multi-files` branch.

When doing `dvc pull`, this is the dataset structure:

```
dataset
    |-- train
    |   |-- california_train_1.csv
    |   |-- california_train_2.csv
    |   |-- california_train_3.csv
    |   |-- california_train_4.csv
    |   |-- california_train_5.csv
    |-- test
    |   |-- california_test.csv
    |-- validation
    |   |-- california_validation_1.csv
    |   |-- california_validation_2.csv
    |   |-- california_validation_3.csv
```
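With several files per split, the training script has to discover them before loading. A minimal sketch of that discovery step is shown below; it builds a throwaway directory tree so that it can run anywhere. The function name `find_csv_files` is hypothetical, and the real `train.py` may collect files differently.

```python
import tempfile
from pathlib import Path
from typing import List


def find_csv_files(root: Path, split: str) -> List[Path]:
    """Return all CSV files of one split, sorted for a deterministic order."""
    return sorted((root / split).glob("*.csv"))


layout = {
    "train": ["california_train_2.csv", "california_train_1.csv"],
    "validation": ["california_validation_1.csv"],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "dataset"
    # Recreate a small version of the structure shown above.
    for split, names in layout.items():
        (root / split).mkdir(parents=True)
        for name in names:
            (root / split / name).write_text("0.0,1.0\n")

    train_files = [p.name for p in find_csv_files(root, "train")]
    validation_files = [p.name for p in find_csv_files(root, "validation")]

print(train_files)
print(validation_files)
```

The resulting paths could then be read one by one and concatenated into a single training set.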

In [None]:
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters = {
    "learning_rate": 1,
    "depth": 6,
}

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    base_job_name='training-with-dvc-data',
    environment={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker"
    }
)

experiment_config={
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_second_trial.trial_name,
}

The training job will take around 5 minutes.

In [None]:
%%time

estimator.fit(experiment_config=experiment_config)

In the logs above, you can see the following lines, which list the files pulled by `dvc`:

```
Running dvc pull command
A       validation/california_validation_2.csv
A       validation/california_validation_1.csv
A       validation/california_validation_3.csv
A       train/california_train_4.csv
A       train/california_train_5.csv
A       train/california_train_2.csv
A       train/california_train_3.csv
A       train/california_train_1.csv
A       test/california_test.csv
9 files added and 9 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']
Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']
```

## Part 3: Hosting your model in SageMaker

In [None]:
from sagemaker.serializers import CSVSerializer

predictor = estimator.deploy(1, "ml.t2.medium", serializer=CSVSerializer())

### Fetch the testing data

Read the raw test data, which the SageMaker Processing Job stored in S3 and versioned with DVC, using the `dvc` Python API.

In [None]:
import io
import dvc.api

git_repo_https = f"https://git-codecommit.{region}.amazonaws.com/v1/repos/sagemaker-dvc-sample"

raw = dvc.api.read(
    "dataset/test/california_test.csv",
    repo=git_repo_https,
    rev=dvc_branch
)
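Two URL forms for the same repository appear in this notebook: the `codecommit::<region>://<repository>` form passed to the jobs, which relies on the `git-remote-codecommit` helper, and the HTTPS form used above, which relies on the AWS CLI credential helper configured in Part 1. The small sketch below derives both from the same inputs; the helper name `codecommit_urls` and the region value are assumptions for illustration.

```python
def codecommit_urls(region: str, repo_name: str) -> dict:
    """Return both URL forms for one CodeCommit repository."""
    return {
        # Form understood by the git-remote-codecommit helper.
        "grc": f"codecommit::{region}://{repo_name}",
        # Plain HTTPS form, used with the AWS CLI credential helper.
        "https": f"https://git-codecommit.{region}.amazonaws.com/v1/repos/{repo_name}",
    }


urls = codecommit_urls("eu-west-1", "sagemaker-dvc-sample")
print(urls["grc"])
print(urls["https"])
```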

Prepare the data: the first column holds the target and the remaining columns hold the features.

In [None]:
test = pd.read_csv(io.StringIO(raw), sep=",", header=None)
X_test = test.iloc[:, 1:].values
y_test = test.iloc[:, 0:1].values

### Invoke endpoint with the Python SDK

In [None]:
predicted = predictor.predict(X_test)
for i in range(len(predicted)):
    print(f"predicted: {predicted[i]}, actual: {y_test[i][0]}")

### Delete the Endpoint

Make sure to delete the endpoint to avoid unexpected costs.

In [None]:
predictor.delete_endpoint()

### (Optional) Delete the Experiment and all its Trials and TrialComponents

In [None]:
#my_experiment.delete_all(action="--force")

### (Optional) Delete the AWS CodeCommit repository

In [None]:
#!aws codecommit delete-repository --repository-name sagemaker-dvc-sample