# Prerequisite

This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).

# Training a CatBoost regression model with data from DVC

This notebook guides you through an example that shows how to build Docker containers for SageMaker and use them for processing, training, and inference in conjunction with [DVC](https://dvc.org/).

By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.

### California Housing dataset

We use the California Housing dataset, present in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). 

The California Housing dataset was originally published in:

Pace, R. Kelley, and Ronald Barry. "Sparse spatial auto-regressions." Statistics & Probability Letters 33.3 (1997): 291-297.

### DVC

DVC is built to make machine learning (ML) models shareable and reproducible.
It is designed to handle large files, data sets, machine learning models, and metrics as well as code.

## Part 1: Configure DVC for data versioning

Let us create a subdirectory where we prepare the data, i.e. `sagemaker-dvc-sample`.
Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).
The `dvc` configurations and files for data tracking will be versioned in this repository.
Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use any of the aforementioned tools that best fit your workflow.

One of the great advantages of using AWS CodeCommit in this context is its native integration with IAM for authentication, meaning we can use the SageMaker execution role to interact with the git server without worrying about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other git-based version control system such as GitHub, GitLab, or Bitbucket, keeping in mind that you will need to handle the credentials in a secure manner, for example, by using AWS Secrets Manager to store and pull credentials at run time in the notebook as well as in the processing and training jobs.

Setting the appropriate permissions on SageMaker execution role will also allow the SageMaker processing and training job to interact securely with the AWS CodeCommit.
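The repository is addressed in two forms in this notebook: an HTTPS URL used with the AWS CLI credential helper, and a `codecommit::` URL passed to the processing and training jobs. The sketch below only shows how the two strings are assembled; the helper names `codecommit_https_url` and `codecommit_grc_url` are hypothetical and not part of any library.

```python
def codecommit_https_url(region: str, repo_name: str) -> str:
    """HTTPS URL, the form used for `git remote add origin` in this notebook."""
    return f"https://git-codecommit.{region}.amazonaws.com/v1/repos/{repo_name}"


def codecommit_grc_url(region: str, repo_name: str) -> str:
    """`codecommit::` URL, the form passed to the jobs as the DVC repo URL."""
    return f"codecommit::{region}://{repo_name}"


# Example values; replace them with your own region and repository name.
print(codecommit_https_url("eu-west-1", "sagemaker-dvc-sample"))
print(codecommit_grc_url("eu-west-1", "sagemaker-dvc-sample"))
```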

In [None]:
%%sh

## Create the repository

repo_name="sagemaker-dvc-sample"

aws codecommit create-repository --repository-name ${repo_name} --repository-description "Sample repository to describe how to use dvc with sagemaker and codecommit"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to eu-west-1 if none defined)
region=$(python -c "import boto3;print(boto3.Session().region_name)")
region=${region:-eu-west-1}

## repo_name is already in the .gitignore of the root repo

mkdir -p ${repo_name}
cd ${repo_name}

# initialize a new repo in the subfolder
git init
## Change the remote to the codecommit
git remote add origin https://git-codecommit."${region}".amazonaws.com/v1/repos/"${repo_name}"

# Configure git - change it according to your needs
git config --global user.email "sagemaker-studio-user@example.com"
git config --global user.name "SageMaker Studio User"

git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

# Initialize dvc
dvc init

git commit -m 'Add dvc configuration'

# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket
dvc remote add -d storage s3://sagemaker-"${region}"-"${account}"/DEMO-sagemaker-experiments-dvc
git commit .dvc/config -m "initialize DVC local remote"

# set the DVC cache to S3
dvc remote add s3cache s3://sagemaker-"${region}"-"${account}"/DEMO-sagemaker-experiments-dvc/cache
dvc config cache.s3 s3cache

# disable sending anonymized data to dvc for troubleshooting
dvc config core.analytics false

git add .dvc/config
git commit -m 'update dvc config'

git push --set-upstream origin master #--force

## Part 2: Packaging and Uploading your container images for use with Amazon SageMaker

### An overview of Docker

If you're familiar with Docker already, you can skip ahead to the next section.

For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. 

Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.

Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variables, etc.

In some ways, a Docker container is like a virtual machine, but it is much lighter weight. For example, a program running in a container can start in less than a second and many containers can run on the same physical machine or virtual machine instance.

Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.

Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].

Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.

In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.

Some helpful links:

* [Docker home page](http://www.docker.com)
* [Getting started with Docker](https://docs.docker.com/get-started/)
* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
* [`docker run` reference](https://docs.docker.com/engine/reference/run/)

[Amazon ECS]: https://aws.amazon.com/ecs/

### SageMaker Docker container for processing

Let us now build and register the container for processing. In doing so, we ensure that all `dvc` related dependencies are already installed and we do not need to `pip install` or `git configure` anything within the processing scripts, and we can concentrate on the data preparation and feature engineering.

We aim to have one image for processing where we can supply our own processing script. More information on how to build your own processing container can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-container-run-scripts.html). For a formal specification that defines the contract for an Amazon SageMaker Processing container, see [Build Your Own Processing Container (Advanced Scenario)](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html). 

In [None]:
!cat container/processing/Dockerfile

### Building and registering the containers

We will use [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) to store our container images.

To easily build custom container images from your Studio notebooks, we use the [SageMaker Docker Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli). For more information on the SageMaker Docker Build CLI, interested readers can refer to [this blogpost](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/).

In [None]:
%%sh

# The name of the image
image_name=sagemaker-processing-dvc

cd container/processing

# Get the region defined in the current configuration (default to eu-west-1 if none defined)
region=$(python -c "import boto3;print(boto3.Session().region_name)")
region=${region:-eu-west-1}

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region "${region}" --repository-names "${image_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
 aws ecr create-repository --region "${region}" --repository-name "${image_name}" > /dev/null
fi

sm-docker build . --repository "${image_name}:latest"

### SageMaker Docker container for training and hosting

Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:

* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time and `serve` at serving time. In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.
* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.
* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. 
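For the `ENTRYPOINT` variant described above, the program receives `train` or `serve` as its first argument and decides what to do. A minimal sketch of such a dispatcher follows. The functions `run_training` and `run_serving` are placeholders for illustration; the sample container in this notebook ships separate `train` and `serve` executables instead.

```python
def run_training() -> str:
    # Placeholder: a real container would fit and save a model here.
    return "training"


def run_serving() -> str:
    # Placeholder: a real container would start the web server here.
    return "serving"


def dispatch(argv: list) -> str:
    """Choose the action from the first command-line argument."""
    if len(argv) < 2:
        raise SystemExit("usage: entrypoint (train|serve)")
    handlers = {"train": run_training, "serve": run_serving}
    mode = argv[1]
    if mode not in handlers:
        raise SystemExit(f"unknown mode: {mode!r}")
    return handlers[mode]()


# Simulate how the container would be started for training.
print(dispatch(["entrypoint", "train"]))
```

In a real image you would call `dispatch(sys.argv)` from the entry point script.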


#### Running your container during training

When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory:

    /opt/ml
    |-- input
    |   |-- config
    |   |   |-- hyperparameters.json
    |   |   `-- resourceConfig.json
    |   `-- data
    |       `-- <channel_name>
    |           `-- <input data>
    |-- model
    |   `-- <model files>
    `-- output
        `-- failure

##### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since this example trains on a single instance, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob, but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.

##### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
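Because the values in `hyperparameters.json` always arrive as strings, a training script typically casts them before use. The sketch below illustrates that step, along with writing the `failure` file, in a temporary directory rather than the real `/opt/ml` tree. The helper names are hypothetical, and the two hyperparameters are the ones this notebook passes to the estimator.

```python
import json
import tempfile
from pathlib import Path


def load_hyperparameters(path: Path) -> dict:
    """Read hyperparameters.json and cast the string values to numbers."""
    raw = json.loads(path.read_text())
    return {
        "learning_rate": float(raw["learning_rate"]),
        "depth": int(raw["depth"]),
    }


def write_failure(output_dir: Path, reason: str) -> None:
    """Write the `failure` file describing why the job failed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "failure").write_text(reason)


# Demo in a temporary directory standing in for /opt/ml.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp) / "input" / "config"
    config_dir.mkdir(parents=True)
    (config_dir / "hyperparameters.json").write_text(
        '{"learning_rate": "1", "depth": "6"}'
    )
    params = load_hyperparameters(config_dir / "hyperparameters.json")
    write_failure(Path(tmp) / "output", "example failure reason")
    failure_text = (Path(tmp) / "output" / "failure").read_text()

print(params)
```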

#### Running your container during hosting

Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests.

This stack is implemented in the sample code here and you can mostly just leave it alone. 

Amazon SageMaker uses two URLs in the container:

* `/ping` will receive `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. 
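To make the two-URL contract concrete, here is a standard-library-only sketch of a WSGI application that answers `/ping` and `/invocations`. It is a stand-in for illustration, not the nginx, gunicorn, and Flask stack used by the sample container, and the `predict` function is a dummy that sums each row's features instead of calling a trained model.

```python
import io


def predict(rows):
    # Dummy stand-in for a model: the sum of each row's features.
    return [sum(row) for row in rows]


def app(environ, start_response):
    """Minimal WSGI app exposing /ping and /invocations."""
    method = environ["REQUEST_METHOD"]
    path = environ["PATH_INFO"]
    if method == "GET" and path == "/ping":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b""]
    if method == "POST" and path == "/invocations":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length).decode("utf-8")
        rows = [
            [float(v) for v in line.split(",")]
            for line in body.splitlines()
            if line
        ]
        result = "\n".join(str(p) for p in predict(rows)) + "\n"
        start_response("200 OK", [("Content-Type", "text/csv")])
        return [result.encode("utf-8")]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def call(method, path, body=b""):
    """Invoke the app directly, without a server, and return (status, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    return captured.get("status") or "", b"".join(app(environ, start_response)) if False else _run(environ, start_response, captured)


def _run(environ, start_response, captured):
    # Run the app first so that the status is captured before it is read.
    return b"".join(app(environ, start_response))


def request(method, path, body=b""):
    """Convenience wrapper returning (status, body) for one request."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(body)),
        "wsgi.input": io.BytesIO(body),
    }
    payload = b"".join(app(environ, start_response))
    return captured["status"], payload


print(request("GET", "/ping"))
print(request("POST", "/invocations", b"1.0,2.0\n3.0,4.5\n"))
```

Use the `request` helper to exercise the app; any WSGI server could host `app` unchanged.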

The container will have the model files in the same place they were written during training:

    /opt/ml
    `-- model
        `-- <model files>

### The parts of the training and inference container

In the `container/train_and_serve` directory are all the components you need to package the sample algorithm for Amazon SageMaker:

    .
    |-- Dockerfile
    |-- README.md
    `-- catboost_regressor
        |-- nginx.conf
        |-- predictor.py
        |-- serve
        |-- train
        `-- wsgi.py

Let's discuss each of these in turn:

* __`Dockerfile`__ describes how to build your Docker container image. More details below.
* __`catboost_regressor`__ is the directory which contains the files that will be installed in the container.

In this simple application, we only install five files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These five show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.

The files that we'll put in the container are:

* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.
* __`predictor.py`__ is the program that actually implements the Flask web server and the model predictions for this app. You'll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.
* __`serve`__ is the program started when the container is started for hosting. It simply launches the gunicorn server which runs multiple instances of the Flask app defined in `predictor.py`. You should be able to take this file as-is.
* __`train`__ is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.
* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.

In summary, the two files you will probably want to change for your application are `train` and `predictor.py`.

In [None]:
!cat container/train_and_serve/Dockerfile

In [None]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-catboost-dvc

cd container/train_and_serve

chmod +x catboost_regressor/train
chmod +x catboost_regressor/serve

# Get the region defined in the current configuration (default to eu-west-1 if none defined)
region=$(python -c "import boto3;print(boto3.Session().region_name)")
region=${region:-eu-west-1}

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region "${region}" --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
 aws ecr create-repository --region "${region}" --repository-name "${algorithm_name}" > /dev/null
fi

sm-docker build . --repository "${algorithm_name}:latest"

## Part 3: Processing and Training with DVC and SageMaker

In this section we explore two different approaches to tackle our problem and how we can keep track of the 2 tests using SageMaker Experiments.

The high level conceptual architecture is depicted in the figure below



Let's unfold in the following sections the implementation details of the two experiments.

### Import libraries and initial setup

Let's start by importing the libraries and setting up variables that will be useful as we go along in the notebook.

In [None]:
import boto3
import sagemaker
import time
from time import strftime

boto_session = boto3.Session()
sagemaker_session = sagemaker.Session(boto_session=boto_session)
sm_client = boto3.client("sagemaker")
region = boto_session.region_name
bucket = sagemaker_session.default_bucket()
role = sagemaker.get_execution_role()
account = sagemaker_session.boto_session.client("sts").get_caller_identity()["Account"]

prefix = 'DEMO-sagemaker-experiments-dvc'

print(f"account: {account}")
print(f"bucket: {bucket}")
print(f"region: {region}")
print(f"role: {role}")

### Prepare raw data

We upload the raw data to S3 in the default bucket.

In [None]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from pathlib import Path

databunch = fetch_california_housing()
dataset = np.concatenate((databunch["target"].reshape(-1, 1), databunch["data"]), axis=1)

print(f"Dataset shape = {dataset.shape}")
np.savetxt("dataset.csv", dataset, delimiter=",")

data_prefix_path = f"{prefix}/input/dataset.csv"
s3_data_path = f"s3://{bucket}/{data_prefix_path}"
print(f"Raw data location in S3: {s3_data_path}")

s3 = boto3.client("s3")
s3.upload_file("dataset.csv", bucket, data_prefix_path)

### Setup SageMaker Experiments

Amazon SageMaker Experiments has been built for data scientists who run different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.

Let’s start first with an overview of Amazon SageMaker Experiments features:

* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.
* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. It can automatically assign SageMaker jobs to a trial when you specify the `experiment_config` argument, or you can call the tracking APIs manually.
* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.

Now, in order to track this test in SageMaker, we need to create an experiment.

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

experiment_name = 'DEMO-sagemaker-experiments-dvc'

# create the experiment if it doesn't exist
try:
    my_experiment = Experiment.load(experiment_name=experiment_name)
    print("existing experiment loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_experiment = Experiment.create(
            experiment_name=experiment_name,
            description="How to integrate DVC",
        )
        print("new experiment created")
    else:
        # Any other error is unexpected: report it and stop here
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

We need to also define trials within the experiment.
While it is possible to have any number of trials within an experiment, for our exercise, we will create 2 trials, one for each processing strategy.

### Test 1: generate single files for training and validation

In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, processes it to create the triplet `train`, `validation` and `test`, and stores the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments.

In [None]:
first_trial_name = "dvc-trial-single-file"

try:
    my_first_trial = Trial.load(trial_name=first_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_first_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=first_trial_name,
        )
        print("new trial created")
    else:
        # Any other error is unexpected: report it and stop here
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

### Processing script: version data with DVC

The processing script takes as arguments the address of the git repository and the branch we want to create to store the `dvc` metadata. The datasets themselves will then be stored in S3. The arguments passed to the processing script are not automatically tracked in the TrialComponent that SageMaker Experiments generates. However, this TrialComponent can be loaded within the Processing Job and further enriched with any extra data, which then becomes available for visualization in the SageMaker Studio UI. In our case, we will store the following data:
* `data_repo_url`
* `data_branch`
* `data_commit_hash`
* `train_test_split_ratio`
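As a rough illustration of where these values could come from inside a processing script, the sketch below gathers them into a dictionary ready to be logged. It assumes the repository URL and branch arrive through the `DVC_REPO_URL` and `DVC_BRANCH` environment variables set on the `ScriptProcessor` in this notebook, and that the split ratio arrives as the `--train-test-split-ratio` argument. The function name `collect_tracking_parameters` is hypothetical, and the actual script may be organized differently.

```python
import argparse


def collect_tracking_parameters(argv: list, environ: dict) -> dict:
    """Gather the values a processing job would record for tracking."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, required=True)
    args = parser.parse_args(argv)
    return {
        "data_repo_url": environ["DVC_REPO_URL"],
        "data_branch": environ["DVC_BRANCH"],
        "train_test_split_ratio": args.train_test_split_ratio,
        # data_commit_hash is only known after the DVC metadata is committed,
        # so it is left out of this sketch.
    }


# Example values mirroring the first trial in this notebook.
example_env = {
    "DVC_REPO_URL": "codecommit::eu-west-1://sagemaker-dvc-sample",
    "DVC_BRANCH": "dvc-trial-single-file",
}
tracked = collect_tracking_parameters(["--train-test-split-ratio", "0.2"], example_env)
print(tracked)
```

In a real job you would pass `sys.argv[1:]` and `os.environ` instead of the example values.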

In [None]:
!pygmentize 'source_dir/preprocessing-experiment.py'

### SageMaker Processing job

We now have all the ingredients to execute our SageMaker Processing Job:
* a custom image with dvc installed
* a git repository (i.e., AWS CodeCommit)
* a processing script that can process several arguments (i.e., `--train-test-split-ratio`, `--dvc-repo-url`, `--dvc-branch`)
* a SageMaker Experiment and a Trial

In [None]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

dvc_repo_url = "codecommit::{}://sagemaker-dvc-sample".format(region)
dvc_branch = my_first_trial.trial_name

image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest".format(account, region)

script_processor = ScriptProcessor(
    command=["python3"],
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    env={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker",
    },
)

experiment_config = {
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name,
}

Executing the processing job will take around 3-4 minutes.

In [None]:
%%time

script_processor.run(
 code='source_dir/preprocessing-experiment.py',
 inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
 experiment_config=experiment_config,
 arguments=["--train-test-split-ratio", "0.2"],
)

### Create an estimator and fit the model

To use the DVC integration, pass `dvc_repo_url` and `dvc_branch` to the training container through the `DVC_REPO_URL` and `DVC_BRANCH` environment variables when you create the Estimator object.

We will train on the `dvc-trial-single-file` branch first.

When doing `dvc pull` in the training script, the following dataset structure will be generated:

```
dataset
 |-- train
 | |-- california_train.csv
 |-- test
 | |-- california_test.csv
 |-- validation
 | |-- california_validation.csv
```

#### Metric definition

SageMaker sends everything the training job writes to STDOUT to CloudWatch. To capture the metrics we are interested in, we need to specify a metric definition object that describes the format of the metrics via a regex. By doing so, SageMaker will know how to capture the metrics from the CloudWatch logs of the training job.

In our case, we are interested in the median error.
```
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]
```
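Before launching a training job, it can help to check locally that the regex captures the number you expect. The sketch below uses Python's `re` module as an approximation of that matching. The log line is made up for illustration; the real value depends on the training run.

```python
import re

metric_definitions = [
    {"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}
]

# Hypothetical log line in the format the regex expects.
log_line = "AE-at-50th-percentile: 0.3125"

match = re.search(metric_definitions[0]["Regex"], log_line)
value = float(match.group(1))  # the first capture group holds the metric value
print(metric_definitions[0]["Name"], value)
```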

In [None]:
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest".format(account, region)

metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters = {
    "learning_rate": 1,
    "depth": 6,
}

estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    metric_definitions=metric_definitions,
    instance_type="ml.m5.large",
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    environment={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker",
    },
)

experiment_config = {
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_first_trial.trial_name,
}

In [None]:
%%time

estimator.fit(experiment_config=experiment_config)

In the logs above, you can see lines like the following, which list the files pulled by dvc:

```
Running dvc pull command
A train/california_train.csv
A test/california_test.csv
A validation/california_validation.csv
3 files added and 3 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']
Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']
```

### Test 2: generate multiple files for training and validation

In [None]:
second_trial_name = "dvc-trial-multi-files"

try:
    my_second_trial = Trial.load(trial_name=second_trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_second_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=second_trial_name,
        )
        print("new trial created")
    else:
        # Any other error is unexpected: report it and stop here
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

Unlike the first processing script, we now split the original dataset into multiple files for training and validation, and store the `dvc` metadata in a different branch.

In [None]:
!pygmentize 'source_dir/preprocessing-experiment-multifiles.py'

In [None]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest".format(account, region)

dvc_branch = my_second_trial.trial_name

script_processor = ScriptProcessor(
    command=["python3"],
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    env={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker",
    },
)

experiment_config = {
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_second_trial.trial_name,
}

Executing the processing job will take around 5 minutes.

In [None]:
%%time

script_processor.run(
 code='source_dir/preprocessing-experiment-multifiles.py',
 inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
 experiment_config=experiment_config,
 arguments=["--train-test-split-ratio", "0.1"],
)

We will now train on the `dvc-trial-multi-files` branch.

When doing `dvc pull`, this is the dataset structure:

```
dataset
 |-- train
 | |-- california_train_1.csv
 | |-- california_train_2.csv
 | |-- california_train_3.csv
 | |-- california_train_4.csv
 | |-- california_train_5.csv
 |-- test
 | |-- california_test.csv
 |-- validation
 | |-- california_validation_1.csv
 | |-- california_validation_2.csv
 | |-- california_validation_3.csv
```
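With several files per split, the training script needs to discover them rather than rely on a fixed file name. The sketch below shows one way to do that with `pathlib`, using a temporary directory that mimics the structure above. The helper name `find_csv_files` is hypothetical; the actual `train` script may collect files differently.

```python
import tempfile
from pathlib import Path


def find_csv_files(dataset_dir: Path, split: str) -> list:
    """Return the sorted CSV paths found under dataset_dir/split."""
    return sorted(str(p) for p in (dataset_dir / split).glob("*.csv"))


# Build an empty copy of the directory layout in a temporary location.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "dataset"
    for split, count in (("train", 5), ("validation", 3)):
        (dataset / split).mkdir(parents=True)
        for i in range(1, count + 1):
            (dataset / split / f"california_{split}_{i}.csv").write_text("")
    train_files = find_csv_files(dataset, "train")
    validation_files = find_csv_files(dataset, "validation")

print(len(train_files), len(validation_files))
```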

In [None]:
image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest".format(account, region)

hyperparameters = {
    "learning_rate": 1,
    "depth": 6,
}

estimator = sagemaker.estimator.Estimator(
    image,
    role,
    instance_count=1,
    metric_definitions=metric_definitions,
    instance_type="ml.m5.large",
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    environment={
        "DVC_REPO_URL": dvc_repo_url,
        "DVC_BRANCH": dvc_branch,
        "USER": "sagemaker",
    },
)

experiment_config = {
    "ExperimentName": my_experiment.experiment_name,
    "TrialName": my_second_trial.trial_name,
}

The training job will take around 5 minutes.

In [None]:
%%time

estimator.fit(experiment_config=experiment_config)

In the logs above, you can see lines like the following, which list the files pulled by dvc:

```
Running dvc pull command
A validation/california_validation_2.csv
A validation/california_validation_1.csv
A validation/california_validation_3.csv
A train/california_train_4.csv
A train/california_train_5.csv
A train/california_train_2.csv
A train/california_train_3.csv
A train/california_train_1.csv
A test/california_test.csv
9 files added and 9 files fetched
Starting the training.
Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']
Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']
```

## Part 4: Hosting your model in SageMaker

In [None]:
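from sagemaker.serializers import CSVSerializer

# CSVSerializer is the SageMaker Python SDK v2 replacement for the deprecated `csv_serializer`
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    serializer=CSVSerializer(),
)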

### Fetch the testing data

Use DVC to download a local copy of the test data that the SageMaker Processing Job stored in S3.

In [None]:
%%sh

cd sagemaker-dvc-sample

# get all remote branches
git fetch --all

# move to the dvc-trial-multi-files branch
git checkout dvc-trial-multi-files

# gather the data (for testing purposes)
dvc pull

Prepare the data

In [None]:
test = pd.read_csv("./sagemaker-dvc-sample/dataset/test/california_test.csv",header=None)
X_test = test.iloc[:, 1:].values
y_test = test.iloc[:, 0:1].values

### Invoke endpoint with the Python SDK

In [None]:
predicted = predictor.predict(X_test).decode('utf-8').split('\n')
for i in range(len(predicted)-1):
 print(f"predicted: {predicted[i]}, actual: {y_test[i][0]}")
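To summarize the comparison in a single number, you could compute the median absolute error over the test set. The sketch below assumes that "AE" in the training metric stands for absolute error and that the endpoint returns one prediction per line, as the cell above suggests. The numbers are made up for illustration and are not real endpoint output.

```python
import statistics


def median_absolute_error(response: bytes, actual: list) -> float:
    """Parse a newline-separated prediction payload and compare it to targets."""
    predicted = [float(v) for v in response.decode("utf-8").split("\n") if v]
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return statistics.median(errors)


# Made-up payload and targets, for illustration only.
example_response = b"2.0\n3.5\n1.0\n"
example_actual = [2.5, 3.0, 3.0]
print(median_absolute_error(example_response, example_actual))
```

With the real endpoint, you would pass the raw bytes returned by `predictor.predict(X_test)` and the first column of `y_test`.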

### Delete the Endpoint

Make sure to delete the endpoint to avoid unexpected costs.

In [None]:
predictor.delete_endpoint()

### (Optional) Delete the Experiment and all Trials and TrialComponents

In [None]:
my_experiment.delete_all(action="--force")

### (Optional) Delete the AWS CodeCommit repository

In [None]:
!aws codecommit delete-repository --repository-name sagemaker-dvc-sample

### (Optional) Delete the Amazon ECR repositories

In [None]:
!aws ecr delete-repository --repository-name sagemaker-catboost-dvc --force
!aws ecr delete-repository --repository-name sagemaker-processing-dvc --force