{ "cells": [ { "cell_type": "markdown", "id": "def06897", "metadata": {}, "source": [ "# Prerequisite\n", "\n", "This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).\n", "\n", "# Training a CatBoost regression model with data from DVC\n", "\n", "This notebook will guide you through an example that shows you how to build a Docker containers for SageMaker and use it for processing, training, and inference in conjunction with [DVC](https://dvc.org/).\n", "\n", "By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.\n", "\n", "### California Housing dataset\n", "\n", "We use the California Housing dataset, present in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). \n", "\n", "The California Housing dataset was originally published in:\n", "\n", "Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto-regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n", "\n", "### DVC\n", "\n", "DVC is built to make machine learning (ML) models shareable and reproducible.\n", "It is designed to handle large files, data sets, machine learning models, and metrics as well as code." ] }, { "cell_type": "markdown", "id": "21f94258", "metadata": {}, "source": [ "## Part 1: Configure DVC for data versioning\n", "\n", "Let us create a subdirectory where we prepare the data, i.e. `sagemaker-dvc-sample`.\n", "Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).\n", "The `dvc` configurations and files for data tracking will be versioned in this repository.\n", "Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use any of the aforementioned tools that best fit your workflow.\n", "\n", "One of the great advantage of using AWS CodeCommit in this context is its native integration with IAM for authentication purposes, meaning we can use SageMaker execution role to interact with the git server without the need to worry about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other version control system based on git such as GitHub, Gitlab, or Bitbucket, keeping in mind you will need to handle the credentials in a secure manner, for example, by introducing Amazon Secret Managers to store and pull credentials at run time in the notebook as well as the processing and training jobs.\n", "\n", "Setting the appropriate permissions on SageMaker execution role will also allow the SageMaker processing and training job to interact securely with the AWS CodeCommit." 
] }, { "cell_type": "code", "execution_count": null, "id": "f7ddaba7", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "## Create the repository\n", "\n", "repo_name=\"sagemaker-dvc-sample\"\n", "\n", "aws codecommit create-repository --repository-name ${repo_name} --repository-description \"Sample repository to describe how to use dvc with sagemaker and codecommit\"\n", "\n", "account=$(aws sts get-caller-identity --query Account --output text)\n", "\n", "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", "region=${region:-eu-west-1}\n", "\n", "## repo_name is already in the .gitignore of the root repo\n", "\n", "mkdir -p ${repo_name}\n", "cd ${repo_name}\n", "\n", "# initalize new repo in subfolder\n", "git init\n", "## Change the remote to the codecommit\n", "git remote add origin https://git-codecommit.\"${region}\".amazonaws.com/v1/repos/\"${repo_name}\"\n", "\n", "# Configure git - change it according to your needs\n", "git config --global user.email \"sagemaker-studio-user@example.com\"\n", "git config --global user.name \"SageMaker Studio User\"\n", "\n", "git config --global credential.helper '!aws codecommit credential-helper $@'\n", "git config --global credential.UseHttpPath true\n", "\n", "# Initialize dvc\n", "dvc init\n", "\n", "git commit -m 'Add dvc configuration'\n", "\n", "# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket\n", "dvc remote add -d storage s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc\n", "git commit .dvc/config -m \"initialize DVC local remote\"\n", "\n", "# set the DVC cache to S3\n", "dvc remote add s3cache s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc/cache\n", "dvc config cache.s3 s3cache\n", "\n", "# disable sending anonymized data to dvc for troubleshooting\n", "dvc config core.analytics false\n", "\n", "git add .dvc/config\n", "git commit -m 'update dvc config'\n", "\n", "git push --set-upstream origin master #--force" ] }, { "cell_type": "markdown", "id": "ba876ca1", "metadata": {}, "source": [ "## Part 2: Packaging and Uploading your container images for use with Amazon SageMaker\n", "\n", "### An overview of Docker\n", "\n", "If you're familiar with Docker already, you can skip ahead to the next section.\n", "\n", "For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. \n", "\n", "Docker provides a simple way to package arbitrary code into an _image_ that is totally self-contained. Once you have an image, you can use Docker to run a _container_ based on that image. Running a container is just like running a program on the machine except that the container creates a fully self-contained environment for the program to run. Containers are isolated from each other and from the host environment, so the way you set up your program is the way it runs, no matter where you run it.\n", "\n", "Docker is more powerful than environment managers like conda or virtualenv because (a) it is completely language independent and (b) it comprises your whole operating environment, including startup commands, environment variable, etc.\n", "\n", "In some ways, a Docker container is like a virtual machine, but it is much lighter weight. 
For example, a program running in a container can start in less than a second, and many containers can run on the same physical machine or virtual machine instance.\n", "\n", "Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.\n", "\n", "Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].\n", "\n", "Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.\n", "\n", "In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.\n", "\n", "Some helpful links:\n", "\n", "* [Docker home page](http://www.docker.com)\n", "* [Getting started with Docker](https://docs.docker.com/get-started/)\n", "* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)\n", "* [`docker run` reference](https://docs.docker.com/engine/reference/run/)\n", "\n", "[Amazon ECS]: https://aws.amazon.com/ecs/" ] }, { "cell_type": "markdown", "id": "31ba56ff", "metadata": {}, "source": [ "### SageMaker Docker container for processing\n", "\n", "Let us now build and register the container for processing. In doing so, we ensure that all `dvc`-related dependencies are already installed, so we do not need to `pip install` or `git configure` anything within the processing scripts and can concentrate on data preparation and feature engineering.\n", "\n", "We aim to have a single processing image to which we can supply our own processing script. More information on how to build your own processing container can be found [here](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-container-run-scripts.html). For a formal specification that defines the contract for an Amazon SageMaker Processing container, see [Build Your Own Processing Container (Advanced Scenario)](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html)." ] }, { "cell_type": "code", "execution_count": null, "id": "df0f1445", "metadata": {}, "outputs": [], "source": [ "!cat container/processing/Dockerfile" ] }, { "cell_type": "markdown", "id": "b61c7fc6", "metadata": {}, "source": [ "### Building and registering the containers\n", "\n", "We will use [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) to store our container images.\n", "\n", "To easily build custom container images from your Studio notebooks, we use the [SageMaker Docker Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli). For more information on the SageMaker Docker Build CLI, interested readers can refer to [this blogpost](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/)."
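] }, { "cell_type": "markdown", "id": "3c4d5e6f", "metadata": {}, "source": [ "For orientation, the next cell is a small sketch (added for clarity, not part of the build itself) showing how the fully qualified ECR image URI used later in this notebook is composed from your account and region. It only derives the URI string; it does not create or verify the repository." ] }, { "cell_type": "code", "execution_count": null, "id": "4d5e6f7a", "metadata": {}, "outputs": [], "source": [ "# Sketch: compose the ECR image URI that the build below will produce.\n", "# \"sagemaker-processing-dvc\" matches the image_name used in the next cell.\n", "import boto3\n", "\n", "session = boto3.Session()\n", "region = session.region_name or \"eu-west-1\"  # same default as the shell cells\n", "account = session.client(\"sts\").get_caller_identity()[\"Account\"]\n", "print(f\"{account}.dkr.ecr.{region}.amazonaws.com/sagemaker-processing-dvc:latest\")"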
] }, { "cell_type": "code", "execution_count": null, "id": "fa713178", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "# The name of the image\n", "image_name=sagemaker-processing-dvc\n", "\n", "cd container/processing\n", "\n", "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", "region=${region:-eu-west-1}\n", "\n", "# If the repository doesn't exist in ECR, create it.\n", "aws ecr describe-repositories --region \"${region}\" --repository-names \"${image_name}\" > /dev/null 2>&1\n", "\n", "if [ $? -ne 0 ]\n", "then\n", " aws ecr create-repository --region \"${region}\" --repository-name \"${image_name}\" > /dev/null\n", "fi\n", "\n", "sm-docker build . --repository \"${image_name}:latest\"" ] }, { "cell_type": "markdown", "id": "2ac30a72", "metadata": {}, "source": [ "### SageMaker Docker container for training and hosting\n", "\n", "Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:\n", "\n", "* In the example here, we don't define an `ENTRYPOINT` in the Dockerfile so Docker will run the command `train` at training time and `serve` at serving time. In this example, we define these as executable Python scripts, but they could be any program that we want to start in that environment.\n", "* If you specify a program as an `ENTRYPOINT` in the Dockerfile, that program will be run at startup and its first argument will be `train` or `serve`. The program can then look at that argument and decide what to do.\n", "* If you are building separate containers for training and hosting (or building only for one or the other), you can define a program as an `ENTRYPOINT` in the Dockerfile and ignore (or verify) the first argument passed in. \n", "\n", "\n", "#### Running your container during training\n", "\n", "When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory:\n", "\n", " /opt/ml\n", " |-- input\n", " | |-- config\n", " | | |-- hyperparameters.json\n", " | | `-- resourceConfig.json\n", " | `-- data\n", " | `-- \n", " | `-- \n", " |-- model\n", " | `-- \n", " `-- output\n", " `-- failure\n", "\n", "##### The input\n", "\n", "* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.\n", "* `/opt/ml/input/data//` (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it's generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. \n", "* `/opt/ml/input/data/_` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. 
"\n", "##### The output\n", "\n", "* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.\n", "* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file, as it will be ignored.\n", "\n", "#### Running your container during hosting\n", "\n", "Hosting has a very different model from training, because hosting responds to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests.\n", "\n", "This stack is implemented in the sample code here, and you can mostly just leave it alone.\n", "\n", "Amazon SageMaker uses two URLs in the container:\n", "\n", "* `/ping` will receive `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.\n", "* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well.\n", "\n", "The container will have the model files in the same place they were written during training:\n", "\n", "    /opt/ml\n", "    `-- model\n", "        `-- <model files>"
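] }, { "cell_type": "markdown", "id": "5e6f7a8b", "metadata": {}, "source": [ "To make the contract concrete, here is a minimal sketch of a serving app that answers the two URLs above. It is illustrative only; the sample's actual `predictor.py`, described below, is more complete.\n", "\n", "```python\n", "# Sketch of the serving contract: /ping for health checks, /invocations for inference\n", "import flask\n", "\n", "app = flask.Flask(__name__)\n", "\n", "@app.route(\"/ping\", methods=[\"GET\"])\n", "def ping():\n", "    # Return 200 when the container is healthy and ready to serve requests\n", "    return flask.Response(status=200)\n", "\n", "@app.route(\"/invocations\", methods=[\"POST\"])\n", "def invocations():\n", "    payload = flask.request.data.decode(\"utf-8\")\n", "    # ...parse the CSV payload, run the model, and return CSV predictions...\n", "    return flask.Response(response=\"0.0\\n\", status=200, mimetype=\"text/csv\")\n", "```"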
] }, { "cell_type": "markdown", "id": "afb0b22a", "metadata": {}, "source": [ "### The parts of the training and inference container\n", "\n", "The `container/train_and_serve` directory contains all the components you need to package the sample algorithm for Amazon SageMaker:\n", "\n", "    .\n", "    |-- Dockerfile\n", "    |-- README.md\n", "    `-- catboost_regressor\n", "        |-- nginx.conf\n", "        |-- predictor.py\n", "        |-- serve\n", "        |-- train\n", "        `-- wsgi.py\n", "\n", "Let's discuss each of these in turn:\n", "\n", "* __`Dockerfile`__ describes how to build your Docker container image. More details below.\n", "* __`catboost_regressor`__ is the directory which contains the files that will be installed in the container.\n", "\n", "In this simple application, we only install five files in the container. You may only need that many or, if you have many supporting routines, you may wish to install more. These five show the standard structure of our Python containers, although you are free to choose a different toolset and therefore could have a different layout. If you're writing in a different programming language, you'll certainly have a different layout depending on the frameworks and tools you choose.\n", "\n", "The files that we'll put in the container are:\n", "\n", "* __`nginx.conf`__ is the configuration file for the nginx front-end. Generally, you should be able to take this file as-is.\n", "* __`predictor.py`__ is the program that actually implements the Flask web server and the CatBoost predictions for this app. You'll want to customize the actual prediction parts to your application. Since this algorithm is simple, we do all the processing here in this file, but you may choose to have separate files for implementing your custom logic.\n", "* __`serve`__ is the program started when the container is started for hosting. It simply launches the gunicorn server, which runs multiple instances of the Flask app defined in `predictor.py`. You should be able to take this file as-is.\n", "* __`train`__ is the program that is invoked when the container is run for training. You will modify this program to implement your training algorithm.\n", "* __`wsgi.py`__ is a small wrapper used to invoke the Flask app. You should be able to take this file as-is.\n", "\n", "In summary, the two files you will probably want to change for your application are `train` and `predictor.py`." ] }, { "cell_type": "code", "execution_count": null, "id": "7b6b75bf", "metadata": {}, "outputs": [], "source": [ "!cat container/train_and_serve/Dockerfile" ] }, { "cell_type": "code", "execution_count": null, "id": "61992f99", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "# The name of our algorithm\n", "algorithm_name=sagemaker-catboost-dvc\n", "\n", "cd container/train_and_serve\n", "\n", "chmod +x catboost_regressor/train\n", "chmod +x catboost_regressor/serve\n", "\n", "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n", "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n", "region=${region:-eu-west-1}\n", "\n", "# If the repository doesn't exist in ECR, create it.\n", "aws ecr describe-repositories --region \"${region}\" --repository-names \"${algorithm_name}\" > /dev/null 2>&1\n", "\n", "if [ $? -ne 0 ]\n", "then\n", "    aws ecr create-repository --region \"${region}\" --repository-name \"${algorithm_name}\" > /dev/null\n", "fi\n", "\n", "sm-docker build . --repository \"${algorithm_name}:latest\"" ] }, { "cell_type": "markdown", "id": "0c337c82", "metadata": {}, "source": [ "## Part 3: Processing and Training with DVC and SageMaker\n", "\n", "In this section, we explore two different approaches to tackle our problem and show how we can keep track of the two tests using SageMaker Experiments.\n", "\n", "The high-level conceptual architecture is depicted in the figure below.\n", "\n", "\n", "\n", "The following sections unfold the implementation details of the two experiments.\n", "\n", "### Import libraries and initial setup\n", "\n", "Let's start by importing the libraries and setting up variables that will be useful as we go along in the notebook."
] }, { "cell_type": "code", "execution_count": null, "id": "feead477", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import sagemaker\n", "import time\n", "from time import strftime\n", "\n", "boto_session = boto3.Session()\n", "sagemaker_session = sagemaker.Session(boto_session=boto_session)\n", "sm_client = boto3.client(\"sagemaker\")\n", "region = boto_session.region_name\n", "bucket = sagemaker_session.default_bucket()\n", "role = sagemaker.get_execution_role()\n", "account = sagemaker_session.boto_session.client(\"sts\").get_caller_identity()[\"Account\"]\n", "\n", "prefix = 'DEMO-sagemaker-experiments-dvc'\n", "\n", "print(f\"account: {account}\")\n", "print(f\"bucket: {bucket}\")\n", "print(f\"region: {region}\")\n", "print(f\"role: {role}\")" ] }, { "cell_type": "markdown", "id": "aae3a438", "metadata": {}, "source": [ "### Prepare raw data\n", "\n", "We upload the raw data to S3 in the default bucket." ] }, { "cell_type": "code", "execution_count": null, "id": "4d9dfa8b", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.datasets import fetch_california_housing\n", "from sklearn.model_selection import train_test_split\n", "\n", "from pathlib import Path\n", "\n", "databunch = fetch_california_housing()\n", "dataset = np.concatenate((databunch[\"target\"].reshape(-1, 1), databunch[\"data\"]), axis=1)\n", "\n", "print(f\"Dataset shape = {dataset.shape}\")\n", "np.savetxt(\"dataset.csv\", dataset, delimiter=\",\")\n", "\n", "data_prefix_path = f\"{prefix}/input/dataset.csv\"\n", "s3_data_path = f\"s3://{bucket}/{data_prefix_path}\"\n", "print(f\"Raw data location in S3: {s3_data_path}\")\n", "\n", "s3 = boto3.client(\"s3\")\n", "s3.upload_file(\"dataset.csv\", bucket, data_prefix_path)" ] }, { "cell_type": "markdown", "id": "622d14ce", "metadata": {}, "source": [ "### Setup SageMaker Experiments\n", "\n", "Amazon SageMaker Experiments have been built for data scientists that are performing different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.\n", "\n", "Let’s start first with an overview of Amazon SageMaker Experiments features:\n", "\n", "* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.\n", "* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. Amazon SageMaker Experiments offers the possibility to automatically assign the sagemaker jobs to a trial specifying the `experiment_config` argument, or to manually call the tracking APIs.\n", "* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.\n", "\n", "Now, in order to track this test in SageMaker, we need to create an experiment." 
] }, { "cell_type": "code", "execution_count": null, "id": "4a2af348", "metadata": {}, "outputs": [], "source": [ "from smexperiments.experiment import Experiment\n", "from smexperiments.trial import Trial\n", "from smexperiments.trial_component import TrialComponent\n", "from smexperiments.tracker import Tracker\n", "\n", "experiment_name = 'DEMO-sagemaker-experiments-dvc'\n", "\n", "# create the experiment if it doesn't exist\n", "try:\n", " my_experiment = Experiment.load(experiment_name=experiment_name)\n", " print(\"existing experiment loaded\")\n", "except Exception as ex:\n", " if \"ResourceNotFound\" in str(ex):\n", " my_experiment = Experiment.create(\n", " experiment_name = experiment_name,\n", " description = \"How to integrate DVC\"\n", " )\n", " print(\"new experiment created\")\n", " else:\n", " print(f\"Unexpected {ex}=, {type(ex)}\")\n", " print(\"Dont go forward!\")\n", " raise" ] }, { "cell_type": "markdown", "id": "0947c82b", "metadata": {}, "source": [ "We need to also define trials within the experiment.\n", "While it is possible to have any number of trials within an experiment, for our excercise, we will create 2 trials, one for each processing strategy.\n", "\n", "### Test 1: generate single files for training and validation\n", "\n", "In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, process it to create the triplet `train`, `validation` and `test`, and store the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments." ] }, { "cell_type": "code", "execution_count": null, "id": "3a34b0c8", "metadata": {}, "outputs": [], "source": [ "first_trial_name = \"dvc-trial-single-file\"\n", "\n", "try:\n", " my_first_trial = Trial.load(trial_name=first_trial_name)\n", " print(\"existing trial loaded\")\n", "except Exception as ex:\n", " if \"ResourceNotFound\" in str(ex):\n", " my_first_trial = Trial.create(\n", " experiment_name=experiment_name,\n", " trial_name=first_trial_name,\n", " )\n", " print(\"new trial created\")\n", " else:\n", " print(f\"Unexpected {ex}=, {type(ex)}\")\n", " print(\"Dont go forward!\")\n", " raise" ] }, { "cell_type": "markdown", "id": "d9471d00", "metadata": {}, "source": [ "### Processing script: version data with DVC\n", "\n", "The processing script takes as arguments the address of the git repository, and the branch we want to create to store the `dvc` metadata. The datasets themselves will be then stored in S3. The arguments passed to the processing scripts are not automatically tracked in SageMaker Experiments in the automatically generated TrialComponent. The TrialComponent generated by SageMaker can be loaded within the Processing Job and further enrich with any extra data, which then become available for visualization in the SageMaker Studio UI. 
] }, { "cell_type": "code", "execution_count": null, "id": "65896ab5", "metadata": {}, "outputs": [], "source": [ "!pygmentize 'source_dir/preprocessing-experiment.py'" ] }, { "cell_type": "markdown", "id": "e1cd9076", "metadata": {}, "source": [ "### SageMaker Processing job\n", "\n", "We now have all the ingredients to execute our SageMaker Processing Job:\n", "* a custom image with dvc installed\n", "* a git repository (i.e., AWS CodeCommit)\n", "* a processing script that accepts several arguments (i.e., `--train-test-split-ratio`, `--dvc-repo-url`, `--dvc-branch`)\n", "* a SageMaker Experiment and a Trial" ] }, { "cell_type": "code", "execution_count": null, "id": "62fa75e3", "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n", "\n", "dvc_repo_url = \"codecommit::{}://sagemaker-dvc-sample\".format(region)\n", "dvc_branch = my_first_trial.trial_name\n", "\n", "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest\".format(account, region)\n", "\n", "script_processor = ScriptProcessor(\n", "    command=['python3'],\n", "    image_uri=image,\n", "    role=role,\n", "    instance_count=1,\n", "    instance_type='ml.m5.xlarge',\n", "    env={\n", "        \"DVC_REPO_URL\": dvc_repo_url,\n", "        \"DVC_BRANCH\": dvc_branch,\n", "        \"USER\": \"sagemaker\"\n", "    },\n", ")\n", "\n", "experiment_config={\n", "    \"ExperimentName\": my_experiment.experiment_name,\n", "    \"TrialName\": my_first_trial.trial_name\n", "}" ] }, { "cell_type": "markdown", "id": "baf3d39c", "metadata": {}, "source": [ "Executing the processing job will take around 3-4 minutes." ] }, { "cell_type": "code", "execution_count": null, "id": "ba0c629f", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "script_processor.run(\n", "    code='source_dir/preprocessing-experiment.py',\n", "    inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", "    experiment_config=experiment_config,\n", "    arguments=[\"--train-test-split-ratio\", \"0.2\"],\n", ")" ] }, { "cell_type": "markdown", "id": "e4aa1f57", "metadata": {}, "source": [ "### Create an estimator and fit the model" ] }, { "cell_type": "markdown", "id": "f2be8cc5", "metadata": {}, "source": [ "To use DVC integration, pass a `dvc_repo_url` and `dvc_branch` as parameters when you create the Estimator object.\n", "\n", "We will train on the `dvc-trial-single-file` branch first.\n", "\n", "When doing `dvc pull` in the training script, the following dataset structure will be generated:\n", "\n", "```\n", "dataset\n", "|-- train\n", "|   |-- california_train.csv\n", "|-- test\n", "|   |-- california_test.csv\n", "|-- validation\n", "|   |-- california_validation.csv\n", "```\n", "\n", "#### Metric definition\n", "\n", "SageMaker sends every log written to STDOUT to CloudWatch. In order to capture the metrics we are interested in, we need to specify a metric definition object to define the format of the metrics via regex.
By doing so, SageMaker will know how to capture the metrics from the CloudWatch logs of the training job.\n", "\n", "In our case, we are interested in the median error.\n", "```\n", "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "c5dbce5d", "metadata": {}, "outputs": [], "source": [ "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest\".format(account, region)\n", "\n", "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n", "\n", "hyperparameters={\n", "    \"learning_rate\": 1,\n", "    \"depth\": 6\n", "}\n", "\n", "estimator = sagemaker.estimator.Estimator(\n", "    image,\n", "    role,\n", "    instance_count=1,\n", "    metric_definitions=metric_definitions,\n", "    instance_type=\"ml.m5.large\",\n", "    sagemaker_session=sagemaker_session,\n", "    hyperparameters=hyperparameters,\n", "    environment={\n", "        \"DVC_REPO_URL\": dvc_repo_url,\n", "        \"DVC_BRANCH\": dvc_branch,\n", "        \"USER\": \"sagemaker\"\n", "    }\n", ")\n", "\n", "experiment_config={\n", "    \"ExperimentName\": my_experiment.experiment_name,\n", "    \"TrialName\": my_first_trial.trial_name\n", "}" ] }, { "cell_type": "code", "execution_count": null, "id": "2f766d30", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "estimator.fit(experiment_config=experiment_config)" ] }, { "cell_type": "markdown", "id": "f83c9dde", "metadata": {}, "source": [ "In the logs above you can see lines like the following, indicating the files pulled by dvc:\n", "\n", "```\n", "Running dvc pull command\n", "A train/california_train.csv\n", "A test/california_test.csv\n", "A validation/california_validation.csv\n", "3 files added and 3 files fetched\n", "Starting the training.\n", "Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", "Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n", "```" ] }, { "cell_type": "markdown", "id": "e6bb08ce", "metadata": {}, "source": [ "### Test 2: generate multiple files for training and validation" ] }, { "cell_type": "code", "execution_count": null, "id": "e24154ae", "metadata": {}, "outputs": [], "source": [ "second_trial_name = \"dvc-trial-multi-files\"\n", "\n", "try:\n", "    my_second_trial = Trial.load(trial_name=second_trial_name)\n", "    print(\"existing trial loaded\")\n", "except Exception as ex:\n", "    if \"ResourceNotFound\" in str(ex):\n", "        my_second_trial = Trial.create(\n", "            experiment_name=experiment_name,\n", "            trial_name=second_trial_name,\n", "        )\n", "        print(\"new trial created\")\n", "    else:\n", "        print(f\"Unexpected error: {ex}, {type(ex)}\")\n", "        print(\"Don't go forward!\")\n", "        raise" ] }, { "cell_type": "markdown", "id": "dbe33f33", "metadata": {}, "source": [ "Unlike the first processing script, we now create multiple files for training and validation out of the original dataset, and store the `dvc` metadata in a different branch."
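] }, { "cell_type": "markdown", "id": "7a8b9c0d", "metadata": {}, "source": [ "The core idea of the multi-file variant is to shard each split into several CSV files before adding them to `dvc`. Below is a minimal sketch of that step, assuming a pandas DataFrame `train` and an existing `dataset/train/` directory; the actual script, `preprocessing-experiment-multifiles.py`, is printed in the next cell.\n", "\n", "```python\n", "import numpy as np\n", "\n", "# Sketch: shard the training set into 5 CSV files before `dvc add`\n", "for i, part in enumerate(np.array_split(train, 5), start=1):\n", "    part.to_csv(f\"dataset/train/california_train_{i}.csv\", header=False, index=False)\n", "```"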
] }, { "cell_type": "code", "execution_count": null, "id": "0a045eb4", "metadata": {}, "outputs": [], "source": [ "!pygmentize 'code/preprocessing-experiment-multifiles.py'" ] }, { "cell_type": "code", "execution_count": null, "id": "167185e8", "metadata": {}, "outputs": [], "source": [ "from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput\n", "\n", "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-processing-dvc:latest\".format(account, region)\n", "\n", "dvc_branch = my_second_trial.trial_name\n", "\n", "script_processor = ScriptProcessor(command=['python3'],\n", " image_uri=image,\n", " role=role,\n", " instance_count=1,\n", " instance_type='ml.m5.xlarge',\n", " env={\n", " \"DVC_REPO_URL\": dvc_repo_url,\n", " \"DVC_BRANCH\": dvc_branch,\n", " \"USER\": \"sagemaker\"\n", " },\n", " )\n", "\n", "experiment_config={\n", " \"ExperimentName\": my_experiment.experiment_name,\n", " \"TrialName\": my_second_trial.trial_name\n", "}" ] }, { "cell_type": "markdown", "id": "14b26ae5", "metadata": {}, "source": [ "Executing the processing job will take ~5 minutes" ] }, { "cell_type": "code", "execution_count": null, "id": "46d29d7a", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "script_processor.run(\n", " code='source_dir/preprocessing-experiment-multifiles.py',\n", " inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n", " experiment_config=experiment_config,\n", " arguments=[\"--train-test-split-ratio\", \"0.1\"],\n", ")" ] }, { "cell_type": "markdown", "id": "1bb32869", "metadata": {}, "source": [ "We will now train on the `dvc-trial-multi-files` branch.\n", "\n", "When doing `dvc pull`, this is the dataset structure:\n", "\n", "```\n", "dataset\n", " |-- train\n", " | |-- california_train_1.csv\n", " | |-- california_train_2.csv\n", " | |-- california_train_3.csv\n", " | |-- california_train_4.csv\n", " | |-- california_train_5.csv\n", " |-- test\n", " | |-- california_test.csv\n", " |-- validation\n", " | |-- california_validation_1.csv\n", " | |-- california_validation_2.csv\n", " | |-- california_validation_3.csv\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "bd8db6a2", "metadata": {}, "outputs": [], "source": [ "image = \"{}.dkr.ecr.{}.amazonaws.com/sagemaker-catboost-dvc:latest\".format(account, region)\n", "\n", "hyperparameters={ \n", " \"learning_rate\" : 1,\n", " \"depth\": 6\n", " }\n", "\n", "estimator = sagemaker.estimator.Estimator(\n", " image,\n", " role,\n", " instance_count=1,\n", " metric_definitions=metric_definitions,\n", " instance_type=\"ml.m5.large\",\n", " sagemaker_session=sagemaker_session,\n", " hyperparameters=hyperparameters,\n", " environment={\n", " \"DVC_REPO_URL\": dvc_repo_url,\n", " \"DVC_BRANCH\": dvc_branch,\n", " \"USER\": \"sagemaker\"\n", " }\n", ")\n", "\n", "experiment_config={\n", " \"ExperimentName\": my_experiment.experiment_name,\n", " \"TrialName\": my_second_trial.trial_name,\n", "}" ] }, { "cell_type": "markdown", "id": "c6e0e936", "metadata": {}, "source": [ "The training job will take aroudn ~5 minutes" ] }, { "cell_type": "code", "execution_count": null, "id": "1e216113", "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "estimator.fit(experiment_config=experiment_config)" ] }, { "cell_type": "markdown", "id": "f767fa22", "metadata": {}, "source": [ "On the logs above you can see those lines, indicating about the files pulled by dvc:\n", "\n", "```\n", "Running dvc pull command\n", "A 
validation/california_validation_2.csv\n", "A validation/california_validation_1.csv\n", "A validation/california_validation_3.csv\n", "A train/california_train_4.csv\n", "A train/california_train_5.csv\n", "A train/california_train_2.csv\n", "A train/california_train_3.csv\n", "A train/california_train_1.csv\n", "A test/california_test.csv\n", "9 files added and 9 files fetched\n", "Starting the training.\n", "Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']\n", "Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']\n", "```" ] }, { "cell_type": "markdown", "id": "0bd85708", "metadata": {}, "source": [ "## Part 4: Hosting your model in SageMaker" ] }, { "cell_type": "code", "execution_count": null, "id": "48387d98", "metadata": {}, "outputs": [], "source": [ "from sagemaker.predictor import csv_serializer\n", "\n", "predictor = estimator.deploy(1, \"ml.t2.medium\", serializer=csv_serializer)" ] }, { "cell_type": "markdown", "id": "57bb1f58", "metadata": {}, "source": [ "### Fetch the testing data\n", "\n", "Save locally the test data stored in S3 via DVC created by the SageMaker Processing Job." ] }, { "cell_type": "code", "execution_count": null, "id": "428729d2", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "\n", "cd sagemaker-dvc-sample\n", "\n", "# get all remote branches\n", "git fetch --all\n", "\n", "# move to the ddvc-trial-multi-files\n", "git checkout dvc-trial-multi-files\n", "\n", "# gather the data (for testing purpuse)\n", "dvc pull" ] }, { "cell_type": "markdown", "id": "2a2d99e5", "metadata": {}, "source": [ "Prepare the data" ] }, { "cell_type": "code", "execution_count": null, "id": "bfc4eeb7", "metadata": {}, "outputs": [], "source": [ "test = pd.read_csv(\"./sagemaker-dvc-sample/dataset/test/california_test.csv\",header=None)\n", "X_test = test.iloc[:, 1:].values\n", "y_test = test.iloc[:, 0:1].values" ] }, { "cell_type": "markdown", "id": "3594f09e", "metadata": {}, "source": [ "## Invoke endpoint with the Python SDK" ] }, { "cell_type": "code", "execution_count": null, "id": "ff43812f", "metadata": {}, "outputs": [], "source": [ "predicted = predictor.predict(X_test).decode('utf-8').split('\\n')\n", "for i in range(len(predicted)-1):\n", " print(f\"predicted: {predicted[i]}, actual: {y_test[i][0]}\")" ] }, { "cell_type": "markdown", "id": "103db7d0", "metadata": {}, "source": [ "### Delete the Endpoint\n", "\n", "Make sure to delete the endpoint to avoid un-expected costs" ] }, { "cell_type": "code", "execution_count": null, "id": "6112817a", "metadata": {}, "outputs": [], "source": [ "predictor.delete_endpoint()" ] }, { "cell_type": "markdown", "id": "71c48355", "metadata": {}, "source": [ "### (Optional) Delete the Experiment, and all Trails, TrialComponents" ] }, { "cell_type": "code", "execution_count": null, "id": "7ba71c3c", "metadata": {}, "outputs": [], "source": [ "my_experiment.delete_all(action=\"--force\")" ] }, { "cell_type": "markdown", "id": "5a4612c2", "metadata": {}, "source": [ "### (Optional) Delete the AWS CodeCommit repository" ] }, { "cell_type": "code", "execution_count": null, "id": "7f5db756", 
"metadata": {}, "outputs": [], "source": [ "!aws codecommit delete-repository --repository-name sagemaker-dvc-sample" ] }, { "cell_type": "markdown", "id": "9b14fc3d", "metadata": {}, "source": [ "### (Optional) Delete the AWS ECR repositories" ] }, { "cell_type": "code", "execution_count": null, "id": "8a0705de", "metadata": {}, "outputs": [], "source": [ "!aws ecr delete-repository --repository-name sagemaker-catboost-dvc --force\n", "!aws ecr delete-repository --repository-name sagemaker-processing-dvc --force" ] }, { "cell_type": "code", "execution_count": null, "id": "93cd4af8", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python [conda env: dvc] (conda-env-dvc-kernel/latest)", "language": "python", "name": "conda-env-dvc-py__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:583558296381:image/conda-env-dvc-kernel" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }