{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5289742c",
   "metadata": {},
   "source": [
    "# Prerequisite\n",
    "\n",
    "This notebook assumes you are using the `conda-env-dvc-kernel` image built and attached to a SageMaker Studio domain. Setup guidelines are available [here](https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo/blob/main/sagemaker-studio-dvc-image/README.md).\n",
    "\n",
    "# Training a CatBoost regression model with data from DVC\n",
    "\n",
    "This notebook will guide you through an example that shows you how to build a Docker containers for SageMaker and use it for processing, training, and inference in conjunction with [DVC](https://dvc.org/).\n",
    "\n",
    "By packaging libraries and algorithms in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies.\n",
    "\n",
    "### California Housing dataset\n",
    "\n",
    "We use the California Housing dataset, present in [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). \n",
    "\n",
    "The California Housing dataset was originally published in:\n",
    "\n",
    "Pace, R. Kelley, and Ronald Barry. \"Sparse spatial auto-regressions.\" Statistics & Probability Letters 33.3 (1997): 291-297.\n",
    "\n",
    "### DVC\n",
    "\n",
    "DVC is built to make machine learning (ML) models shareable and reproducible.\n",
    "It is designed to handle large files, data sets, machine learning models, and metrics as well as code."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f55d6f3f",
   "metadata": {},
   "source": [
    "## Part 1: Configure DVC for data versioning\n",
    "\n",
    "Let us create a subdirectory where we prepare the data, i.e. `sagemaker-dvc-sample`.\n",
    "Within this subdirectory, we initialize a new git repository and set the remote to a repository we create in [AWS CodeCommit](https://aws.amazon.com/codecommit/).\n",
    "The `dvc` configurations and files for data tracking will be versioned in this repository.\n",
    "Git offers native capabilities to manage subprojects via, for example, `git submodules` and `git subtrees`, and you can extend this notebook to use any of the aforementioned tools that best fit your workflow.\n",
    "\n",
    "One of the great advantage of using AWS CodeCommit in this context is its native integration with IAM for authentication purposes, meaning we can use SageMaker execution role to interact with the git server without the need to worry about how to store and retrieve credentials. Of course, you can always replace AWS CodeCommit with any other version control system based on git such as GitHub, Gitlab, or Bitbucket, keeping in mind you will need to handle the credentials in a secure manner, for example, by introducing [Amazon Secret Managers](https://aws.amazon.com/secrets-manager/) to store and pull credentials at run time in the notebook as well as the processing and training jobs.\n",
    "\n",
    "Setting the appropriate permissions on SageMaker execution role will also allow the SageMaker processing and training job to interact securely with the AWS CodeCommit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8952c81c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "\n",
    "## Create the repository\n",
    "\n",
    "repo_name=\"sagemaker-dvc-sample\"\n",
    "\n",
    "aws codecommit create-repository --repository-name ${repo_name} --repository-description \"Sample repository to describe how to use dvc with sagemaker and codecommit\"\n",
    "\n",
    "account=$(aws sts get-caller-identity --query Account --output text)\n",
    "\n",
    "# Get the region defined in the current configuration (default to eu-west-1 if none defined)\n",
    "region=$(python -c \"import boto3;print(boto3.Session().region_name)\")\n",
    "region=${region:-eu-west-1}\n",
    "\n",
    "## repo_name is already in the .gitignore of the root repo\n",
    "\n",
    "mkdir -p ${repo_name}\n",
    "cd ${repo_name}\n",
    "\n",
    "# initalize new repo in subfolder\n",
    "git init\n",
    "## Change the remote to the codecommit\n",
    "git remote add origin https://git-codecommit.\"${region}\".amazonaws.com/v1/repos/\"${repo_name}\"\n",
    "\n",
    "# Configure git - change it according to your needs\n",
    "git config --global user.email \"sagemaker-studio-user@example.com\"\n",
    "git config --global user.name \"SageMaker Studio User\"\n",
    "\n",
    "git config --global credential.helper '!aws codecommit credential-helper $@'\n",
    "git config --global credential.UseHttpPath true\n",
    "\n",
    "# Initialize dvc\n",
    "dvc init\n",
    "\n",
    "git commit -m 'Add dvc configuration'\n",
    "\n",
    "# Set the DVC remote storage to S3 - uses the sagemaker standard default bucket\n",
    "dvc remote add -d storage s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc\n",
    "git commit .dvc/config -m \"initialize DVC local remote\"\n",
    "\n",
    "# set the DVC cache to S3\n",
    "dvc remote add s3cache s3://sagemaker-\"${region}\"-\"${account}\"/DEMO-sagemaker-experiments-dvc/cache\n",
    "dvc config cache.s3 s3cache\n",
    "\n",
    "# disable sending anonymized data to dvc for troubleshooting\n",
    "dvc config core.analytics false\n",
    "\n",
    "git add .dvc/config\n",
    "git commit -m 'update dvc config'\n",
    "\n",
    "git push --set-upstream origin master #--force"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93d74587",
   "metadata": {},
   "source": [
    "## Part 2: Processing and Training with DVC and SageMaker\n",
    "\n",
    "In this section we explore two different approaches to tackle our problem and how we can keep track of the 2 tests using SageMaker Experiments.\n",
    "\n",
    "The high level conceptual architecture is depicted in the figure below.\n",
    "\n",
    "<img src=\"./img/high-level-architecture.png\">\n",
    "<i>Fig. 1 High level architecture</i>\n",
    "\n",
    "\n",
    "### Import libraries and initial setup\n",
    "\n",
    "Lets start by importing the libraries and setup variables that will be useful as we go along in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bdbc951e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import boto3\n",
    "import sagemaker\n",
    "import time\n",
    "from time import strftime\n",
    "\n",
    "boto_session = boto3.Session()\n",
    "sagemaker_session = sagemaker.Session(boto_session=boto_session)\n",
    "sm_client = boto3.client(\"sagemaker\")\n",
    "region = boto_session.region_name\n",
    "bucket = sagemaker_session.default_bucket()\n",
    "role = sagemaker.get_execution_role()\n",
    "account = sagemaker_session.boto_session.client(\"sts\").get_caller_identity()[\"Account\"]\n",
    "\n",
    "prefix = 'DEMO-sagemaker-experiments-dvc'\n",
    "\n",
    "print(f\"account: {account}\")\n",
    "print(f\"bucket: {bucket}\")\n",
    "print(f\"region: {region}\")\n",
    "print(f\"role: {role}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8edec916",
   "metadata": {},
   "source": [
    "### Prepare raw data\n",
    "\n",
    "We upload the raw data to S3 in the default bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cae15de4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.datasets import fetch_california_housing\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from pathlib import Path\n",
    "\n",
    "databunch = fetch_california_housing()\n",
    "dataset = np.concatenate((databunch[\"target\"].reshape(-1, 1), databunch[\"data\"]), axis=1)\n",
    "\n",
    "print(f\"Dataset shape = {dataset.shape}\")\n",
    "np.savetxt(\"dataset.csv\", dataset, delimiter=\",\")\n",
    "\n",
    "data_prefix_path = f\"{prefix}/input/dataset.csv\"\n",
    "s3_data_path = f\"s3://{bucket}/{data_prefix_path}\"\n",
    "print(f\"Raw data location in S3: {s3_data_path}\")\n",
    "\n",
    "s3 = boto3.client(\"s3\")\n",
    "s3.upload_file(\"dataset.csv\", bucket, data_prefix_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08e50d6a",
   "metadata": {},
   "source": [
    "### Setup SageMaker Experiments\n",
    "\n",
    "Amazon SageMaker Experiments have been built for data scientists that are performing different experiments as part of their model development process and want a simple way to organize, track, compare, and evaluate their machine learning experiments.\n",
    "\n",
    "Let’s start first with an overview of Amazon SageMaker Experiments features:\n",
    "\n",
    "* Organize Experiments: Amazon SageMaker Experiments structures experimentation with a first top level entity called experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, parameters, and artifacts. You can picture experiments as the top level “folder” for organizing your hypotheses, your trials as the “subfolders” for each group test run, and your trial components as your “files” for each instance of a test run.\n",
    "* Track Experiments: Amazon SageMaker Experiments allows the data scientist to track experiments automatically or manually. Amazon SageMaker Experiments offers the possibility to automatically assign the sagemaker jobs to a trial specifying the `experiment_config` argument, or to manually call the tracking APIs.\n",
    "* Compare and Evaluate Experiments: The integration of Amazon SageMaker Experiments with Amazon SageMaker Studio makes it easier to produce data visualizations and compare different trials to identify the best combination of hyperparameters.\n",
    "\n",
    "Now, in order to track this test in SageMaker, we need to create an experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc5aa1cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "from smexperiments.experiment import Experiment\n",
    "from smexperiments.trial import Trial\n",
    "from smexperiments.trial_component import TrialComponent\n",
    "from smexperiments.tracker import Tracker\n",
    "\n",
    "experiment_name = 'DEMO-sagemaker-experiments-dvc'\n",
    "\n",
    "# create the experiment if it doesn't exist\n",
    "try:\n",
    "    my_experiment = Experiment.load(experiment_name=experiment_name)\n",
    "    print(\"existing experiment loaded\")\n",
    "except Exception as ex:\n",
    "    if \"ResourceNotFound\" in str(ex):\n",
    "        my_experiment = Experiment.create(\n",
    "            experiment_name = experiment_name,\n",
    "            description = \"How to integrate DVC\"\n",
    "        )\n",
    "        print(\"new experiment created\")\n",
    "    else:\n",
    "        print(f\"Unexpected {ex}=, {type(ex)}\")\n",
    "        print(\"Dont go forward!\")\n",
    "        raise"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c13953c",
   "metadata": {},
   "source": [
    "We need to also define trials within the experiment.\n",
    "While it is possible to have any number of trials within an experiment, for our excercise, we will create 2 trials, one for each processing strategy.\n",
    "\n",
    "### Test 1: generate single files for training and validation\n",
    "\n",
    "In this test, we show how to create a processing script that fetches the raw data directly from S3 as an input, process it to create the triplet `train`, `validation` and `test`, and store the results back to S3 using `dvc`. Furthermore, we show how you can pair `dvc` with SageMaker native tracking capabilities when executing Processing and Training Jobs and via SageMaker Experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83654fb0",
   "metadata": {},
   "outputs": [],
   "source": [
    "first_trial_name = \"dvc-trial-single-file\"\n",
    "\n",
    "try:\n",
    "    my_first_trial = Trial.load(trial_name=first_trial_name)\n",
    "    print(\"existing trial loaded\")\n",
    "except Exception as ex:\n",
    "    if \"ResourceNotFound\" in str(ex):\n",
    "        my_first_trial = Trial.create(\n",
    "            experiment_name=experiment_name,\n",
    "            trial_name=first_trial_name,\n",
    "        )\n",
    "        print(\"new trial created\")\n",
    "    else:\n",
    "        print(f\"Unexpected {ex}=, {type(ex)}\")\n",
    "        print(\"Dont go forward!\")\n",
    "        raise"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93aa35a4",
   "metadata": {},
   "source": [
    "### Processing script: version data with DVC\n",
    "\n",
    "The processing script expects the address of the git repository and the branch we want to <i>create</i> to store the `dvc` metadata passed via environmental variables.\n",
    "The datasets themselves will be then stored in S3.\n",
    "Environmental variables are automatically tracked in SageMaker Experiments in the automatically generated <i>TrialComponent</i>.\n",
    "The <i>TrialComponent</i> generated by SageMaker can be loaded within the Processing Job and further enrich with any extra data, which then become available for visualization in the SageMaker Studio UI.\n",
    "In our case, we will store the following data:\n",
    "* `DVC_REPO_URL`\n",
    "* `DVC_BRANCH`\n",
    "* `USER`\n",
    "* `data_commit_hash`\n",
    "* `train_test_split_ratio`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1b391c3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize 'source_dir/preprocessing-experiment.py'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5cb3bc2f",
   "metadata": {},
   "source": [
    "### SageMaker Processing job\n",
    "\n",
    "SageMaker Processing gives us the possibility to execute our processing script on container images managed by AWS that are optimized to run on the AWS infrastructure.\n",
    "If our script requires additional dependencies, we can supply a `requirements.txt` file.\n",
    "Upon starting of the processing job, SageMaker will `pip`-install all libraries we need (e.g., `dvc`-related libraries).\n",
    "\n",
    "We have now all ingredients to execute our SageMaker Processing Job:\n",
    "* a processing script that can process several arguments (i.e., `--train-test-split-ratio`) and two environmental variables (i.e., `DVC_REPO_URL` and `DVC_BRANCH`)\n",
    "* a `requiremets.txt` file\n",
    "* a git repository (in AWS CodeCommit)\n",
    "* a SageMaker Experiment and a Trial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "400b363d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.processing import FrameworkProcessor, ProcessingInput\n",
    "from sagemaker.sklearn.estimator import SKLearn\n",
    "\n",
    "dvc_repo_url = \"codecommit::{}://sagemaker-dvc-sample\".format(region)\n",
    "dvc_branch = my_first_trial.trial_name\n",
    "\n",
    "script_processor = FrameworkProcessor(\n",
    "    estimator_cls=SKLearn,\n",
    "    framework_version='0.23-1',\n",
    "    instance_count=1,\n",
    "    instance_type='ml.m5.xlarge',\n",
    "    env={\n",
    "        \"DVC_REPO_URL\": dvc_repo_url,\n",
    "        \"DVC_BRANCH\": dvc_branch,\n",
    "        \"USER\": \"sagemaker\"\n",
    "    },\n",
    "    role=role\n",
    ")\n",
    "\n",
    "experiment_config={\n",
    "    \"ExperimentName\": my_experiment.experiment_name,\n",
    "    \"TrialName\": my_first_trial.trial_name\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b4b7f46",
   "metadata": {},
   "source": [
    "Executing the processing job will take around 3-4 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1a4dfc06",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "script_processor.run(\n",
    "    code='./source_dir/preprocessing-experiment.py',\n",
    "    dependencies=['./source_dir/requirements.txt'],\n",
    "    inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n",
    "    experiment_config=experiment_config,\n",
    "    arguments=[\"--train-test-split-ratio\", \"0.2\"]\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "991638a4",
   "metadata": {},
   "source": [
    "### Create an estimator and fit the model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8071e644",
   "metadata": {},
   "source": [
    "To use DVC integration, pass a `dvc_repo_url` and `dvc_branch` as environmental variables when you create the Estimator object.\n",
    "\n",
    "We will train on the `dvc-trial-single-file` branch first.\n",
    "\n",
    "When doing `dvc pull` in the training script, the following dataset structure will be generated:\n",
    "\n",
    "```\n",
    "dataset\n",
    "    |-- train\n",
    "    |   |-- california_train.csv\n",
    "    |-- test\n",
    "    |   |-- california_test.csv\n",
    "    |-- validation\n",
    "    |   |-- california_validation.csv\n",
    "```\n",
    "\n",
    "#### Metric definition\n",
    "\n",
    "SageMaker emits every log that is going to STDOUT to CloudWatch. In order to capture the metrics we are interested in, we need to specify a metric definition object to define the format of the metrics via regex.\n",
    "By doing so, SageMaker will know how to capture the metrics from the CloudWatch logs of the training job.\n",
    "\n",
    "In our case, we are interested in the median error.\n",
    "```\n",
    "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d464d52b",
   "metadata": {},
   "outputs": [],
   "source": [
    "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n",
    "\n",
    "hyperparameters={ \n",
    "        \"learning_rate\" : 1,\n",
    "        \"depth\": 6\n",
    "    }\n",
    "estimator = SKLearn(\n",
    "    entry_point='train.py',\n",
    "    source_dir='source_dir',\n",
    "    role=role,\n",
    "    metric_definitions=metric_definitions,\n",
    "    hyperparameters=hyperparameters,\n",
    "    instance_count=1,\n",
    "    instance_type='ml.m5.large',\n",
    "    framework_version='0.23-1',\n",
    "    base_job_name='training-with-dvc-data',\n",
    "    environment={\n",
    "        \"DVC_REPO_URL\": dvc_repo_url,\n",
    "        \"DVC_BRANCH\": dvc_branch,\n",
    "        \"USER\": \"sagemaker\"\n",
    "    }\n",
    ")\n",
    "\n",
    "experiment_config={\n",
    "    \"ExperimentName\": my_experiment.experiment_name,\n",
    "    \"TrialName\": my_first_trial.trial_name\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b4e1302",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "estimator.fit(experiment_config=experiment_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1aaa3a46",
   "metadata": {},
   "source": [
    "On the logs above you can see those lines, indicating about the files pulled by dvc:\n",
    "\n",
    "```\n",
    "Running dvc pull command\n",
    "A       train/california_train.csv\n",
    "A       test/california_test.csv\n",
    "A       validation/california_validation.csv\n",
    "3 files added and 3 files fetched\n",
    "Starting the training.\n",
    "Found train files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n",
    "Found validation files: ['/opt/ml/input/data/dataset/train/california_train.csv']\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4de6faf",
   "metadata": {},
   "source": [
    "### Test 2: generate multiple files for training and validation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2cfe7996",
   "metadata": {},
   "outputs": [],
   "source": [
    "second_trial_name = \"dvc-trial-multi-files\"\n",
    "\n",
    "try:\n",
    "    my_second_trial = Trial.load(trial_name=second_trial_name)\n",
    "    print(\"existing trial loaded\")\n",
    "except Exception as ex:\n",
    "    if \"ResourceNotFound\" in str(ex):\n",
    "        my_second_trial = Trial.create(\n",
    "            experiment_name=experiment_name,\n",
    "            trial_name=second_trial_name,\n",
    "        )\n",
    "        print(\"new trial created\")\n",
    "    else:\n",
    "        print(f\"Unexpected {ex}=, {type(ex)}\")\n",
    "        print(\"Dont go forward!\")\n",
    "        raise"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a90c0238",
   "metadata": {},
   "source": [
    "Differently from the first processing script, we now create out of the original dataset multiple files for training and validation and store the `dvc` metadata in a different branch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7eda4b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize 'source_dir/preprocessing-experiment-multifiles.py'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25c05bae",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.processing import FrameworkProcessor, ProcessingInput\n",
    "from sagemaker.sklearn.estimator import SKLearn\n",
    "\n",
    "dvc_branch = my_second_trial.trial_name\n",
    "\n",
    "script_processor = FrameworkProcessor(\n",
    "    estimator_cls=SKLearn,\n",
    "    framework_version='0.23-1',\n",
    "    instance_count=1,\n",
    "    instance_type='ml.m5.xlarge',\n",
    "    env={\n",
    "        \"DVC_REPO_URL\": dvc_repo_url,\n",
    "        \"DVC_BRANCH\": dvc_branch,\n",
    "        \"USER\": \"sagemaker\",\n",
    "    },\n",
    "    role=role\n",
    ")\n",
    "\n",
    "experiment_config={\n",
    "    \"ExperimentName\": my_experiment.experiment_name,\n",
    "    \"TrialName\": my_second_trial.trial_name\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "624a6b65",
   "metadata": {},
   "source": [
    "Executing the processing job will take ~5 minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "099c269e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "script_processor.run(\n",
    "    code='./source_dir/preprocessing-experiment-multifiles.py',\n",
    "    dependencies=['./source_dir/requirements.txt'],\n",
    "    inputs=[ProcessingInput(source=s3_data_path, destination=\"/opt/ml/processing/input\")],\n",
    "    experiment_config=experiment_config,\n",
    "    arguments=[\"--train-test-split-ratio\", \"0.1\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb210f96",
   "metadata": {},
   "source": [
    "We will now train on the `dvc-trial-multi-files` branch.\n",
    "\n",
    "When doing `dvc pull`, this is the dataset structure:\n",
    "\n",
    "```\n",
    "dataset\n",
    "    |-- train\n",
    "    |   |-- california_train_1.csv\n",
    "    |   |-- california_train_2.csv\n",
    "    |   |-- california_train_3.csv\n",
    "    |   |-- california_train_4.csv\n",
    "    |   |-- california_train_5.csv\n",
    "    |-- test\n",
    "    |   |-- california_test.csv\n",
    "    |-- validation\n",
    "    |   |-- california_validation_1.csv\n",
    "    |   |-- california_validation_2.csv\n",
    "    |   |-- california_validation_3.csv\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cadd7b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "metric_definitions = [{'Name': 'median-AE', 'Regex': \"AE-at-50th-percentile: ([0-9.]+).*$\"}]\n",
    "\n",
    "hyperparameters={ \n",
    "        \"learning_rate\" : 1,\n",
    "        \"depth\": 6\n",
    "    }\n",
    "\n",
    "estimator = SKLearn(\n",
    "    entry_point='train.py',\n",
    "    source_dir='source_dir',\n",
    "    role=role,\n",
    "    metric_definitions=metric_definitions,\n",
    "    hyperparameters=hyperparameters,\n",
    "    instance_count=1,\n",
    "    instance_type='ml.m5.large',\n",
    "    framework_version='0.23-1',\n",
    "    base_job_name='training-with-dvc-data',\n",
    "    environment={\n",
    "        \"DVC_REPO_URL\": dvc_repo_url,\n",
    "        \"DVC_BRANCH\": dvc_branch,\n",
    "        \"USER\": \"sagemaker\"\n",
    "    }\n",
    ")\n",
    "\n",
    "experiment_config={\n",
    "    \"ExperimentName\": my_experiment.experiment_name,\n",
    "    \"TrialName\": my_second_trial.trial_name,\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce4aa067",
   "metadata": {},
   "source": [
    "The training job will take around ~5 minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc8e059e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "estimator.fit(experiment_config=experiment_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f34d4aa3",
   "metadata": {},
   "source": [
    "On the logs above you can see those lines, indicating about the files pulled by dvc:\n",
    "\n",
    "```\n",
    "Running dvc pull command\n",
    "A       validation/california_validation_2.csv\n",
    "A       validation/california_validation_1.csv\n",
    "A       validation/california_validation_3.csv\n",
    "A       train/california_train_4.csv\n",
    "A       train/california_train_5.csv\n",
    "A       train/california_train_2.csv\n",
    "A       train/california_train_3.csv\n",
    "A       train/california_train_1.csv\n",
    "A       test/california_test.csv\n",
    "9 files added and 9 files fetched\n",
    "Starting the training.\n",
    "Found train files: ['/opt/ml/input/data/dataset/train/california_train_2.csv', '/opt/ml/input/data/dataset/train/california_train_5.csv', '/opt/ml/input/data/dataset/train/california_train_4.csv', '/opt/ml/input/data/dataset/train/california_train_1.csv', '/opt/ml/input/data/dataset/train/california_train_3.csv']\n",
    "Found validation files: ['/opt/ml/input/data/dataset/validation/california_validation_2.csv', '/opt/ml/input/data/dataset/validation/california_validation_1.csv', '/opt/ml/input/data/dataset/validation/california_validation_3.csv']\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "514a78d9",
   "metadata": {},
   "source": [
    "## Part 3: Hosting your model in SageMaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3330abce",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.serializers import CSVSerializer\n",
    "\n",
    "predictor = estimator.deploy(1, \"ml.t2.medium\", serializer=CSVSerializer())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf141b4f",
   "metadata": {},
   "source": [
    "### Fetch the testing data\n",
    "\n",
    "Read the raw test data stored in S3 via DVC created by the SageMaker Processing Job. We use the `dvc` python API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a5bcecb-a6ca-4d65-a239-be53c32f737a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import dvc.api\n",
    "\n",
    "git_repo_https = f\"https://git-codecommit.{region}.amazonaws.com/v1/repos/sagemaker-dvc-sample\"\n",
    "\n",
    "raw = dvc.api.read(\n",
    "    \"dataset/test/california_test.csv\",\n",
    "    repo=git_repo_https,\n",
    "    rev=dvc_branch\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86d9841f",
   "metadata": {},
   "source": [
    "Prepare the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d931947d",
   "metadata": {},
   "outputs": [],
   "source": [
    "test = pd.read_csv(io.StringIO(raw), sep=\",\", header=None)\n",
    "X_test = test.iloc[:, 1:].values\n",
    "y_test = test.iloc[:, 0:1].values"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5b796e8",
   "metadata": {},
   "source": [
    "## Invoke endpoint with the Python SDK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0bd7491",
   "metadata": {},
   "outputs": [],
   "source": [
    "predicted = predictor.predict(X_test)\n",
    "for i in range(len(predicted)-1):\n",
    "    print(f\"predicted: {predicted[i]}, actual: {y_test[i][0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a976c7bf",
   "metadata": {},
   "source": [
    "### Delete the Endpoint\n",
    "\n",
    "Make sure to delete the endpoint to avoid un-expected costs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f0231db",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor.delete_endpoint()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70a7499e",
   "metadata": {},
   "source": [
    "### (Optional) Delete the Experiment, and all Trails, TrialComponents"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6093da3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "#my_experiment.delete_all(action=\"--force\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "897cc5f2",
   "metadata": {},
   "source": [
    "### (Optional) Delete the AWS CodeCommit repository"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb3f762a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#!aws codecommit delete-repository --repository-name sagemaker-dvc-sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf55fe4a-97d0-41d6-9796-83491cb0c640",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python [conda env: dvc] (conda-env-dvc-kernel/latest)",
   "language": "python",
   "name": "conda-env-dvc-py__SAGEMAKER_INTERNAL__arn:aws:sagemaker:eu-west-1:583558296381:image/conda-env-dvc-kernel"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}