# Creating a custom container and Estimator to run CatBoost on SageMaker

In this notebook, we use the SageMaker Training Toolkit (https://github.com/aws/sagemaker-training-toolkit) to create a SageMaker-compatible Docker image that runs Python scripts using the CatBoost library. We also show how to create a custom SageMaker training `Estimator` from the SageMaker `Framework` class (https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Framework).

CatBoost is a high-performance open source library for gradient boosting on decision trees. You can learn more about it at the following links:
* https://tech.yandex.com/catboost/
* https://catboost.ai/
* https://github.com/catboost/catboost


<br/><br/><br/>

We use the California Housing dataset, present in Scikit-Learn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html. The California Housing dataset was originally published in:

Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297.

**This sample is provided for demonstration purposes. Make sure to conduct appropriate testing if you derive your own code from it!**

## Step 1: Container creation and upload to Amazon ECR

### Creating a SageMaker-compatible CatBoost container
We derive our Dockerfile from the SageMaker Scikit-Learn Dockerfile: https://github.com/aws/sagemaker-scikit-learn-container/blob/master/docker/0.20.0/base/Dockerfile.cpu

In [None]:
%%writefile Dockerfile

FROM ubuntu:16.04

RUN apt-get update && \
    apt-get -y install build-essential libatlas-dev git wget curl nginx jq libatlas3-base

RUN curl -LO http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -bfp /miniconda3 && \
    rm Miniconda3-latest-Linux-x86_64.sh

ENV PATH=/miniconda3/bin:${PATH}
        
RUN apt-get update && apt-get install -y python-pip && pip install sagemaker-training catboost

ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1 PYTHONIOENCODING=UTF-8

### Sending the container to ECR

In [None]:
import boto3
import sagemaker

from sagemaker import get_execution_role

role = get_execution_role()

ecr_namespace = 'sagemaker-training-containers/'
prefix = 'catboost-image'

ecr_repository_name = ecr_namespace + prefix
account_id = role.split(':')[4]
region = boto3.Session().region_name
sess = sagemaker.session.Session()
bucket = sess.default_bucket()

print('Account: {}'.format(account_id))
print('Region: {}'.format(region))
print('Role: {}'.format(role))
print('S3 Bucket: {}'.format(bucket))
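
The `role.split(':')[4]` expression above relies on the ARN layout `arn:partition:service:region:account-id:resource`. The sketch below makes that explicit with a hypothetical helper, `account_id_from_arn`, and a made-up role ARN; it is illustrative only and not part of the SageMaker SDK.

```python
def account_id_from_arn(arn):
    """Return the account ID field of an ARN (hypothetical helper)."""
    parts = arn.split(':')
    # IAM ARNs leave the region field empty, so the account ID
    # is still the fifth colon-separated field (index 4)
    if len(parts) < 6 or parts[0] != 'arn':
        raise ValueError('not a valid ARN: {}'.format(arn))
    return parts[4]

# made-up role ARN, for illustration only
example_role = 'arn:aws:iam::123456789012:role/service-role/ExampleSageMakerRole'
print(account_id_from_arn(example_role))
```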

In [None]:
%%writefile build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

if [[ $REGION =~ ^cn.* ]]
then
    FULLNAME="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com.cn/${REPO_NAME}:latest"
else
    FULLNAME="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO_NAME}:latest"
fi

echo $FULLNAME

docker build -f Dockerfile -t $REPO_NAME .

docker tag $REPO_NAME $FULLNAME

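# Note: `aws ecr get-login` exists only in AWS CLI v1. With AWS CLI v2, use
# `aws ecr get-login-password` piped into `docker login` instead.
$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)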

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $FULLNAME

In [None]:
! bash build_and_push.sh $account_id $region $ecr_repository_name

In [None]:
# China regions (prefixed with 'cn-') use the amazonaws.com.cn domain
if region.startswith('cn'):
    container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com.cn/{2}:latest'.format(account_id, region, ecr_repository_name)
else:
    container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print('ECR container image URI: {}'.format(container_image_uri))
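
The URI construction above can also be wrapped in a small function so that the notebook and any build tooling share one definition. This is a minimal sketch: `ecr_image_uri` is a hypothetical helper, and the account ID used in the example call is made up.

```python
def ecr_image_uri(account_id, region, repository, tag='latest'):
    """Build an ECR image URI (hypothetical helper mirroring the cell above)."""
    # China regions (cn-north-1, cn-northwest-1) use the amazonaws.com.cn domain
    domain = 'amazonaws.com.cn' if region.startswith('cn-') else 'amazonaws.com'
    return '{}.dkr.ecr.{}.{}/{}:{}'.format(account_id, region, domain, repository, tag)

# made-up account ID, for illustration only
print(ecr_image_uri('123456789012', 'eu-west-1',
                    'sagemaker-training-containers/catboost-image'))
```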

The Docker image is now pushed to ECR and ready for use. In the next section, we step into the shoes of an ML practitioner who develops a CatBoost model and trains it remotely on Amazon SageMaker.

## Step 2: Local ML development and remote training job with Amazon SageMaker

We install CatBoost in the notebook environment so that we can develop and test the training script locally.

In [None]:
! pip install catboost

### Data processing
We use pandas to split a small local dataset into a training set and a test set.

We could also design code that loads all the data and runs cross-validation within the script. 
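
As a rough illustration of that alternative, the sketch below generates k-fold train/validation index splits using only the standard library. The `k_fold_indices` helper is hypothetical and is not used elsewhere in this notebook; in practice `sklearn.model_selection.KFold` or `catboost.cv` would be more natural choices.

```python
import random

def k_fold_indices(n_rows, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; each row is used once for validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # spread the remainder so that fold sizes differ by at most one row
    fold_sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

# show the fold sizes for a toy dataset of 10 rows
for fold, (train_idx, valid_idx) in enumerate(k_fold_indices(10, n_splits=3)):
    print(fold, len(train_idx), len(valid_idx))
```

Inside a training script, each pair of index lists would select the rows used to fit and to validate one model, and the per-fold metrics would then be averaged.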

In [None]:
import os

import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# we use the California housing dataset
data = fetch_california_housing()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

In [None]:
local_train = 'california_housing_train.csv'
local_test = 'california_housing_test.csv'

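# index=False avoids writing the DataFrame index as an extra unnamed column
trainX.to_csv(local_train, index=False)
testX.to_csv(local_test, index=False)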

In [None]:
# send data to S3. SageMaker will take training data from S3
train_location = sess.upload_data(
    path=local_train, 
    bucket=bucket,
    key_prefix='catboost')

test_location = sess.upload_data(
    path=local_test, 
    bucket=bucket,
    key_prefix='catboost')

### Developing a local training script

In [None]:
%%writefile catboost_training.py

import argparse
import logging
import os

from catboost import CatBoostRegressor
import numpy as np
import pandas as pd


if __name__ == '__main__':

    print('extracting arguments')
    parser = argparse.ArgumentParser()
    
    parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-file', type=str, default='california_housing_train.csv')
    parser.add_argument('--test-file', type=str, default='california_housing_test.csv')
    parser.add_argument('--model-name', type=str, default='catboost_model.dump')
    parser.add_argument('--features', type=str)  # in this script we ask user to explicitly name features
    parser.add_argument('--target', type=str) # in this script we ask user to explicitly name the target
    

    args, _ = parser.parse_known_args()

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    
    logging.info('reading data')
    train_df = pd.read_csv(os.path.join(args.train, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test, args.test_file))

    logging.info('building training and testing datasets')
    X_train = train_df[args.features.split()]
    X_test = test_df[args.features.split()]
    y_train = train_df[args.target]
    y_test = test_df[args.target]
        
    # define and train model
    model = CatBoostRegressor()
    
    model.fit(X_train, y_train, eval_set=(X_test, y_test), logging_level='Silent') 
    
    # compute absolute errors on the test set
    logging.info('validating model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    # log a few percentiles of the absolute error
    for q in [10, 50, 90]:
        logging.info('AE-at-' + str(q) + 'th-percentile: '
                     + str(np.percentile(a=abs_err, q=q)))
    
    # persist model
    path = os.path.join(args.model_dir, args.model_name)
    logging.info('saving to {}'.format(path))
    model.save_model(path)

### Testing our script locally

In [None]:
features = 'MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude'

In [None]:
# local test; the quotes keep the space-separated feature list as a single argument
! python catboost_training.py \
    --train ./ \
    --test ./ \
    --model-dir ./ \
    --features "$features" \
    --target target
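
The script declares `--features` as a single string and splits it on whitespace, so the whole feature list has to arrive as one command-line argument. The sketch below reproduces only the argument-parsing part of the script to show what `parse_known_args` does when the shell splits the list into separate arguments; the `parse` helper is hypothetical.

```python
import argparse

def parse(argv):
    """Parse argv the way catboost_training.py does, keeping unknown arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--features', type=str)
    parser.add_argument('--target', type=str)
    return parser.parse_known_args(argv)

# feature list passed as one argument (quoted on the command line)
args, unknown = parse(['--features', 'MedInc HouseAge AveRooms', '--target', 'target'])
print(args.features.split(), unknown)

# feature list split by the shell into separate arguments (unquoted)
args, unknown = parse(['--features', 'MedInc', 'HouseAge', 'AveRooms', '--target', 'target'])
print(args.features.split(), unknown)
```

In the second call only `MedInc` reaches `args.features`; the remaining names end up in the unknown arguments and are silently ignored, which is why quoting matters when running the script from a shell.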

## Remote training in SageMaker

### Option 1: Launch a SageMaker training job from code uploaded to S3

With this option, we first need to upload the code to S3. A build system could also do this automatically.

In [None]:
import tarfile

In [None]:
# first compress the code and send to S3
program = 'catboost_training.py'
source = 'source.tar.gz'
project = 'catboost'

# the context manager closes the archive even if adding the file fails
with tarfile.open(source, 'w:gz') as tar:
    tar.add(program)

submit_dir = sess.upload_data(
    path=source, 
    bucket=bucket,
    key_prefix=project)

print(submit_dir)
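
Before uploading, it can be worth checking that the archive holds the entry point at its root, since the training toolkit looks for the `sagemaker_program` script inside the extracted archive. The following is a minimal sketch with a hypothetical `list_archive` helper and a throwaway file standing in for the real script:

```python
import os
import tarfile
import tempfile

def list_archive(path):
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(path, 'r:gz') as tar:
        return tar.getnames()

# build a throwaway archive in a temporary directory (stand-in for source.tar.gz)
workdir = tempfile.mkdtemp()
script_path = os.path.join(workdir, 'demo_training.py')
with open(script_path, 'w') as f:
    f.write("print('hello')\n")

archive_path = os.path.join(workdir, 'demo_source.tar.gz')
with tarfile.open(archive_path, 'w:gz') as tar:
    # arcname stores the script at the archive root, without local directories
    tar.add(script_path, arcname='demo_training.py')

print(list_archive(archive_path))
```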

We then launch a training job with the `Estimator` class:

In [None]:
from sagemaker.estimator import Estimator

output_path = 's3://' + bucket + '/' + project + '/' + 'training_jobs'

estimator = Estimator(image_uri=container_image_uri,
                      role=role,
                      instance_count=1,
                      instance_type='ml.m5.xlarge',
                      output_path=output_path,
                      hyperparameters={'sagemaker_program': program,
                                       'sagemaker_submit_directory': submit_dir,
                                       'features': features,
                                       'target': 'target'})

In [None]:
estimator.fit({'train': train_location, 'test': test_location}, logs=True)

### Option 2: Launch a SageMaker training job using a custom Estimator and a local training script

To make it even faster to iterate between local development and remote training in SageMaker, we can create a custom `Estimator` by extending the [Framework](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.Framework) class from the SageMaker SDK. This will perform the code compression and S3 upload for us:



In [None]:
from sagemaker.estimator import Framework

class CatBoostEstimator(Framework):
    def __init__(
        self,
        entry_point,
        source_dir=None,
        hyperparameters=None,
        py_version="py3",
        framework_version=None,
        image_uri=None,
        distributions=None,
        **kwargs):
        
        super(CatBoostEstimator, self).__init__(
            entry_point, source_dir, hyperparameters, image_uri=image_uri, **kwargs)
        
        self.framework_version = framework_version
        self.py_version = py_version
    
    
    def _configure_distribution(self, distributions):
        return
    
    def create_model(
        self,
        model_server_workers=None,
        role=None,
        vpc_config_override=None,
        entry_point=None,
        source_dir=None,
        dependencies=None,
        image_uri=None,
        **kwargs):
        
        return None

In [None]:
catboost = CatBoostEstimator(
    image_uri=container_image_uri,
    role=role,
    entry_point='catboost_training.py',
    output_path=output_path,
    instance_count=1, 
    instance_type='ml.m5.xlarge',
    hyperparameters={'features': features,
                     'target': 'target'})

In [None]:
catboost.fit({'train': train_location, 'test': test_location}, logs=True)

We can now use CatBoost together with SageMaker features such as:
 1. Bayesian tuning of hyperparameters
 1. Remote persistence of metadata, hyperparameters, model artifacts, metrics and logs
 1. Hardware scaling and GPU use
 1. Connection to large S3 data sources
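
For hyperparameter tuning, SageMaker reads the objective metric from the training logs using regular expressions supplied as metric definitions. The sketch below shows definitions matching the percentile lines logged by `catboost_training.py`. The metric names and the sample log line are assumptions, and the script would also need to accept the tuned hyperparameters as arguments before a tuner could vary them.

```python
import re

# metric definitions in the {'Name': ..., 'Regex': ...} form accepted by
# SageMaker estimators (metric_definitions=...); the names are illustrative
metric_definitions = [
    {'Name': 'ae-p{}'.format(q),
     'Regex': 'AE-at-{}th-percentile: ([0-9.]+)'.format(q)}
    for q in [10, 50, 90]
]

def extract_metrics(log_line, definitions):
    """Apply each regex to a log line and return the captured values."""
    found = {}
    for definition in definitions:
        match = re.search(definition['Regex'], log_line)
        if match:
            found[definition['Name']] = float(match.group(1))
    return found

# made-up log line in the format produced by the training script
sample = 'INFO:root:AE-at-50th-percentile: 0.2071'
print(extract_metrics(sample, metric_definitions))
```

Definitions like these could be passed as `metric_definitions` when constructing the estimator and referenced by name as the objective metric of a `HyperparameterTuner`.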