## Train and deploy a model on Kubeflow in Notebooks

This example comes from an upstream Fairing [example](https://github.com/kubeflow/fairing/tree/master/examples/prediction).


See the Kaggle competition [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
for details about the ML problem we want to solve.

This notebook introduces you to using Kubeflow Fairing to train and deploy a model to Kubeflow on Amazon EKS. It demonstrates how to:

* Train an XGBoost model in a local notebook,
* Use Kubeflow Fairing to train an XGBoost model remotely on Kubeflow,
* Use Kubeflow Fairing to deploy a trained model to Kubeflow,
* Call the deployed endpoint for predictions.


### Install Python dependencies

In [None]:
%%writefile requirements.txt
pandas
joblib
numpy
xgboost
scikit-learn>=0.21.0
seldon-core
tornado>=6.0.3

In [None]:
!pip install -r requirements.txt

### Develop your model

In [None]:
import argparse
import logging
import joblib
import sys
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor

logging.basicConfig(format='%(message)s')
logging.getLogger().setLevel(logging.INFO)

In [None]:
def read_input(file_name, test_size=0.25):
    """Read input data and split it into train and test."""
    data = pd.read_csv(file_name)
    data.dropna(axis=0, subset=['SalePrice'], inplace=True)

    y = data.SalePrice
    X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

    # shuffle=False keeps row order: the last `test_size` fraction is the test set.
    train_X, test_X, train_y, test_y = train_test_split(X.values,
                                                        y.values,
                                                        test_size=test_size,
                                                        shuffle=False)

    # Fit the imputer on the training rows only, then apply it to both splits.
    imputer = SimpleImputer()
    train_X = imputer.fit_transform(train_X)
    test_X = imputer.transform(test_X)

    return (train_X, train_y), (test_X, test_y)

def train_model(train_X,
                train_y,
                test_X,
                test_y,
                n_estimators,
                learning_rate):
    """Train the model using XGBRegressor."""
    model = XGBRegressor(n_estimators=n_estimators, learning_rate=learning_rate)

    # Note: newer XGBoost releases may expect `early_stopping_rounds` in the
    # XGBRegressor constructor rather than in fit().
    model.fit(train_X,
              train_y,
              early_stopping_rounds=40,
              eval_set=[(test_X, test_y)])

    print("Best RMSE on eval: %.2f with %d rounds" %
          (model.best_score,
           model.best_iteration+1))
    return model

def eval_model(model, test_X, test_y):
    """Evaluate the model performance."""
    predictions = model.predict(test_X)
    logging.info("mean_absolute_error=%.2f", mean_absolute_error(predictions, test_y))

def save_model(model, model_file):
    """Save XGBoost model for serving."""
    joblib.dump(model, model_file)
    logging.info("Model export success: %s", model_file)
 
 
class HousingServe(object):

    def __init__(self):
        self.train_input = "ames_dataset/train.csv"
        self.n_estimators = 50
        self.learning_rate = 0.1
        self.model_file = "trained_ames_model.dat"
        self.model = None

    def train(self):
        """Train, evaluate, and save the model."""
        (train_X, train_y), (test_X, test_y) = read_input(self.train_input)
        model = train_model(train_X,
                            train_y,
                            test_X,
                            test_y,
                            self.n_estimators,
                            self.learning_rate)

        eval_model(model, test_X, test_y)
        save_model(model, self.model_file)

    def predict(self, X, feature_names=None):
        """Predict using the model for given ndarray."""
        # Load the saved model lazily on the first call.
        if not self.model:
            self.model = joblib.load(self.model_file)
        # Do any preprocessing
        # Pass X positionally: the keyword name differs across XGBoost versions.
        prediction = self.model.predict(X)
        # Do any postprocessing
        return prediction
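
`read_input` fits `SimpleImputer` on the training rows and only applies it to the test rows, so the test set never influences the fill values. As a rough illustration of that idea in plain Python (the helpers `column_means` and `impute` are hypothetical and only mimic the default mean strategy, assuming every column has at least one non-missing training value):

```python
import math

def column_means(rows):
    """Mean of each column, ignoring NaN entries."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        values = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(values) / len(values))
    return means

def impute(rows, means):
    """Replace NaN entries with the given per-column means."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]

nan = float("nan")
train = [[1.0, 10.0], [3.0, nan], [nan, 30.0]]
test = [[nan, 5.0]]

means = column_means(train)      # learned from the training rows only
train_filled = impute(train, means)
test_filled = impute(test, means)  # reuses the training means
print(means, test_filled)
```

The test row is filled with the training mean rather than a statistic of its own split, which is the same fit-then-transform order used above.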

### Train an XGBoost model in a notebook

In [None]:
model = HousingServe()
model.train()

### Create an S3 bucket to store pipeline data
> Note: Make sure the `HASH` variable is set to a random value before running the next cell; S3 bucket names must be globally unique.

> Note: If you use `us-east-1`, run `!aws s3 mb s3://{HASH}'-kubeflow-pipeline-data' --region $AWS_REGION --endpoint-url https://s3.us-east-1.amazonaws.com` instead.

In [None]:
import random, string
HASH = ''.join([random.choice(string.ascii_lowercase) for n in range(16)] + [random.choice(string.digits) for n in range(16)])
AWS_REGION = 'us-west-2'
!aws s3 mb s3://{HASH}'-kubeflow-pipeline-data' --region $AWS_REGION
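
Bucket creation fails if the name breaks the S3 naming rules. As a minimal sketch, the hypothetical helper `is_valid_bucket_name` below checks only the basic rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; a letter or digit at both ends) and `make_bucket_name` repeats the name construction used in the cell above. It is not a complete validator.

```python
import random
import re
import string

# Basic S3 naming rules only (hypothetical helper, not a full validator).
_BUCKET_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")

def is_valid_bucket_name(name):
    """Return True if `name` passes the basic length and character checks."""
    return _BUCKET_PATTERN.fullmatch(name) is not None

def make_bucket_name(suffix="kubeflow-pipeline-data"):
    """Build '<16 letters><16 digits>-<suffix>' like the cell above."""
    prefix = "".join(random.choice(string.ascii_lowercase) for _ in range(16))
    prefix += "".join(random.choice(string.digits) for _ in range(16))
    return f"{prefix}-{suffix}"

bucket = make_bucket_name()
print(bucket, is_valid_bucket_name(bucket))
```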

### Set up Kubeflow Fairing for training and predictions

> Note: Remember to change `kubeflow-pipeline-data` to the name of your own S3 bucket.

In [None]:
from kubeflow import fairing
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowAWSBackend


FAIRING_BACKEND = 'KubeflowAWSBackend'

AWS_ACCOUNT_ID = fairing.cloud.aws.guess_account_id()
AWS_REGION = 'us-west-2'
DOCKER_REGISTRY = '{}.dkr.ecr.{}.amazonaws.com'.format(AWS_ACCOUNT_ID, AWS_REGION)
S3_BUCKET = f'{HASH}-kubeflow-pipeline-data'
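
The registry string above follows the ECR hostname pattern `<account-id>.dkr.ecr.<region>.amazonaws.com`. A small sketch of the same construction with a sanity check on the account ID (AWS account IDs are 12 digits); `ecr_registry` is a hypothetical helper and the account ID shown is a placeholder:

```python
def ecr_registry(account_id, region):
    """Build the ECR registry hostname for an account and region."""
    account_id = str(account_id)
    if not (account_id.isascii() and account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account_id!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"

# Placeholder account ID for illustration only.
print(ecr_registry("123456789012", "us-west-2"))
```

Failing early on a malformed account ID is easier to diagnose than a failed image push later in the workflow.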

In [None]:
import importlib

if FAIRING_BACKEND == 'KubeflowAWSBackend':
    from kubeflow.fairing.builders.cluster.s3_context import S3ContextSource
    BuildContext = S3ContextSource(
        aws_account=AWS_ACCOUNT_ID, region=AWS_REGION,
        bucket_name=S3_BUCKET
    )

BackendClass = getattr(importlib.import_module('kubeflow.fairing.backends'), FAIRING_BACKEND)
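
The last line of the cell above looks up the backend class by name at runtime using `importlib.import_module` and `getattr`. The sketch below shows the same pattern against a standard-library module so it runs without Fairing installed; `load_class` is a hypothetical helper, not part of Fairing.

```python
import importlib

def load_class(module_name, class_name):
    """Import `module_name` and return the attribute named `class_name`."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as err:
        raise ImportError(
            f"module {module_name!r} has no attribute {class_name!r}"
        ) from err

# Resolve a class from its name, then use it like any other class.
OrderedDictClass = load_class("collections", "OrderedDict")
instance = OrderedDictClass(a=1)
print(type(instance).__name__)
```

Keeping the backend name in a string makes it possible to switch backends by changing one variable, at the cost of errors surfacing only at runtime.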

### Train an XGBoost model remotely on Kubeflow
Import `TrainJob` and pass it the configured backend class. Kubeflow Fairing packages the `HousingServe` class, the training data, and the training job's software prerequisites as a Docker image, then deploys and runs the training job on Kubeflow.


In [None]:
from kubeflow.fairing import TrainJob
train_job = TrainJob(HousingServe,
                     input_files=['ames_dataset/train.csv', 'requirements.txt'],
                     docker_registry=DOCKER_REGISTRY,
                     backend=BackendClass(build_context_source=BuildContext))
train_job.submit()

### Deploy the trained model to Kubeflow for predictions

Import `PredictionEndpoint` and pass it the configured backend class. Kubeflow Fairing packages the `HousingServe` class, the trained model, and the prediction endpoint's software prerequisites as a Docker image, then deploys and runs the prediction endpoint on Kubeflow.

In [None]:
from kubeflow.fairing import PredictionEndpoint
endpoint = PredictionEndpoint(HousingServe,
                              input_files=['trained_ames_model.dat', 'requirements.txt'],
                              docker_registry=DOCKER_REGISTRY,
                              service_type='ClusterIP',
                              backend=BackendClass(build_context_source=BuildContext))
endpoint.create()

### Call the prediction endpoint
Create a test dataset, then call the endpoint on Kubeflow for predictions.

In [1]:
# Wait a while for the service to become ready, then replace `<endpoint>` with the hostname printed by the last step.
# For example: !nc -vz fairing-service-srwh2.anonymous.svc.cluster.local 5000

!nc -vz <endpoint> 5000
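
If you would rather poll from Python than rerun `nc` by hand, the following sketch retries a TCP connection until the port accepts it or a timeout expires. `wait_for_port` is a hypothetical helper, and the timeout and interval defaults are arbitrary assumptions; it only checks that the port is open, not that the model has finished loading.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=2.0):
    """Return True once a TCP connection succeeds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)

# Example call (replace the hostname with your own service):
# wait_for_port("fairing-service-srwh2.anonymous.svc.cluster.local", 5000)
```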

In [None]:
# Get sample data and query the endpoint
(train_X, train_y), (test_X, test_y) = read_input("ames_dataset/train.csv")

# Workaround for https://github.com/kubeflow/fairing/pull/376:
# set the URL explicitly, including `:5000/predict`.
# Replace the service name below with the one created in your cluster.
endpoint.url='http://fairing-service-n8qv2.anonymous.svc.cluster.local:5000/predict'

endpoint.predict_nparray(test_X)
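
When debugging the endpoint outside Fairing, it can help to see what a request body might look like. The sketch below assumes the service accepts the Seldon Core REST payload shape `{"data": {"ndarray": ...}}`; `build_seldon_payload` is a hypothetical helper, and the exact body that `predict_nparray` sends may differ.

```python
import json

def build_seldon_payload(rows, feature_names=None):
    """Serialize rows of numeric features as a Seldon-style JSON payload (assumed format)."""
    data = {"ndarray": [[float(v) for v in row] for row in rows]}
    if feature_names is not None:
        data["names"] = list(feature_names)
    return json.dumps({"data": data})

# Two rows with two features each; real rows would come from test_X.tolist().
payload = build_seldon_payload([[1, 2], [3, 4]], feature_names=["f0", "f1"])
print(payload)
```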

### Clean up the prediction endpoint
Delete the prediction endpoint created by this notebook.

In [None]:
endpoint.delete()

### Clean up the S3 bucket and ECR repository
Delete the S3 bucket and ECR repository that were created for this exercise.

In [None]:
!aws s3 rb s3://$S3_BUCKET --force
!aws ecr delete-repository --repository-name fairing-job --region $AWS_REGION --force