# Train a Scikit-Learn model in SageMaker and track with MLflow

## Intro

The main objective of this notebook is to show that a user without write permissions on the MLflow server is forbidden from creating runs and experiments, registering models, and so on. Nonetheless, with read permissions, the user can still inspect what is recorded on the server.
The SageMaker Studio user profile we will test is `mlflow-reader`.
This can be useful, for example, for users who audit the MLflow server.

## Pre-Requisites

* Successfully deployed the CDK sample in [this repository](https://github.com/aws-samples/sagemaker-studio-mlflow-integration.git).
* Access to the `mlflow-reader` user profile in the created SageMaker Studio domain, using the `Base Python 2.0` image with a `Python 3` kernel.

## Install and/or update required libraries

At the time of writing, we have used the `sagemaker` SDK version 2. The MLflow SDK library used is the one corresponding to our MLflow server version, i.e., `2.3.1`.
We install `mlflow[extras]==2.3.1` to ensure that all required dependencies are installed.

In [None]:
# This cell might take several minutes to execute

!pip install -q --upgrade pip setuptools wheel
!pip install sagemaker sagemaker-experiments scikit-learn==1.0.1 boto3 "mlflow[extras]==2.3.1"

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/using-identity-based-policies.html) for more details on creating these. Note that if the role is not associated with the current notebook instance, or more than one role is required for training and/or hosting, you should replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).
- The tracking URI where the MLflow server runs.

If you examine the SageMaker execution role of `mlflow-reader`, you will note that it has an inline policy attached called `restApiReader`. It grants read access (all `GET` calls, plus the two `POST` search endpoints) on the REST API Gateway shielding MLflow, and it looks like the following:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "execute-api:Invoke",
            "Resource": [
                "arn:aws:execute-api:::/*/GET/*",
                "arn:aws:execute-api:::/*/POST/api/2.0/mlflow/runs/search",
                "arn:aws:execute-api:::/*/POST/api/2.0/mlflow/experiments/search"
            ],
            "Effect": "Allow"
        }
    ]
}
```
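
To make the effect of this policy concrete, here is a minimal sketch that checks a request against the three resource patterns. It is a simplified stand-in for IAM evaluation, not the real thing: it only handles `*` wildcards, ignores explicit denies and any other attached policy, and the helper name `is_allowed` is hypothetical.

```python
from fnmatch import fnmatchcase

# Resource suffixes taken from the restApiReader policy (stage/METHOD/path)
ALLOWED_PATTERNS = [
    "*/GET/*",
    "*/POST/api/2.0/mlflow/runs/search",
    "*/POST/api/2.0/mlflow/experiments/search",
]


def is_allowed(stage: str, method: str, path: str) -> bool:
    """Return True if stage/METHOD/path matches any allowed pattern.

    Assumption: only '*' wildcards matter here; real IAM evaluation also
    considers explicit Deny statements and every other attached policy.
    """
    resource = f"{stage}/{method.upper()}/{path.lstrip('/')}"
    return any(fnmatchcase(resource, pattern) for pattern in ALLOWED_PATTERNS)


print(is_allowed("prod", "GET", "api/2.0/mlflow/experiments/get"))  # read call
print(is_allowed("prod", "POST", "api/2.0/mlflow/runs/search"))     # search call
print(is_allowed("prod", "POST", "api/2.0/mlflow/runs/create"))     # write call
```

Only the last call falls outside the patterns, which is why creating a run is rejected for this user profile.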

In [None]:
import os
import pandas as pd
import json
import random
import boto3

## SageMaker and SKlearn libraries
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import IntegerParameter, HyperparameterTuner

## SKLearn libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

## MLFlow libraries
import mlflow
from mlflow.tracking.client import MlflowClient
import mlflow.sagemaker

ssm = boto3.client('ssm')

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name
account = role.split("::")[1].split(":")[0]
tracking_uri = ssm.get_parameter(Name="mlflow-restApiUrl")['Parameter']['Value']
mlflow_amplify_ui = ssm.get_parameter(Name="mlflow-uiUrl")['Parameter']['Value']
api_gw_id = tracking_uri.split('//')[1].split('.')[0]
experiment_name = 'DEMO-sigv4'
model_name = 'california-housing-model'

NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"

# Read the Studio user profile name; this sample must run as 'mlflow-reader'
user = None
if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        user = json.loads(f.read())['UserProfileName']
    if user != 'mlflow-reader':
        raise ValueError("Sorry, you should use the 'mlflow-reader' user profile to run this sample.")

print("Tracking URI: {}".format(tracking_uri))
print("MLFlow UI (on Amplify): {}".format(mlflow_amplify_ui))
print('SageMaker role: {}'.format(role.split("/")[-1]))
print('bucket: {}'.format(bucket))
print('Account: {}'.format(account))
print("Using AWS Region: {}".format(region))
print("MLflow server URI: {}".format(tracking_uri))
print("user profile: {}".format(user))

### With env variable set: should succeed if the SageMaker execution role has permission to call the MLflow endpoint

In [None]:
os.environ['MLFLOW_TRACKING_AWS_SIGV4'] = "True"
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)

In [None]:
!python3 -m requests_auth_aws_sigv4 https://{api_gw_id}.execute-api.{region}.amazonaws.com/prod/api/2.0/mlflow/experiments/get?experiment_id=0 -v
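
The command above rebuilds the endpoint from `api_gw_id` and `region`. As a rough sketch, both values (and the stage) can also be derived from the tracking URI with `urllib.parse`, assuming the URI has the usual `https://{api-id}.execute-api.{region}.amazonaws.com/{stage}` form. The helper name and the example URI below are made up for illustration.

```python
from urllib.parse import urlparse


def parse_api_gateway_uri(tracking_uri: str) -> dict:
    """Split an API Gateway invoke URL into API id, region and stage."""
    parsed = urlparse(tracking_uri)
    host_parts = parsed.netloc.split(".")
    if len(host_parts) < 4 or host_parts[1] != "execute-api":
        raise ValueError(f"Not an API Gateway invoke URL: {tracking_uri}")
    # The stage is the first path segment, e.g. 'prod'
    stage = parsed.path.strip("/").split("/")[0]
    return {"api_id": host_parts[0], "region": host_parts[2], "stage": stage}


# Hypothetical tracking URI, for illustration only
example = parse_api_gateway_uri(
    "https://abc123defg.execute-api.eu-west-1.amazonaws.com/prod"
)
print(example)
```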

## Data Preparation
We load the dataset from sklearn, then split the data into training and testing datasets, allocating 75% of the data to the training dataset and the remaining 25% to the testing dataset.

The variable `target` is what we intend to estimate. It represents the value of a house, expressed in hundreds of thousands of dollars ($100,000).

In [None]:
# we use the California housing dataset 
data = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test
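
As a quick sanity check on the 75/25 split, the expected row counts can be worked out by hand. The sketch below assumes that the test set size is rounded up and the training set takes the remainder, which is how scikit-learn behaves when only `test_size` is given; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test set is rounded up and the training set gets
    whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# The California housing dataset has 20,640 rows
n_train, n_test = split_sizes(20640, 0.25)
print(f"train rows: {n_train}, test rows: {n_test}")
```

Comparing these numbers with `len(trainX)` and `len(testX)` is a cheap way to confirm that the split behaved as intended.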

Finally, we save a copy of the data locally, as well as in S3. The data stored in S3 will be used by SageMaker to train and test the model.

In [None]:
# save the data locally
trainX.to_csv('california_train.csv', index=False)
testX.to_csv('california_test.csv', index=False)

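import string

# random.choices returns a list, so join it into a string before building the prefix
random_suffix = "".join(random.choices(string.ascii_lowercase, k=8))
prefix = f"mlflow-sample/{random_suffix}/sklearncontainer"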
# save the data to S3.
train_path = sess.upload_data(path='california_train.csv', bucket=bucket, key_prefix=prefix)
test_path = sess.upload_data(path='california_test.csv', bucket=bucket, key_prefix=prefix)

### Setup SageMaker Experiments

SageMaker Experiments is an AWS service for tracking machine learning Experiments. The SageMaker Experiments Python SDK is a high-level interface to this service that helps you track Experiment information using Python.

Conceptually, `SageMaker Experiments` has the following entities:

* Experiment: A collection of related Trials. Add Trials to an Experiment that you wish to compare together.
* Trial: A description of a multi-step machine learning workflow. Each step in the workflow is described by a TrialComponent.
* TrialComponent: A description of a single step in a machine learning workflow.
* Tracker: A Python context-manager for logging information about a single TrialComponent.
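
The containment relationship between these entities can be pictured with a few plain dataclasses. This is only a conceptual sketch: the `*Sketch` classes are hypothetical and are not part of the `smexperiments` SDK, and the training step name is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrialComponentSketch:
    """A single step of a workflow, e.g. one training job."""
    name: str
    parameters: dict = field(default_factory=dict)


@dataclass
class TrialSketch:
    """A multi-step workflow made of trial components."""
    name: str
    components: List[TrialComponentSketch] = field(default_factory=list)


@dataclass
class ExperimentSketch:
    """A collection of related trials that are compared together."""
    name: str
    trials: List[TrialSketch] = field(default_factory=list)


# Experiment and trial names mirror this notebook; the step is illustrative
training_step = TrialComponentSketch("training-job", {"n-estimators": 100})
demo_trial = TrialSketch("trial-v1", [training_step])
demo_experiment = ExperimentSketch("DEMO-sigv4", [demo_trial])

for t in demo_experiment.trials:
    print(demo_experiment.name, "->", t.name, "->", [c.name for c in t.components])
```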

When running jobs (both training and processing) on the SageMaker managed infrastructure, SageMaker automatically creates a TrialComponent. By default, a TrialComponent includes job metadata and lineage information about the input and output data, model artifacts, and metrics (for training jobs). Within your training script, this data can be further enriched.

We want to show how you can easily enable a two-way interaction between MLflow and SageMaker Experiments.

Let us first create an `Experiment` and a `Trial`. These two entities are used to keep your experimentation organized.

In [None]:
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

import time

try:
    my_experiment = Experiment.load(experiment_name=experiment_name)
    print("existing experiment loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_experiment = Experiment.create(
            experiment_name=experiment_name,
            description="MLflow and SageMaker integration"
        )
        print("new experiment created")
    else:
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

trial_name = "trial-v1"

try:
    my_first_trial = Trial.load(trial_name=trial_name)
    print("existing trial loaded")
except Exception as ex:
    if "ResourceNotFound" in str(ex):
        my_first_trial = Trial.create(
            experiment_name=experiment_name,
            trial_name=trial_name,
        )
        print("new trial created")
    else:
        print(f"Unexpected {ex=}, {type(ex)=}")
        print("Don't go forward!")
        raise

create_date = time.strftime("%Y-%m-%d-%H-%M-%S")

# Passed to estimator.fit() so the training job is attached to this trial
experiment_config = {
    "ExperimentName": experiment_name,
    "TrialName": trial_name,
}

## Training

For this example, we use the `SKlearn` framework in script mode with SageMaker. Let us explore in more detail the different components we need to define.

### Training script and SageMaker environment

The `./source_dir/train.py` script provides all the code we need for training a SageMaker model. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the `train` channel.
* `SM_CHANNEL_TEST`: A string representing the path to the directory containing data in the `test` channel.


For more information about training environment variables, please visit 
[SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md).

We want to highlight in particular `SM_TRAINING_ENV` since it provides all the training information as a JSON-encoded dictionary (see [here](https://github.com/aws/sagemaker-training-toolkit/blob/master/ENVIRONMENT_VARIABLES.md#sm_training_env) for more details).
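
As an illustration of how a training script might read these variables, here is a minimal sketch. It is not the code in `train.py`: the helper name, the local fallback paths, and the example job name are assumptions made so that the snippet also runs outside a training container.

```python
import json
import os


def read_training_environment(environ=os.environ) -> dict:
    """Collect the SageMaker-provided paths and job metadata.

    The fallback values are hypothetical and only make the sketch runnable
    outside a training container.
    """
    training_env = json.loads(environ.get("SM_TRAINING_ENV", "{}"))
    return {
        "model_dir": environ.get("SM_MODEL_DIR", "./model"),
        "train_dir": environ.get("SM_CHANNEL_TRAIN", "./data/train"),
        "test_dir": environ.get("SM_CHANNEL_TEST", "./data/test"),
        "job_name": training_env.get("job_name"),
    }


# Simulated container environment (values are illustrative)
fake_environ = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_TEST": "/opt/ml/input/data/test",
    "SM_TRAINING_ENV": json.dumps({"job_name": "mlflow-example-job"}),
}
settings = read_training_environment(fake_environ)
print(settings)
```

Passing the environment as an argument keeps the function easy to test locally with a plain dictionary.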

#### Hyperparameters

We are using the `RandomForestRegressor` algorithm from the SKlearn framework. For the purpose of this exercise, we are only using a subset of the hyperparameters supported by this algorithm, i.e., `n-estimators` and `min-samples-leaf`.

If you would like to know more about the different hyperparameters for this algorithm, please refer to the [`RandomForestRegressor` official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

Furthermore, it is important to note that for the purpose of this exercise, we are completely omitting the feature engineering step, which is an essential step in any machine learning problem.
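
In script mode, hyperparameters reach the training script as command-line arguments. The sketch below shows one way the two hyperparameters above could be parsed with `argparse` and mapped to `RandomForestRegressor` keyword arguments. The function name and the default values are assumptions for illustration, not a copy of `train.py`.

```python
import argparse


def parse_hyperparameters(argv=None) -> argparse.Namespace:
    """Parse the two hyperparameters used in this notebook."""
    parser = argparse.ArgumentParser()
    # argparse exposes '--n-estimators' as args.n_estimators
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)
    # Ignore any other argument the script might receive
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters(["--n-estimators", "100", "--min-samples-leaf", "3"])

# Keyword arguments as RandomForestRegressor expects them (underscores)
rf_kwargs = {
    "n_estimators": args.n_estimators,
    "min_samples_leaf": args.min_samples_leaf,
}
print(rf_kwargs)
```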

#### MLFlow interaction

To interact with the MLflow server, we use the mlflow SDK, which allows us to set the tracking URI and the experiment name. Once this initial setup is completed, we can store the parameters used (`mlflow.log_params(params)`), the model that is generated (`mlflow.sklearn.log_model(model, "model")`), and its associated metrics (`mlflow.log_metric(f'AE-at-{str(q)}th-percentile', np.percentile(a=abs_err, q=q))`).
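
To show what the logged metrics look like, the sketch below rebuilds the metric names and values without an MLflow server. It uses a small pure-Python percentile with linear interpolation as a stand-in for `np.percentile`; the percentile list `(10, 50, 90)` and the sample data are assumptions, since the actual values used in the training script are not shown here.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def absolute_error_metrics(y_true, y_pred, percentiles=(10, 50, 90)):
    """Build the {'AE-at-<q>th-percentile': value} dictionary."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {f"AE-at-{q}th-percentile": percentile(abs_err, q) for q in percentiles}


# Made-up targets and predictions, for illustration only
metrics = absolute_error_metrics(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.5, 2.0, 2.0, 4.25, 7.0],
)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In a training script, each entry of this dictionary could then be passed to `mlflow.log_metric(name, value)`.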

TODO: explain the `mlflow.autolog()` and the System Tags (add link) and how to overwrite them to have the right reference in SageMaker

#### SageMaker

In [None]:
!pygmentize ./source_dir/train.py

### SKlearn container

For this example, we use the `SKlearn` framework in script mode with SageMaker. For more information, please refer to [the official documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html).

Our training script makes use of other third-party libraries, i.e., `mlflow`, which are not installed by default in the `Sklearn` container SageMaker provides. However, this can be easily overcome by supplying a `requirements.txt` file in the `source_dir` folder, which SageMaker will then `pip`-install before executing the training script.

### Metric definition

SageMaker emits every log to CloudWatch. Since we are using script mode, we need to specify a metric definition object that describes, via a regex, the format of the metric we are interested in, so that SageMaker knows how to extract this metric from the CloudWatch logs of the training job.

In our case, our custom metric is as follows:

```python
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]
```
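
The regex can be tried out locally with Python's `re` module before launching a job. The log lines below are made up, and SageMaker's own matching may differ in details, so treat this as a quick check of the pattern rather than a reproduction of the service behavior.

```python
import re

metric_regex = re.compile(r"AE-at-50th-percentile: ([0-9.]+).*$")

# Made-up log lines in the format the training script is expected to print
log_lines = [
    "Training started",
    "AE-at-10th-percentile: 0.041",
    "AE-at-50th-percentile: 0.213",
    "AE-at-90th-percentile: 0.874",
]

extracted = []
for line in log_lines:
    match = metric_regex.search(line)
    if match:
        # The first capture group holds the metric value
        extracted.append(float(match.group(1)))

print(extracted)
```

Only the line for the 50th percentile matches, so a single value is captured for the `median-AE` metric.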

In [None]:
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters = {
    'n-estimators': 100,
    'min-samples-leaf': 3,
    'features': 'MedInc HouseAge AveRooms AveBedrms Population AveOccup',
    'target': 'target'
}

# Environment variables made available inside the training container
environment = {
    "AWS_DEFAULT_REGION": region,
    "MLFLOW_EXPERIMENT_NAME": experiment_name,
    "MLFLOW_TRACKING_URI": tracking_uri,
    "MLFLOW_AMPLIFY_UI_URI": mlflow_amplify_ui,
    "MLFLOW_TRACKING_AWS_SIGV4": "true",
    "MLFLOW_USER": user
}

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type='ml.m5.large',  # to run SageMaker in a managed infrastructure
    framework_version='1.0-1',
    base_job_name='mlflow',
    environment=environment
)

Now we are ready to execute the training as a SageMaker Training job on the SageMaker managed infrastructure. However, unlike the `mlflow-admin` user, the SageMaker execution role of `mlflow-reader` cannot create new runs, so the SageMaker Training job will fail.
The returned error message explains the reason and looks like the following:

```json
{
 'Message': 'User: arn:aws:sts:::assumed-role/SageMakerStudioUserStack-sagemakermlflowreaderrole-1NX32OI2LUKEN/SageMaker is not authorized to perform: execute-api:Invoke on resource: arn:aws:execute-api::********2473:/prod/POST/api/2.0/mlflow/runs/create'
}
```
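
To see exactly which call was denied, the resource ARN in the message can be split into its stage, HTTP method, and path. The sketch below assumes the usual `arn:aws:execute-api:<region>:<account>:<api-id>/<stage>/<METHOD>/<path>` layout; the helper name is hypothetical, and the region and API id are empty because they are masked in the message above.

```python
def parse_execute_api_arn(arn: str) -> dict:
    """Split an execute-api resource ARN into stage, HTTP method and path."""
    # Layout: arn:aws:execute-api:<region>:<account>:<api-id>/<stage>/<METHOD>/<path>
    resource = arn.split(":", 5)[5]
    api_id, stage, method, path = resource.split("/", 3)
    return {"api_id": api_id, "stage": stage, "method": method, "path": "/" + path}


# Resource ARN copied from the error message above (identifiers are masked there)
denied = parse_execute_api_arn(
    "arn:aws:execute-api::********2473:/prod/POST/api/2.0/mlflow/runs/create"
)
print(f"Denied call: {denied['method']} {denied['path']} (stage: {denied['stage']})")
```

The denied call is a `POST` to `runs/create`, which is not covered by the resources listed in the `restApiReader` policy.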

In [None]:
estimator.fit({'train':train_path, 'test': test_path}, experiment_config=experiment_config)

## Read details about executed runs and model registered

Nonetheless, this user is capable of reading details about specific runs, registered models, etc. For example, in this case we want to see the best run for our experiment by looking at the `metrics.accuracy` value.

In [None]:
from mlflow.entities import ViewType

experiment = mlflow.set_experiment(experiment_name)

client = MlflowClient()

# experiment_ids is documented as a list of experiment IDs
run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="",
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.accuracy DESC"],
)[0]

print(run)

### Create registered models and model versions

As expected, it is possible to create neither a new registered model nor a new model version.

In [None]:
# Both calls are expected to fail for the read-only user
try:
    client.create_registered_model(model_name)
except Exception as e:
    print(f"Exception: {str(e)}")

try:
    model_version = client.create_model_version(
        name=model_name,
        source="{}/model".format(run.info.artifact_uri),
        run_id=run.info.run_uuid
    )
    print("model_version: {}".format(model_version))
except Exception as e:
    print(f"Exception: {str(e)}")

## Listing and searching MLflow Models

Nonetheless, it is possible to access the existing registered models and all model versions.
See [official docs](https://mlflow.org/docs/latest/model-registry.html#listing-and-searching-mlflow-models) for more info.

### Search for registered models

In [None]:
from pprint import pprint

client = MlflowClient()
for rm in client.search_registered_models():
    pprint(dict(rm), indent=4)

### Search for model versions

In [None]:
client = MlflowClient()
for mv in client.search_model_versions(f"name='{model_name}'"):
    pprint(dict(mv), indent=4)

### Get the model URI from the MLflow model registry

In [None]:
registered_model = client.search_registered_models(f"name='{model_name}'")[0]
model_version = registered_model.latest_versions[0]

model_uri = model_version.source
print("Model URI: {}".format(model_uri))

# Load model as a Sklearn model.
loaded_model = mlflow.sklearn.load_model(model_uri)

# get a random index to test the prediction from the test data
index = random.randrange(0, len(testX))
print("Random index value: {}".format(index))

# Prepare data on a Pandas DataFrame to make a prediction.
data = testX.drop(['Latitude','Longitude','target'], axis=1).iloc[[index]]

print("#######\nData for prediction \n{}".format(data))

y_hat = loaded_model.predict(data)[0]
y = y_test[index]

print("Predicted value: {}".format(y_hat))
print("Actual value: {}".format(y))

## Star GitHub Repository

If you have found this sample useful, do not hesitate to star the GitHub repository.

