# Train a Scikit-Learn model in SageMaker and track with MLflow

## Intro

The main objective of this notebook is to show how to configure fine-grained permissions for a user in MLflow. For example, we want this user to be able to read experiments, runs, registered models, and model versions, and we further want to grant permissions to create new registered models and new model versions, but we don't want the user to create new runs.
The SageMaker Studio user profile we will test is `mlflow-model-approver`.

## Pre-Requisites

* Successfully deployed the CDK sample in [this repository](https://github.com/aws-samples/sagemaker-studio-mlflow-integration.git).
* Access to the `mlflow-model-approver` user profile in the created SageMaker Studio domain and use the `Base Python 2.0` image on a `Python 3` kernel.

## Install and/or update required libraries

At the time of writing, we have used the `sagemaker` SDK version 2. The MLflow SDK library used is the one corresponding to our MLflow server version, i.e., `2.3.1`.
We install `mlflow[extras]==2.3.1` to ensure that all required dependencies are installed.

In [None]:
#This cell might take several minutes to execute

!pip install -q --upgrade pip setuptools wheel
!pip install sagemaker sagemaker-experiments scikit-learn==1.0.1 boto3 mlflow[extras]==2.3.1

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/using-identity-based-policies.html) for more details on creating these. Note that if a role is not associated with the current notebook instance, or more than one role is required for training and/or hosting, you should replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).
- The tracking URI where the MLFlow server runs

If you examine the SageMaker execution role of the `mlflow-model-approver` user profile, you will note that it has an inline policy attached called `restApiModelApprover`. It grants read (`GET`) permissions on all resources of the REST API Gateway shielding MLflow, plus `POST` permissions on the run and experiment search endpoints and on the model version and registered model endpoints. It looks like the following:

```json
{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Action": "execute-api:Invoke",
 "Resource": [
 "arn:aws:execute-api:::/*/GET/*",
 "arn:aws:execute-api:::/*/POST/api/2.0/mlflow/runs/search",
 "arn:aws:execute-api:::/*/POST/api/2.0/mlflow/experiments/search",
 "arn:aws:execute-api:::/*/POST/api/2.0/mlflow/model-versions/*",
 "arn:aws:execute-api:::/*/POST/api/2.0/mlflow/registered-models/*"
 ],
 "Effect": "Allow"
 }
 ]
}
```
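
To make the effect of this policy more tangible, here is a minimal sketch that checks a request against the method/path part of the `Resource` patterns above. It is only an approximation: `is_allowed` is a hypothetical helper, the stage prefix of the ARNs is dropped, and Python's `fnmatchcase` stands in for IAM's own wildcard matching, which is evaluated by AWS and not by this code.

```python
from fnmatch import fnmatchcase

# Method/path suffixes of the Resource ARNs in the policy above
# (the "arn:aws:execute-api:::/*/" prefix is dropped for brevity).
ALLOWED_PATTERNS = [
    "GET/*",
    "POST/api/2.0/mlflow/runs/search",
    "POST/api/2.0/mlflow/experiments/search",
    "POST/api/2.0/mlflow/model-versions/*",
    "POST/api/2.0/mlflow/registered-models/*",
]


def is_allowed(method: str, path: str) -> bool:
    """Hypothetical helper: True if METHOD/path matches any allowed pattern."""
    request = f"{method.upper()}/{path.lstrip('/')}"
    return any(fnmatchcase(request, pattern) for pattern in ALLOWED_PATTERNS)


# Creating a run is not covered by any pattern
print(is_allowed("POST", "/api/2.0/mlflow/runs/create"))
# Creating a registered model is covered by the registered-models wildcard
print(is_allowed("POST", "/api/2.0/mlflow/registered-models/create"))
# Any GET request is covered
print(is_allowed("GET", "/api/2.0/mlflow/experiments/get"))
```

Under these assumptions, only the first request is rejected, which matches the intent stated in the intro: read everything, manage registered models and model versions, but do not create runs.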

In [None]:
import os
import pandas as pd
import json
import random
import boto3

## SageMaker and SKlearn libraries
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import IntegerParameter, HyperparameterTuner

## SKLearn libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

## MLFlow libraries
import mlflow
from mlflow.tracking.client import MlflowClient
import mlflow.sagemaker

ssm = boto3.client('ssm')

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name
account = role.split("::")[1].split(":")[0]
tracking_uri = ssm.get_parameter(Name="mlflow-restApiUrl")['Parameter']['Value']
mlflow_amplify_ui = ssm.get_parameter(Name="mlflow-uiUrl")['Parameter']['Value']
api_gw_id = tracking_uri.split('//')[1].split('.')[0]
experiment_name = 'DEMO-sigv4'
model_name = 'california-housing-model'

NOTEBOOK_METADATA_FILE = "/opt/ml/metadata/resource-metadata.json"

if os.path.exists(NOTEBOOK_METADATA_FILE):
    with open(NOTEBOOK_METADATA_FILE, "rb") as f:
        user = json.loads(f.read())['UserProfileName']
    if user != 'mlflow-model-approver':
        raise ValueError("Sorry, you should use the 'mlflow-model-approver' user profile to run this sample.")

print("Tracking URI: {}".format(tracking_uri))
print("MLFlow UI (on Amplify): {}".format(mlflow_amplify_ui))
print('SageMaker role: {}'.format(role.split("/")[-1]))
print('bucket: {}'.format(bucket))
print('Account: {}'.format(account))
print("Using AWS Region: {}".format(region))
print("MLflow server URI: {}".format(tracking_uri))
print("user profile: {}".format(user))

### With env variable set: should succeed if the SageMaker execution role has permission to call the MLflow endpoint

In [None]:
os.environ['MLFLOW_TRACKING_AWS_SIGV4'] = "True"
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)

In [None]:
!python3 -m requests_auth_aws_sigv4 https://{api_gw_id}.execute-api.{region}.amazonaws.com/prod/api/2.0/mlflow/experiments/get?experiment_id=0 -v
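
The URL used in the check above is assembled from the API Gateway ID and the region. As a rough sketch, the same derivation can be done with `urllib.parse`. The tracking URI below is a made-up placeholder (the ID `abc123defg`, the region `eu-west-1`, and the assumption that the URI follows the usual `execute-api` host pattern are all illustrative); in the notebook the real value comes from SSM.

```python
from urllib.parse import urlencode, urlparse

# Hypothetical tracking URI; the notebook reads the real one from SSM
example_tracking_uri = "https://abc123defg.execute-api.eu-west-1.amazonaws.com/prod"

# The host has the form <api-id>.execute-api.<region>.amazonaws.com
host = urlparse(example_tracking_uri).netloc
api_id, _, rest = host.partition(".execute-api.")
example_region = rest.split(".")[0]

# Rebuild the URL of the read-only endpoint used for the permission check
query = urlencode({"experiment_id": 0})
check_url = (
    f"https://{api_id}.execute-api.{example_region}.amazonaws.com"
    f"/prod/api/2.0/mlflow/experiments/get?{query}"
)
print(check_url)
```

Because the endpoint is a `GET`, the `mlflow-model-approver` role should be allowed to call it once the request is signed with SigV4.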

## Data Preparation
We load the dataset from sklearn, then split the data into training and testing datasets, where we allocate 75% of the data to the training dataset and the remaining 25% to the testing dataset.

The variable `target` is what we intend to estimate. It represents the value of a house, expressed in hundreds of thousands of dollars ($100,000).

In [None]:
# we use the California housing dataset 
data = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test
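
As a quick sanity check on the split sizes, the arithmetic can be sketched without loading any data. This assumes the California housing dataset has 20,640 rows and that scikit-learn rounds the test share up when `test_size` is a fraction.

```python
import math

n_samples = 20640  # rows in the California housing dataset
test_size = 0.25   # same fraction as in the cell above

# The test share is rounded up; the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 15480 5160
```

Running `len(trainX)` and `len(testX)` in the notebook is an easy way to confirm these numbers.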

Finally, we save a copy of the data locally, as well as in S3. The data stored in S3 will be used by SageMaker to train and test the model.

In [None]:
# save the data locally
trainX.to_csv('california_train.csv', index=False)
testX.to_csv('california_test.csv', index=False)

import random
import string

# random.choices returns a list, so join it into an 8-character string
prefix = f"mlflow-sample/{''.join(random.choices(string.ascii_lowercase, k=8))}/sklearncontainer"
# save the data to S3.
train_path = sess.upload_data(path='california_train.csv', bucket=bucket, key_prefix=prefix)
test_path = sess.upload_data(path='california_test.csv', bucket=bucket, key_prefix=prefix)

## Training

In [None]:
!pygmentize ./source_dir/train.py

In [None]:
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters = {
 'n-estimators': 100,
 'min-samples-leaf': 3,
 'features': 'MedInc HouseAge AveRooms AveBedrms Population AveOccup',
 'target': 'target'
}

environment={
 "AWS_DEFAULT_REGION": region,
 "MLFLOW_EXPERIMENT_NAME": experiment_name,
 "MLFLOW_TRACKING_URI": tracking_uri,
 "MLFLOW_AMPLIFY_UI_URI": mlflow_amplify_ui,
 "MLFLOW_TRACKING_AWS_SIGV4": "true",
 "MLFLOW_USER": user
 }

estimator = SKLearn(
 entry_point='train.py',
 source_dir='source_dir',
 role=role,
 metric_definitions=metric_definitions,
 hyperparameters=hyperparameters,
 instance_count=1,
 instance_type='ml.m5.large', # to run SageMaker in a managed infrastructure
 framework_version='1.0-1',
 base_job_name='mlflow',
 environment=environment
)
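
SageMaker applies each `Regex` in `metric_definitions` to the training job logs and records the captured group as the metric value. A minimal sketch of that extraction, using a made-up log line (the exact line printed by `train.py` is an assumption here):

```python
import re

# Same pattern as the 'median-AE' entry in metric_definitions
regex = "AE-at-50th-percentile: ([0-9.]+).*$"

# Hypothetical log line; the real one is printed by the training script
sample_log_line = "AE-at-50th-percentile: 0.253"

match = re.search(regex, sample_log_line)
value = float(match.group(1)) if match else None
print(value)  # 0.253
```

If the pattern does not match any log line, no value is captured for `median-AE`, so it is worth testing the regex against a real log line before launching a job.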

Now we are ready to execute the training as a SageMaker Training job on the SageMaker managed infrastructure. However, unlike the `mlflow-admin` user, the SageMaker execution role of `mlflow-model-approver` cannot create new runs, so the SageMaker Training job will fail.
The error message returned explains the reason, and it looks like the following:

```json
{
 'Message': 'User: arn:aws:sts:::assumed-role/SageMakerStudioUserStack-sagemakermlflowmodelaprov-XTK0XB757WVO/SageMaker is not authorized to perform: execute-api:Invoke on resource: arn:aws:execute-api::********2473:/prod/POST/api/2.0/mlflow/runs/create'
}
```

In [None]:
try:
    # Expected to fail: this role is not allowed to create MLflow runs
    estimator.fit({'train': train_path, 'test': test_path})
except Exception as e:
    print(e)

## Read details about executed runs and model registered

Nonetheless, this user is capable of reading details about specific runs, registered models, etc. For example, in this case we want to see the best run for our experiment by looking at the `metrics.accuracy` value.

In [None]:
from mlflow.entities import ViewType

experiment = mlflow.set_experiment(experiment_name)

client = MlflowClient()

run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="",
    run_view_type=ViewType.ACTIVE_ONLY,
    max_results=1,
    order_by=["metrics.accuracy DESC"],
)[0]

print(run)

### Create registered models and model versions

As expected from the policy shown earlier, this user is allowed to create both a new registered model and a new model version.

In [None]:
try:
    client.create_registered_model(model_name)
except mlflow.MlflowException as e:
    # When re-running the notebook, the registered model may already exist
    if e.error_code == 'RESOURCE_ALREADY_EXISTS':
        print('Resource already exists, all good')
    else:
        raise

try:
    model_version = client.create_model_version(
        name=model_name,
        source="{}/model".format(run.info.artifact_uri),
        run_id=run.info.run_id
    )
    print("model_version: {}".format(model_version))
except Exception as e:
    print(f"Exception: {str(e)}")

### Change the model stage

Promoting models to certain stages is a pivotal operation in ML workflows, as you can specify which models eventually will make it to Production.

In [None]:
try:
    model_version = client.transition_model_version_stage(model_version.name, model_version.version, 'Staging')
    print(model_version)
except Exception as e:
    print(f"Exception: {str(e)}")

## Star GitHub Repository

If you have found this sample useful, do not hesitate to star the GitHub repository.

In [None]:
%%html

Star
