# Create and Query ML Lineage between SageMaker - Models, Inference Endpoints, Feature Store, Processing Jobs and Datasources

---

#### Note: Set the kernel to Python 3 (Data Science) and the instance type to ml.t3.medium.


<div class="alert alert-info"> 💡 <strong> Quick Start </strong>
ML lineage tracking from data source to model endpoint. The challenge of reproducibility and lineage in machine learning (ML) is three-fold: code lineage, data lineage, and model lineage. Source version control is a standard for managing changes to code. For data lineage, most data storage services support versioning, which gives you the ability to track datasets at a given point in time. Model lineage combines code lineage, data lineage, and ML-specific information such as Docker containers used for training and deployment, model hyperparameters, and more.

<strong><a style="color: #0397a7 " href="https://aws.amazon.com/blogs/machine-learning/model-and-data-lineage-in-machine-learning-experimentation/">
    <u>Click here for a comprehensive overview of ML lineage concepts</u></a>
</strong>
</div>

Feature engineering is expensive and time-consuming, leading customers to adopt a feature store
for managing features across teams and models. Unfortunately, ML lineage solutions have yet to
adapt to this new concept of feature management. To achieve the full benefits of feature reuse,
customers need to be able to answer fundamental questions about features. For example, how
was this feature group built? What models are using this feature group? What features does my
model depend on? What features are built with this data source?

---

Amazon SageMaker ML Lineage Tracking creates and stores information about the steps of a machine learning (ML) workflow from data preparation to model deployment. With the tracking information you can reproduce the workflow steps, track model and dataset lineage, and establish model governance and audit standards. 

<strong><a style="color: #0397a7 " href="https://aws.amazon.com/blogs/machine-learning/extend-model-lineage-to-include-ml-features-using-amazon-sagemaker-feature-store/">
    <u>Read about ML lineage tracking with Feature Store</u></a>
</strong>

#### With SageMaker ML Lineage Tracking and Feature Store, data scientists and builders can do the following:
---
##### 1. Build confidence for reuse of existing features.

##### 2. Avoid re-inventing features that are based on the same raw data as existing features.

##### 3. Troubleshoot and audit models and model predictions.

##### 4. Manage features proactively.

---

## Contents

1. [Notebook Preparation](#Notebook-Preparation)
   1. [Imports](#Imports)
   1. [Check git submodules](#Check-git-submodules)
   1. [Check and update SageMaker version](#Check-and-update-Sagemaker-version)
   1. [Logging Settings](#Logging-Settings)
   1. [Module Configurations](#Module-Configurations)
   1. [Load persisted variables from previous modules](#Load-peristed-variables-from-previous-modules)
1. [ML Lineage Creation](#ML-Lineage-Creation) 
   1. [Create ML Lineage](#Create-ML-Lineage)
   1. [Verify ML Lineage](#Verify-ML-Lineage)
   1. [ML Lineage Graph](#ML-Lineage-Graph)
1. [ML Lineage Querying](#ML-Lineage-Querying)
   1. [What ML lineage relationships can you infer from this model's endpoint?](#A.)
   1. [What feature groups were used to train this model?](#B.)
   1. [What models were trained using this feature group?](#C.)
   1. [What feature groups were populated with data from this datasource?](#D.)
   1. [What datasources were used to populate a feature group?](#E.)


## Notebook Preparation

#### Imports

In [None]:
import sagemaker 
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker import get_execution_role
import pandas as pd
import logging
import os
import json
import sys
from pathlib import Path

path = Path(os.path.abspath(os.getcwd()))
package_dir = f'{str(path.parent)}/ml-lineage-helper'
print(package_dir) 
sys.path.append(package_dir)

#### Check git submodules

##### Check to confirm that the submodule ml-lineage-helper and all the files underneath are present as shown below. If not, please continue with the next instruction to update them.

![check submodule](../images/m8_nb1-check-submodules.png "check submodules")

---

##### Run the following command in a terminal under the [./amazon-sagemaker-feature-store-end-to-end-workshop] folder path to update the missing submodules.   

git submodule update --init --recursive

![run submodule update](../images/m8_nb1-run-submodules-update.png "run submodule update")



In [None]:
%load_ext autoreload
%autoreload 2
from ml_lineage_helper import *
from ml_lineage_helper.query_lineage import QueryLineage

#### Check and update Sagemaker version

In [None]:
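import importlib
import subprocess
from packaging import version  # installed as a dependency of the SageMaker SDK

# Compare parsed versions; comparing raw strings would rank '2.100.0' below '2.48.1'
if version.parse(sagemaker.__version__) < version.parse('2.48.1'):
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sagemaker==2.48.1'])
    importlib.reload(sagemaker)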
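
A version check like the one above depends on how the two versions are compared. The sketch below shows why comparing raw version strings can mislead and how comparing numeric tuples avoids the problem. The helper `version_tuple` and the release number `2.100.0` are hypothetical and used only for illustration; the helper assumes purely numeric, dot-separated versions.

```python
def version_tuple(version_string):
    """Turn '2.48.1' into (2, 48, 1); assumes purely numeric dot-separated parts."""
    return tuple(int(part) for part in version_string.split("."))


required = "2.48.1"
installed = "2.100.0"  # hypothetical newer release, used only for illustration

# Strings compare character by character, so '1' < '4' decides the result.
string_says_outdated = installed < required

# Tuples compare component by component as numbers.
tuple_says_outdated = version_tuple(installed) < version_tuple(required)

print(string_says_outdated, tuple_says_outdated)
```

With string comparison the newer release looks outdated, which would trigger an unwanted downgrade; the tuple comparison does not have that problem.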

#### Logging Settings

In [None]:
logger = logging.getLogger(__name__)  # use the module name, not the literal string '__name__'
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())
logger.info(f'Using SageMaker version: {sagemaker.__version__}')
logger.info(f'Using Pandas version: {pd.__version__}')

#### Module Configurations 

In [None]:
# SageMaker session
sess = sagemaker.Session()

# SageMaker Region
region = sess.boto_region_name
print(region)

# IAM role for executing the processing job.
iam_role = sagemaker.get_execution_role()

#### Load peristed variables from previous modules

In [None]:
# Retrieve Estimator parameters
%store -r training_jobName
print(training_jobName)

# Retrieve FG names
%store -r customers_feature_group_name
print(customers_feature_group_name)
%store -r products_feature_group_name
print(products_feature_group_name)
%store -r orders_feature_group_name
print(orders_feature_group_name)

# Retrieve Orders Datasource
%store -r orders_datasource
print(orders_datasource)

# Retrieve Processing Job
%store -r processing_job_name
print(processing_job_name)
%store -r processing_job_description
print(processing_job_description)

# Retrieve Endpoint Name
%store -r endpoint_name
print(endpoint_name)

# Retrieve Query String
%store -r query_string
print(query_string)
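
If an earlier module was skipped, `%store -r` leaves the corresponding name undefined and later cells fail with a `NameError`. The sketch below is one possible guard, not part of the workshop code: the helper `missing_variables` and the example namespace are hypothetical, and in the notebook you would pass `globals()` instead of the example dictionary.

```python
def missing_variables(required_names, namespace):
    """Return the names from required_names that are absent from namespace."""
    return [name for name in required_names if name not in namespace]


# Names restored with %store -r in the cell above.
required_names = [
    "training_jobName",
    "customers_feature_group_name",
    "products_feature_group_name",
    "orders_feature_group_name",
    "orders_datasource",
    "processing_job_name",
    "processing_job_description",
    "endpoint_name",
    "query_string",
]

# Hypothetical namespace standing in for the notebook's globals().
example_namespace = {
    "training_jobName": "example-training-job",
    "endpoint_name": "example-endpoint",
}

missing = missing_variables(required_names, example_namespace)
print(f"Missing variables: {missing}")
```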

---
## ML Lineage Creation
---

<div class="alert alert-info"> 💡 <strong> Why is feature lineage important? </strong>
<p>Lineage tracking can tie together a SageMaker Processing job, the raw data being processed, the processing code, the query you used against the Feature Store to fetch your training and test sets, the training and test data in S3, and the training code into a lineage represented as a DAG.</p>
</div>

![ML Lineage Tracking 1](../images/m8_nb1_ml-lineage-tracking-1.png "ML Lineage Tracking 1")

##### Imagine trying to track all of this manually for a large team, multiple teams, or even multiple business units. Lineage tracking and querying make this more manageable and help organizations move to ML at scale.

<div class="alert alert-info"> 💡 <strong> What relationships are important to track? </strong>
<p>The diagram below shows a sample set of ML lifecycle steps, artifacts, and associations that are
typically needed for model lineage when using a feature store, including:</p>
</div>

---

![ML Lineage Tracking 2](../images/m8_nb1_ml-lineage-tracking-2.png "ML Lineage Tracking 2")

---

#### Data source:
##### ML features depend on raw data sources like an operational data store, or a set of CSV files in Amazon S3.

#### Feature pipeline:
##### Production-worthy features are typically built using a feature pipeline that takes a set of raw data sources, performs feature transformations, and ingests resulting features into the feature store. Lineage tracking can help by associating those pipelines with their data sources and their target feature groups.

#### Feature sets:
##### Once features are in a feature store, data scientists query it to retrieve data for training and validation of a model. You can use lineage tracking to associate the feature store query with the produced dataset. This provides granular detail into which features were used and what feature history was selected across multiple feature groups.

#### Training job:
##### As the ML lifecycle matures to adopt the use of a feature store, model lineage can associate training with specific features and feature groups.

#### Model: 
##### In addition to relating models to hosting endpoints, they can be linked to their corresponding training job, and indirectly to feature groups.

#### Endpoint: 
##### Lastly, for online models, specific endpoints can be associated with the models they are hosting, completing the end-to-end chain from data sources to endpoints providing predictions.
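
The relationships described above form a directed graph. The following is a minimal sketch, assuming a single chain with hypothetical entity names and edges that point from an input to what it produces. It does not use the SageMaker API; it only illustrates how upstream and downstream questions reduce to graph traversal.

```python
from collections import defaultdict

# Hypothetical entity names; each edge points from an input to what it produces.
EDGES = [
    ("data_source", "feature_pipeline"),
    ("feature_pipeline", "feature_group"),
    ("feature_group", "training_dataset"),
    ("training_dataset", "training_job"),
    ("training_job", "model"),
    ("model", "endpoint"),
]


def build_index(edges):
    """Index edges in both directions so either side can be the starting point."""
    upstream = defaultdict(list)
    downstream = defaultdict(list)
    for source, destination in edges:
        downstream[source].append(destination)
        upstream[destination].append(source)
    return upstream, downstream


def reachable(start, neighbors):
    """Collect every node reachable from start by following neighbors."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        for next_node in neighbors.get(node, []):
            if next_node not in seen:
                seen.append(next_node)
                stack.append(next_node)
    return seen


upstream, downstream = build_index(EDGES)

# What does the model depend on?
print(reachable("model", upstream))
# What was built from the feature group?
print(reachable("feature_group", downstream))
```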

---


<div class="alert alert-info"> 💡 <strong>Tip</strong>
  <p>An end-to-end lineage solution needs to give you the means to access information about parameters, versioning, data sources, and their respective associations, so that you can understand everything that went into training the model.</p>
</div>

#### Clear (Delete) existing ML Lineage

<div class="alert alert-warning"> 💡 <strong>Warning!!!</strong>
    <p>Executing the <b>[delete_lineage_data()]</b> method will remove all lineage among the associated artifacts.</p>
    <p>Please <b>DO NOT UNCOMMENT AND EXECUTE</b> the following code unless you fully understand the consequences.</p>
</div>

In [None]:
endpoint_name

In [None]:
# sagemakersession = SageMakerSession(bucket_name=sess.default_bucket(),
#        region=region,
#        role_name=iam_role,
#        aws_profile_name="default",
#    )
# ml_lineage = MLLineageHelper(sagemaker_session=sagemakersession, sagemaker_model_name_or_model_s3_uri=endpoint_name)
# ml_lineage.delete_lineage_data()

#### Create ML Lineage
---

Lineage tracking can tie together a SageMaker Processing job, the raw data being processed, the processing code, the query you used against the Feature Store to fetch your training and test sets, the training and test data in S3, and the training code into a lineage represented as a DAG.

---

Many of the inputs are optional, but in this example we assume:
1. You started with a raw data source
2. You used SageMaker Data Wrangler to process the raw data and ingest it into the orders Feature Group.
3. You queried the Feature Store to create training and test datasets.
4. You trained a model in SageMaker on your training and test datasets.

In [None]:
# Model name is same as endpoint name in this example
ml_lineage = MLLineageHelper()
lineage = ml_lineage.create_ml_lineage(training_jobName, model_name=endpoint_name, query=query_string,
                                       feature_group_names=[customers_feature_group_name,
                                           products_feature_group_name,
                                           orders_feature_group_name], 
                                       sagemaker_processing_job_description=processing_job_description
                                      )

### Verify ML Lineage

In [None]:
# Print the ML Lineage
lineage

### ML Lineage Graph

<div class="alert alert-info"> 💡 <strong>Tip</strong>
  <p>Given the number of components that are part of a model’s lineage, you may want to inspect the lineage of not only the model but also any object associated with it. With a graph as the underlying data structure that supports lineage, you should have the flexibility to traverse an entity’s lineage from different focal points. You should be able to find the entire lineage of a model and all the components involved in creating it.</p>
</div>

In [None]:
# Visual Representation of the ML Lineage
ml_lineage.graph()
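
How `graph()` renders the lineage is not shown in this notebook. As a rough alternative for inspecting a lineage outside the notebook, the sketch below turns a list of labeled edges into Graphviz DOT text. The function `to_dot` and the entity names are hypothetical; the labels reuse SageMaker association type names. It assumes names contain no double quotes.

```python
def to_dot(edges, graph_name="lineage"):
    """Render (source, destination, label) triples as Graphviz DOT text."""
    lines = [f"digraph {graph_name} {{", "  rankdir=LR;"]
    for source, destination, label in edges:
        lines.append(f'  "{source}" -> "{destination}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical entities; labels mirror SageMaker association types.
example_edges = [
    ("orders-datasource", "processing-job", "ContributedTo"),
    ("processing-job", "orders-feature-group", "ContributedTo"),
    ("orders-feature-group", "training-job", "ContributedTo"),
    ("training-job", "model", "Produced"),
]

dot_text = to_dot(example_edges)
print(dot_text)
```

The resulting text can be saved to a `.dot` file and rendered with any Graphviz-compatible viewer.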


---
## ML Lineage Querying
---



<div class="alert alert-info"> 💡 <strong> What ML lineage relationships can you infer using this module? </strong>
<p>Feature management, auditing, and troubleshooting</p>
</div>

---

![ML Lineage Tracking 3](../images/m8_nb1_ml-lineage-tracking-3.png "ML Lineage Tracking 3")

---


##### A.
<div class="alert alert-info"> 💡 <strong>What ML lineage relationships can you infer from this model's endpoint?</strong>
<p>Query ML Lineage by SageMaker Model Name or SageMaker Inference Endpoint</p>
</div>

In [None]:
lineageObject = MLLineageHelper(sagemaker_model_name_or_model_s3_uri=endpoint_name)
lineageObject.df

---

##### B.
<div class="alert alert-info"> 💡 <strong>What feature groups were used to train this model?</strong>
<p>Given a SageMaker Model Name or artifact ARN, you can find associated Feature Groups</p>
</div>

In [None]:
query_lineage = QueryLineage()
query_lineage.get_feature_groups_from_model(endpoint_name)

---

##### C.
<div class="alert alert-info"> 💡 <strong>What models were trained using this feature group?</strong>
<p>Given a Feature Group ARN, you can find associated SageMaker Models</p>
</div>

In [None]:
feature_group = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sess)
query_lineage.get_models_from_feature_group(feature_group.describe()['FeatureGroupArn'])

---

##### D.
<div class="alert alert-info"> 💡 <strong>What feature groups were populated with data from this datasource?</strong>
<p>Given a data source's S3 URI or Artifact ARN, you can find associated SageMaker Feature Groups</p>
</div>

In [None]:
query_lineage.get_feature_groups_from_data_source(orders_datasource, 3)

---

##### E.
<div class="alert alert-info"> 💡 <strong>What datasources were used to populate a feature group?</strong>
<p>Given a Feature Group ARN, you can find associated data sources</p>
</div>

In [None]:
orders_feature_group = FeatureGroup(name=orders_feature_group_name, sagemaker_session=sess)
orders_feature_group_arn = orders_feature_group.describe()['FeatureGroupArn']
print(orders_feature_group_arn)
query_lineage.get_data_sources_from_feature_group(orders_feature_group_arn, max_depth=2)
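
The query above passes a `max_depth` argument. The source does not show how the helper interprets it, so the sketch below assumes it bounds the number of association hops followed upstream. The records are shaped like the summaries returned by the SageMaker `list_associations` API (`SourceArn`, `DestinationArn`), but the ARNs are shortened, hypothetical placeholders and the function `upstream_within` is not part of the helper library.

```python
from collections import deque


def upstream_within(associations, start_arn, max_depth):
    """Return {arn: hops} for entities at most max_depth hops upstream of start_arn."""
    sources_of = {}
    for summary in associations:
        sources_of.setdefault(summary["DestinationArn"], []).append(summary["SourceArn"])

    hops = {}
    queue = deque([(start_arn, 0)])
    while queue:
        arn, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not look past the depth limit
        for source in sources_of.get(arn, []):
            if source not in hops and source != start_arn:
                hops[source] = depth + 1
                queue.append((source, depth + 1))
    return hops


# Hypothetical, shortened ARNs used only for illustration.
RAW = "arn:example:artifact/raw-orders"
JOB = "arn:example:action/processing-job"
FG = "arn:example:feature-group/orders"

example_associations = [
    {"SourceArn": RAW, "DestinationArn": JOB, "AssociationType": "ContributedTo"},
    {"SourceArn": JOB, "DestinationArn": FG, "AssociationType": "ContributedTo"},
]

print(upstream_within(example_associations, FG, max_depth=1))
print(upstream_within(example_associations, FG, max_depth=2))
```

Under this assumption, a depth of 1 reaches only the processing job, while a depth of 2 also reaches the raw data source behind it.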

---