# Module 7: Feature Monitoring
**This notebook leverages AWS Glue DataBrew to implement Feature Monitoring of Offline Feature Groups.**

**Note:** Please set the kernel to `Python 3 (Data Science)` and the instance type to `ml.t3.medium`.


---

## Contents

1. [Overview](#Overview)
2. [Feature Group Monitoring Preparation](#Feature-Group-Monitoring-Preparation)
3. [Run Feature Group Monitoring using DataBrew Profile Job](#Run-Feature-Group-Monitoring-using-DataBrew-Profile-Job)
4. [Visualisation of Feature Group statistics](#Visualisation-of-Feature-Group-statistics)
5. [Clean up](#Clean-up)



## Overview

In a previous module (`Module-1 Introduction to SageMaker Feature Store`), we demonstrated how to create multiple feature groups in Amazon SageMaker Feature Store and ingest data into them.

In this notebook, we will illustrate how to implement feature monitoring using AWS Glue DataBrew to create feature statistics.

[AWS Glue DataBrew](https://aws.amazon.com/glue/features/databrew/) is a visual data preparation tool that helps you clean and normalize data without writing code. DataBrew also allows customers to specify which data quality statistics to auto-generate for datasets when running a profile job. This allows users to customize data profile statistics such as determining duplicate values, correlations, and outliers based on the nature and size of their datasets, and create a custom data profile overview with only the statistics that meet their needs.

We will cover the following aspects:

* Preparing for feature monitoring
* Running a DataBrew profile job to generate statistics for a feature group
* Extracting statistics from the profile job run and persisting them to S3
* Visualising the feature group statistics

![Feature Monitoring Architecture](../images/feature_monitoring_architecture.png "Feature Monitoring Architecture")

---

## Prerequisites

Before continuing with this module, you need to have run `Module-1 Introduction to SageMaker Feature Store`.

## Setup

#### Imports

In [None]:
%%capture
!pip install pyathena

In [None]:
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker import get_execution_role
import sagemaker
import logging
import boto3
import pandas as pd
import time
import re
import os
import sys
from IPython.display import display, Markdown
from time import gmtime, strftime

#module containing utility functions for this notebook
import feature_monitoring_utils

In [None]:
# Use the module's name for the logger (the unquoted __name__ variable, not the literal string)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())
logger.info(f'Using SageMaker version: {sagemaker.__version__}')

#### Retrieve Feature Group

In this notebook, we monitor the orders feature group created in Module 1 of the workshop. To use a different feature group, uncomment the `orders_feature_group_name` assignment in the cell below and set it to the name of your feature group.

In [None]:
# Retrieve the feature group name stored by the previous workshop notebooks
%store -r orders_feature_group_name

# To monitor a different feature group, uncomment the next line and set its name
#orders_feature_group_name = ""

# Set up the results bucket location
results_bucket=sagemaker.Session().default_bucket() # Change this to use a different S3 bucket
results_key='aws-databrew-results/Offline-FS'

#### Update SageMaker Studio execution role 
If you are running this notebook in Amazon SageMaker Studio, the IAM role assumed by your Studio user needs permission to perform Glue and DataBrew operations. To provide this permission to the role, do the following.

1. Open the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/).
2. Select Amazon SageMaker Studio and choose your user name.
3. Under **User summary**, copy just the name part of the execution role ARN.
4. Go to the [IAM console](https://console.aws.amazon.com/iam) and click on **Roles**.
5. Find the role associated with your SageMaker Studio user.
6. Under the **Permissions** tab, click **Attach policies** and add the following: **AWSGlueServiceRole**, **AWSGlueDataBrewFullAccessPolicy**, **AmazonAthenaFullAccess**.
7. Under **Trust relationships**, click **Edit trust relationship** and add the following JSON:
```
{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Principal": {
 "Service": "sagemaker.amazonaws.com"
 },
 "Action": "sts:AssumeRole"
 },
 {
 "Effect": "Allow",
 "Principal": {
 "Service": "glue.amazonaws.com"
 },
 "Action": "sts:AssumeRole"
 },
 {
 "Effect": "Allow",
 "Principal": {
 "Service": "databrew.amazonaws.com"
 },
 "Action": "sts:AssumeRole"
 }
 ]
}
```
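
The two manual steps above (copying the name part of the role ARN and writing the trust policy) can also be sketched in Python. This is only an illustration: the ARN below is a made-up placeholder, and the helper names `role_name_from_arn` and `build_trust_policy` are hypothetical, not part of this workshop's utilities.

```python
import json

# Services that must be allowed to assume the role, taken from the policy above
TRUSTED_SERVICES = [
    "sagemaker.amazonaws.com",
    "glue.amazonaws.com",
    "databrew.amazonaws.com",
]

def role_name_from_arn(role_arn: str) -> str:
    """Return the name part of an IAM role ARN (the text after the last '/')."""
    return role_arn.split("/")[-1]

def build_trust_policy(services):
    """Build a trust policy with one Allow statement per service principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service},
                "Action": "sts:AssumeRole",
            }
            for service in services
        ],
    }

# Hypothetical ARN, for illustration only
example_arn = "arn:aws:iam::123456789012:role/service-role/ExampleStudioRole"
role_name = role_name_from_arn(example_arn)
trust_policy_json = json.dumps(build_trust_policy(TRUSTED_SERVICES), indent=1)
print(role_name)
print(trust_policy_json)
```

If you prefer scripting over the console, the resulting JSON string is in the shape expected by the IAM `update_assume_role_policy` call in boto3 (`RoleName`, `PolicyDocument`). Whether your current credentials are allowed to make that call depends on your account setup.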
 

## Feature Group Monitoring Preparation

The function called below performs the following steps to prepare the data before running a DataBrew profile job and generating statistics:

* Create a snapshot of the feature group from the Offline Feature Store. Before using DataBrew's profiling capability, we need a snapshot that keeps only the latest version of each feature record and removes records marked as deleted.
* Crawl the snapshot table. The newly created table must be crawled to be accessible by DataBrew, so the function creates an AWS Glue crawler and starts it.
* Create an AWS Glue DataBrew dataset.
* Create an AWS Glue DataBrew profile job.

In [None]:
response_brew_prep = feature_monitoring_utils.feature_monitoring_prep(
 orders_feature_group_name, 
 results_bucket, 
 results_key,
 verbose = False
)
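
To make the snapshot step more concrete, here is a small in-memory sketch of the idea: keep the most recent version of each record, then drop records whose latest version is marked as deleted. The record layout (`order_id`, `event_time`, `is_deleted`) and the sample values are assumptions for illustration; the real logic lives in `feature_monitoring_utils` and runs against the Offline Feature Store, and it may differ in detail.

```python
def latest_snapshot(records, id_key="order_id", time_key="event_time"):
    """Keep the most recent record per id, then drop ids whose latest record is deleted."""
    latest = {}
    for record in records:
        current = latest.get(record[id_key])
        # ISO-8601 timestamps in the same format compare correctly as strings
        if current is None or record[time_key] > current[time_key]:
            latest[record[id_key]] = record
    return [r for r in latest.values() if not r["is_deleted"]]

# Made-up sample history: O1 was updated once, O2 was later deleted
records = [
    {"order_id": "O1", "event_time": "2021-07-01T10:00:00Z", "purchase_amount": 10.0, "is_deleted": False},
    {"order_id": "O1", "event_time": "2021-07-02T10:00:00Z", "purchase_amount": 12.5, "is_deleted": False},
    {"order_id": "O2", "event_time": "2021-07-01T11:00:00Z", "purchase_amount": 7.0, "is_deleted": False},
    {"order_id": "O2", "event_time": "2021-07-03T11:00:00Z", "purchase_amount": 7.0, "is_deleted": True},
]
snapshot = latest_snapshot(records)
print(snapshot)
```

Profiling the snapshot rather than the raw offline store avoids counting superseded versions of a record more than once, which would otherwise skew the statistics.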

## Run Feature Group Monitoring using DataBrew Profile Job

This section runs the AWS Glue DataBrew profile job defined in the previous section.

#### Launch DataBrew Profile Job

In [None]:
# Call the main profile execution function
response_brew_job = feature_monitoring_utils.feature_monitoring_run(
 orders_feature_group_name,
 verbose=False
)
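
How `feature_monitoring_run` waits for the job is not shown in this notebook. A common pattern for this kind of job is to start the run and poll its state until it reaches a terminal value; the sketch below illustrates that pattern under stated assumptions. The function `wait_for_job_run` is hypothetical, and a fake state sequence stands in for AWS so the example runs anywhere.

```python
import time

# States after which polling should stop (assumed set of terminal job-run states)
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}

def wait_for_job_run(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or max_polls is reached."""
    state = get_state()
    polls = 1
    while state not in TERMINAL_STATES and polls < max_polls:
        sleep(poll_seconds)
        state = get_state()
        polls += 1
    return state

# Stand-in for a real status lookup so the sketch runs without AWS access
fake_states = iter(["STARTING", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job_run(
    lambda: next(fake_states),
    poll_seconds=0,
    sleep=lambda seconds: None,
)
print(final_state)
```

With boto3, `get_state` could wrap the DataBrew client's `describe_job_run(Name=..., RunId=...)["State"]`, using the `RunId` returned by `start_job_run(Name=...)`. The polling interval and the maximum number of polls are arbitrary choices here.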

#### Retrieve Link to DataBrew Profile Job Visualizations

Users can access the statistics for a particular job run in the AWS Glue DataBrew console. The code below retrieves the URL.

![DataBrew Visualization](../images/databrew_visualization.png "DataBrew Visualization")

This link is also added as a tag to the Feature Group, as shown in the picture below.

![Feature Group Tag](../images/feature_group_tags.png "Feature Group Tag")


In [None]:
# Display the Report S3 location
databrew_profile_console_url = response_brew_job[2]
brew_results_s3 = response_brew_job[4]
print("Report is available at the following S3 location:\n" + brew_results_s3 + "\n")

# Display the DataBrew link
print("Please click on the link below to access visualizations in the Glue DataBrew console:")
databrew_link = f'[DataBrew Profile Job Visualizations]({databrew_profile_console_url})'
display(Markdown(databrew_link))
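
If you want to fetch the report programmatically, the S3 location usually needs to be split into a bucket and a key first. The sketch below assumes the location is an `s3://bucket/key` style URI, which is not confirmed by this notebook; the helper `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse

def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")

# Hypothetical location, for illustration only
bucket, key = split_s3_uri("s3://example-bucket/aws-databrew-results/Offline-FS/report.json")
print(bucket, key)
```

The resulting pair matches the `Bucket` and `Key` arguments of the boto3 S3 client's `get_object` call.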

### Ingest more rows into the feature group to simulate feature drift

To observe feature drift, we simulate some data changes in the orders feature group. The CSV file used for the ingestion was generated in the first module.

If the corresponding CSV file is not present in the *../data/transformed/* folder, you will have to generate it by running the notebook *m1_nb0_prepare_datasets.ipynb* from the first module.

To simulate feature drift for another feature group, you will need to customize this part and generate some changes to your data before running the visualization section.

In [None]:
# Provide the path to the input CSV file
csv_path = '../data/transformed/orders.csv'
# Count the data rows (total lines minus the header line) without shelling out to wc
with open(csv_path) as csv_file:
    orders_count = sum(1 for _ in csv_file) - 1
nbrows = 0
nb_runs = 2 # Number of reports we want to generate
nb_sample = int(orders_count*0.25/nb_runs)

# Ingest rows and execute profile job
for x in range(nb_runs):
 nbrows = nbrows + nb_sample
 # Ingest rows into FG from CSV file
 feature_monitoring_utils.ingest_rows_fg(orders_feature_group_name, csv_path, nbrows=nbrows)
 
 # Call the main profile execution function
 resp_job = feature_monitoring_utils.feature_monitoring_run(orders_feature_group_name, verbose=False)
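
The arithmetic in the loop above is easier to follow with concrete numbers. The sketch below reproduces only the computation of the `nbrows` value passed at each run; the helper name `ingestion_schedule` and the row counts are made up for illustration, and how `ingest_rows_fg` selects those rows is not shown in this notebook.

```python
def ingestion_schedule(orders_count, nb_runs=2, fraction=0.25):
    """Return the cumulative nbrows value used at each run of the loop."""
    nb_sample = int(orders_count * fraction / nb_runs)
    return [nb_sample * (run + 1) for run in range(nb_runs)]

# With 100,000 data rows, 25% of the file is spread over two runs
print(ingestion_schedule(100000))
```

Because `nbrows` grows by `nb_sample` at each iteration, the last run is called with roughly 25% of the file's row count, and each run is followed by a new profile job so that one report is produced per ingestion step.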

## Visualisation of Feature Group statistics

### Overview

The main goal of this section is to visualize features and their trends over time, based on the data quality statistics reports generated by AWS Glue DataBrew.

### Main steps

- Consolidate the historical monitoring reports into a data frame and save it as CSV files in S3
- Visualize feature statistics over time


#### Create a Pandas dataframe from the JSON reports and write CSV files to S3

These CSV files can then be used by other visualisation or BI tools.

In [None]:
start_date = '20210701' # or: input('Enter a date in the form YYYYmmdd: ')

# Create a Pandas dataframe from the JSON reports and write CSV files to S3
response_consolid = feature_monitoring_utils.consolidate_monitor_reports(
 orders_feature_group_name, 
 results_bucket,
 results_key,
 start_date
)

analysis_data = response_consolid[0]
consolidated_s3_file = response_consolid[2]
print(consolidated_s3_file)
analysis_data.head()
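
The consolidation step can be pictured as flattening one JSON report per run into rows of a single table. The sketch below uses only the standard library and a deliberately simplified report layout (a `columns` list with `name`, `entropy` and `standardDeviation` keys); that layout, the sample values, and the helper names are assumptions for illustration, and the actual `consolidate_monitor_reports` implementation may work differently.

```python
import csv
import io
import json

def flatten_reports(reports, start_date, statistics=("entropy", "standardDeviation")):
    """Turn {report_date: profile_dict} into one row per (date, feature)."""
    rows = []
    for report_date, profile in sorted(reports.items()):
        # YYYYmmdd strings sort chronologically, so a string comparison is enough
        if report_date < start_date:
            continue
        for column in profile.get("columns", []):
            row = {"report_date": report_date, "feature": column["name"]}
            for stat in statistics:
                row[stat] = column.get(stat)
            rows.append(row)
    return rows

def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up reports keyed by run date (assumed, simplified structure)
reports = {
    "20210702": json.loads('{"columns": [{"name": "purchase_amount", "entropy": 3.1, "standardDeviation": 20.5}]}'),
    "20210701": json.loads('{"columns": [{"name": "purchase_amount", "entropy": 2.9, "standardDeviation": 18.0}]}'),
    "20210630": json.loads('{"columns": [{"name": "purchase_amount", "entropy": 2.8, "standardDeviation": 17.0}]}'),
}
rows = flatten_reports(reports, start_date="20210701")
csv_text = rows_to_csv(rows)
print(csv_text)
```

A long format like this (one row per date and feature) is convenient for BI tools, because new statistics become extra columns and new runs become extra rows.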

#### Visualize the evolution in time for a particular feature

**purchase_amount's entropy drift over time**

In [None]:
feature_name = 'purchase_amount'
analysis_data_purchase_amount = feature_monitoring_utils.plot_profiling_data(
 analysis_data, 
 feature_name, 
 ['entropy'], 
 kind='line'
)
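
As background for the plot above, entropy measures how spread out the values of a feature are: it is zero when every row has the same value and grows as values become more evenly distributed over more distinct values. The sketch below computes Shannon entropy from value counts. It is a generic illustration; the exact definition and logarithm base used by DataBrew are not stated in this notebook, so the numbers are not expected to match the report.

```python
import math
from collections import Counter

def shannon_entropy(values, base=2):
    """Shannon entropy of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total, base) for c in counts.values())

# Made-up purchase amounts: the second sample has more distinct values
before = [10, 10, 20, 20]
after = [10, 15, 20, 25]
print(shannon_entropy(before), shannon_entropy(after))
```

A sustained change in entropy between runs suggests that the distribution of the feature has shifted, which is why it is a useful statistic to track over time.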

**Two statistics (entropy and standardDeviation) for the purchase_amount feature**

**Separate plots**

In [None]:
analysis_data_purchase_amount = feature_monitoring_utils.plot_profiling_data(
 analysis_data, 
 feature_name, 
 ['entropy','standardDeviation'],
 multiple_plots=True
)

**Same plot with different scales for each statistic**

In [None]:
analysis_data_purchase_amount = feature_monitoring_utils.plot_profiling_data(
 analysis_data, 
 feature_name, 
 ['entropy','standardDeviation']
)
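
The plots support visual inspection of drift. If you also want a simple numeric signal, one option is to compare each run with the previous one and flag large relative changes. The sketch below is an assumption-laden illustration: the helper names, the sample history, and the 10% threshold are all made up and should be tuned for your own features.

```python
def relative_changes(series):
    """Return (date, relative change vs previous value) for each consecutive pair."""
    changes = []
    for (_, prev), (date, curr) in zip(series, series[1:]):
        if prev == 0:
            continue  # relative change is undefined when the previous value is zero
        changes.append((date, (curr - prev) / abs(prev)))
    return changes

def flag_drift(series, threshold=0.10):
    """Return the dates whose statistic moved by more than `threshold` (as a fraction)."""
    return [date for date, change in relative_changes(series) if abs(change) > threshold]

# Made-up entropy history for one feature, ordered by run date
entropy_history = [("20210701", 3.00), ("20210702", 3.03), ("20210703", 3.60)]
print(flag_drift(entropy_history))
```

A run-over-run comparison is sensitive to sudden shifts but can miss slow drift; comparing against a fixed baseline run is an alternative worth considering.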

**purchase_amount's entropy drift over time using a Bar Plot**

## Clean up

**This section deletes the AWS resources that were created as part of this module.**

In [None]:
# Clean up resources - crawler, DataBrew dataset, and profile job - as well as local analysis folder
analysis_folder_name=response_consolid[4]

response = feature_monitoring_utils.feature_monitoring_cleanup(
 orders_feature_group_name, 
 analysis_folder_name
)