# Part 2: Setting Up an Amazon Fraud Detector Model

In [None]:
# Uncomment and install s3fs, this is required to read CSV files from S3 directly into Pandas dataframe
# Once installed, please restart the Notebook Kernel (Kernel > Restart Kernel) before proceeding

#%pip install s3fs

## Overview <a id='overview'></a>

* [Notebook 1: Data Preparation, Process, and Store Features](./1-data-analysis-prep.ipynb)
* **[Notebook 2: Amazon Fraud Detector Model Setup](./2-afd-model-setup.ipynb)**
    * **[Introduction](#intro)**
    * **[Setup Notebook](#setup)**
    * **[Set AFD Entity type, event type, and Detector names](#entity)**
    * **[Profile Your Dataset](#profile)**
    * **[Create Labels, Variables, Entity and Event Types](#labels)**
    * **[Conclusion](#conclusion)**
* [Notebook 3: Model training, deployment, real-time and batch inference](./3-afd-model-train-deploy.ipynb)
* [Notebook 4: Create an end-to-end pipeline](./4-afd-pipeline.ipynb)

### 1. Introduction <a id="intro"></a>
___
<a href="#overview">overview</a>

Amazon Fraud Detector is a fully managed service that makes it easy to identify potentially fraudulent online activities such as online payment fraud and the creation of fake accounts. Fraud Detector capitalizes on the latest advances in machine learning (ML) and 20 years of fraud detection expertise from AWS and Amazon.com to automatically identify potentially fraudulent activity so you can catch more fraud faster.

In this notebook and the next, we'll use the Amazon Fraud Detector API to define an entity and event of interest and use CSV data stored in S3 to train a model. Next, we'll derive some rules and create a "detector" by combining our entity, event, model, and rules into a single endpoint. Finally, we'll apply the detector to a sample of our data to identify potentially fraudulent events. This notebook covers the first part (labels, variables, entity type, and event type); model training, the detector, and predictions follow in the next notebook.

After running this notebook and the next, you should be able to:

* Define an Entity and Event
* Create a Detector
* Train a Machine Learning (ML) Model
* Author Rules to identify potential fraud based on the model's score
* Apply the Detector's "predict" function, to generate a model score and rule outcomes on data

If you would like to know more, please check out [Fraud Detector's Documentation](https://docs.aws.amazon.com/frauddetector/latest/ug/what-is-frauddetector.html).

To create models within Amazon Fraud Detector, you must provide data for training. This data has input features (defined by variables) and output labels (defined by labels in the Amazon Fraud Detector service). Additionally, you define events based on the type of entities sending the data for predictions. The following diagram shows the sequence of component creation followed in this tutorial.

<img src="images/nb2.png" width="600" height="800"/>

### IAM Permissions
---

To use Amazon Fraud Detector, you have to set up permissions that allow access to the Amazon Fraud Detector console and API operations. You also have to allow Amazon Fraud Detector to perform tasks on your behalf and to access resources that you own. 

The following policies provide the required permission to use Amazon Fraud Detector:

* `AmazonFraudDetectorFullAccessPolicy`
  Allows you to perform the following actions:
  - Access all Amazon Fraud Detector resources  
  - List and describe all model endpoints in Amazon SageMaker  
  - List all IAM roles in the account  
  - List all Amazon S3 buckets  
  - Allow IAM Pass Role to pass a role to Amazon Fraud Detector  


* `AmazonS3FullAccess`
   Allows full access to Amazon S3. This is required to upload training files to S3.

In this case we will assign `AmazonFraudDetectorFullAccessPolicy` and `AmazonS3FullAccess` policies to the SageMaker Execution Role.
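
As a rough illustration of the assignment described above, the sketch below attaches both managed policies to a role through the IAM `attach_role_policy` call. The helper name `attach_policies` is hypothetical, the IAM client is passed in by the caller, and the caller is assumed to have permission to attach role policies; you can equally do this in the IAM console.

```python
# Managed policy ARNs for the two policies named above.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonFraudDetectorFullAccessPolicy",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]


def attach_policies(iam_client, role_name, policy_arns=POLICY_ARNS):
    """Attach each managed policy to the role and return the ARNs that were sent.

    iam_client is expected to behave like boto3.client("iam").
    attach_policies is a hypothetical helper, not part of any AWS SDK.
    """
    attached = []
    for arn in policy_arns:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=arn)
        attached.append(arn)
    return attached


# Example call (requires AWS credentials, so it is left commented out):
# import boto3, sagemaker
# role_name = sagemaker.get_execution_role().split("/")[-1]
# attach_policies(boto3.client("iam"), role_name)
```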

### Plan

#### Plan a Fraud Detector
---

A Detector contains the event, model(s), and rule(s) that make up the detection logic for a particular type of fraud that you want to detect. We'll use the following seven-step process to plan a Fraud Detector:



* Set up your notebook
  - Name the major components `entity`, `entity type`, `model`, `detector`
  - Get IAM role ARN
  - S3 Bucket with your training data CSV File
* Read and Profile your Data
  - This will give you an idea of what your dataset contains
  - This will also identify the variables and labels that will need to be created to define your event
* Create event variables and labels
  - This will create the variables and labels in fraud detector
* Define your Entity and Event Type
  - What is the activity that you are detecting? That's likely your Event Type (e.g., account_registration)
  - Who is performing this activity? That's likely your Entity (e.g., customer)
* Create and Train your Model
  - Model training takes anywhere from 45 to 60 minutes
  - Promote your model once training is complete
* Create Detector, generate Rules and assemble your Detector
  - Create your detector
  - Create rules based on your model scores
    - Define outcomes (e.g., fraud, investigate and approve)
  - Assemble your detector by adding your model and rules to it
* Test your Detector
  - Interactively call predict on a handful of records


### 2. Set Up Your Notebook <a id="setup"></a>
---
<a href="#overview">overview</a>

1. Name the major components of Fraud Detector
2. Get IAM role ARN 
3. S3 Bucket with your training data CSV File

You can then execute the code cells in the notebook interactively; there is no need to change anything unless you want to.


<div class="alert alert-info"> <h4><strong>💡 Fraud Detector Components</strong></h4>
EVENT_TYPE is a business activity that you want evaluated for fraud risk. ENTITY_TYPE represents the "what or who" that is performing the event you want to evaluate. MODEL_NAME is the name of your supervised machine learning model that Fraud Detector trains on your behalf. DETECTOR_NAME is the name of the detector that contains the detection logic (model and rules) that you apply to events that you want to evaluate for fraud.

</div>

We will import some necessary libraries that will be used throughout this notebook.

In [1]:
# display and HTML are imported from IPython.display; the IPython.core.display path is deprecated
from IPython.display import display, HTML, clear_output, JSON

display(HTML("<style>.container { width:90% }</style>"))
# ------------------------------------------------------------------

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import sys
import time
import json
import uuid 
from datetime import datetime
import boto3
import sagemaker

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


#### Set region, boto3 and SageMaker SDK variables

We will initialize Boto3 client objects for Fraud Detector, S3, and SageMaker.

In [2]:
#You can change this to a region of your choice
region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))

Using AWS Region: us-east-2


In [3]:
boto3.setup_default_session(region_name=region)

boto_session = boto3.Session(region_name=region)

# -- initialize S3 Client
s3_client = boto3.client('s3', region_name=region)

# -- initialize the AFD client 
client = boto3.client('frauddetector')

sagemaker_boto_client = boto_session.client('sagemaker')

sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_boto_client)

# -- suffix is appended to detector and model name for uniqueness  
sufx   = datetime.now().strftime("%Y%m%d")

Next, we get the SageMaker execution role ARN and store it for use in later notebooks.

In [4]:
print('SageMaker Role:', sagemaker.get_execution_role().split('/')[-1])

ARN_ROLE = sagemaker.get_execution_role()
%store ARN_ROLE

SageMaker Role: AmazonSageMaker-ExecutionRole-20201030T135016
Stored 'ARN_ROLE' (str)


#### Set S3 training data file location

We will now initialize a variable with the file path of our training data. If you have stepped through and executed the [1-data-analysis-prep.ipynb](./1-data-analysis-prep.ipynb) notebook, you should have your final training data set CSV uploaded into a location in S3. If not, you may use the training dataset that is included in the `data/` directory `afd_training_data.csv`. 

Executing the subsequent code cells will initialize S3-related variables and pull variables stored in Jupyter's local cache in case you have executed the previous notebook ([1-data-analysis-prep.ipynb](./1-data-analysis-prep.ipynb)). Once you replace `YOUR_PREFIX_GOES_HERE` with your S3 prefix, the code checks whether the file exists at the S3 path; if it does not, it uploads the provided training data to the default S3 location.

In [5]:
S3_FILE = "afd_training_data.csv"
training_prefix = "training_data"

<div class="alert alert-warning"> <strong>✅ Replace YOUR_PREFIX_GOES_HERE</strong>
with your S3 bucket prefix where your training data CSV file resides in the code cell below.

</div>

In [7]:
from datetime import datetime
current_time = datetime.now()

if 'afd_bucket' in globals():
    %store -r afd_bucket
    %store -r afd_prefix
    S3_BUCKET = afd_bucket
    print(f'{current_time}: Using default bucket: {S3_BUCKET}... Initialized folder {S3_BUCKET}/{afd_prefix}/{training_prefix}')    
else:
    print(f'{current_time}: Bucket name not in local cache initializing')
    # initialize with sagemaker default bucket and a prefix where your training data is located
    afd_bucket = sagemaker_session.default_bucket()
    afd_prefix = YOUR_PREFIX_GOES_HERE        # ---> Add your prefix here
    %store afd_bucket
    %store afd_prefix
    S3_BUCKET = afd_bucket
    print(f'{current_time}: Bucket {S3_BUCKET}... Initialized folder {afd_prefix}/{training_prefix}')

2021-05-13 19:30:19.717478: Using default bucket: sagemaker-us-east-2-965425568475... Initialized folder sagemaker-us-east-2-965425568475/amazon-fraud-detector/training_data


In [8]:
current_time = datetime.now()

try:
    # Check if the file exists in the said S3 bucket/prefix location
    objects_in_bucket = s3_client.list_objects(Bucket=S3_BUCKET, Prefix=f"{afd_prefix}/{training_prefix}/{S3_FILE}")
    print(f"{current_time}: File {objects_in_bucket['Contents'][0]['Key']} found")
    S3_FILE_LOC = f"s3://{S3_BUCKET}/{afd_prefix}/{training_prefix}/{S3_FILE}"
    %store S3_FILE_LOC
    print(f"{current_time}: S3 Location initialized ... s3://{S3_BUCKET}/{afd_prefix}/{training_prefix}/{S3_FILE}")

except Exception as e:
    print(f"{current_time}: File {afd_prefix}/{training_prefix}/{S3_FILE} not found, uploading from local...")    
    print(f"{current_time}: Uploading File {afd_prefix}/{training_prefix}/{S3_FILE} to s3://{S3_BUCKET} ...")   
    # Upload the training data from local to the S3 bucket
    s3_client.upload_file(Filename=f'data/{S3_FILE}', Bucket=S3_BUCKET, Key=f'{afd_prefix}/{training_prefix}/{S3_FILE}')
    S3_FILE_LOC = f"s3://{S3_BUCKET}/{afd_prefix}/{training_prefix}/{S3_FILE}"
    %store S3_FILE_LOC

2021-05-13 19:30:26.497863: File amazon-fraud-detector/training_data/afd_training_data.csv found
Stored 'S3_FILE_LOC' (str)
2021-05-13 19:30:26.497863: S3 Location initalized ... s3://sagemaker-us-east-2-965425568475/amazon-fraud-detector/training_data/afd_training_data.csv
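
The cell above treats any exception as "file not found". As a hedged alternative, the sketch below checks for the exact key with `list_objects_v2` and uploads only when it is missing, so unrelated errors (for example, missing permissions) are not silently turned into an upload. The helper name `ensure_training_file` is hypothetical, and the S3 client is passed in by the caller.

```python
def ensure_training_file(s3_client, bucket, key, local_path):
    """Upload local_path to s3://bucket/key unless that exact key already exists.

    Returns (s3_uri, uploaded). s3_client is expected to behave like
    boto3.client("s3"). ensure_training_file is a hypothetical helper.
    """
    # Keys are listed in sorted order, so if the exact key exists it is the
    # first result for a listing that uses the key itself as the prefix.
    listing = s3_client.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    found = any(obj["Key"] == key for obj in listing.get("Contents", []))
    if not found:
        s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}", not found


# Example call with the notebook's variables (requires AWS credentials):
# key = f"{afd_prefix}/{training_prefix}/{S3_FILE}"
# S3_FILE_LOC, uploaded = ensure_training_file(s3_client, S3_BUCKET, key, f"data/{S3_FILE}")
```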


### 3. Set AFD Entity type, event type, and Detector names <a id="entity"></a>
---
<a href="#overview">overview</a>

In [12]:

ENTITY_TYPE    = "afd_demo_entity_{0}".format(sufx) 
ENTITY_DESC    = "AFD Entity: {0}".format(sufx) 

EVENT_TYPE     = "afd_demo_event_{0}".format(sufx) 
EVENT_DESC     = "AFD Event Type: {0}".format(sufx) 

MODEL_NAME     = "afd_demo_model_{0}".format(sufx) 
MODEL_DESC     = "AFD model trained on: {0}".format(sufx) 

DETECTOR_NAME  = "afd_detector_{0}".format(sufx)                        
DETECTOR_DESC  = "Detects synthetic fraud events created: {0}".format(sufx) 

# store name in cache
%store ENTITY_TYPE
%store ENTITY_DESC
%store EVENT_TYPE
%store EVENT_DESC
%store MODEL_NAME
%store MODEL_DESC
%store DETECTOR_NAME
%store DETECTOR_DESC


Stored 'ENTITY_TYPE' (str)
Stored 'ENTITY_DESC' (str)
Stored 'EVENT_TYPE' (str)
Stored 'EVENT_DESC' (str)
Stored 'MODEL_NAME' (str)
Stored 'MODEL_DESC' (str)
Stored 'DETECTOR_NAME' (str)
Stored 'DETECTOR_DESC' (str)
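
Because the date suffix becomes part of every resource name, it can help to validate the names before calling the API. The sketch below rebuilds the four names and checks them against a conservative rule (lowercase letters, digits, and underscores, up to 64 characters). That rule is an assumption made for illustration; consult the Amazon Fraud Detector API reference for the exact constraints on each resource. `ResourceNames` and `build_names` are hypothetical helpers.

```python
import re
from dataclasses import dataclass, asdict
from datetime import datetime

# Assumption: a conservative naming rule, not the authoritative service rule.
_NAME_RE = re.compile(r"[0-9a-z_]{1,64}")


@dataclass(frozen=True)
class ResourceNames:
    entity_type: str
    event_type: str
    model: str
    detector: str


def build_names(suffix=None):
    """Build the four resource names and reject any that break the assumed rule."""
    suffix = suffix or datetime.now().strftime("%Y%m%d")
    names = ResourceNames(
        entity_type=f"afd_demo_entity_{suffix}",
        event_type=f"afd_demo_event_{suffix}",
        model=f"afd_demo_model_{suffix}",
        detector=f"afd_detector_{suffix}",
    )
    for field, value in asdict(names).items():
        if not _NAME_RE.fullmatch(value):
            raise ValueError(f"{field} name {value!r} breaks the assumed naming rule")
    return names
```

Calling `build_names()` with no argument uses today's date, matching the `sufx` variable defined earlier.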


### 4. Profile Your Dataset <a id="profile"></a>
-----
<a href="#overview">overview</a>

A small profiler utility function `summary_stats()` is defined in the `data_profiler.py` file. The function will: 
* Profile your data, creating descriptive statistics 
* Perform basic data quality checks (nulls, unique variables, etc.), and 
* return summary statistics and the EVENT and MODEL schemas used to define your EVENT_TYPE and TRAIN your MODEL.
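
The source of `data_profiler.py` is not shown here, so the following is only a rough sketch of the kind of per-column statistics such a profiler might compute, written against plain row dictionaries. The function name `profile_columns` and the 90% uniqueness threshold are assumptions; the real `summary_stats()` also infers feature types and evidently exempts some columns (such as the timestamp) from the uniqueness warning, which this sketch does not.

```python
def profile_columns(rows, unique_threshold=0.9):
    """Return one stats dict per column for a list of row dicts.

    Hypothetical simplification of a data profiler: counts nulls and distinct
    values, and flags columns whose share of distinct values exceeds the threshold.
    """
    if not rows:
        return []
    total = len(rows)
    stats = []
    for name in rows[0]:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in (None, "")]
        nunique = len(set(present))
        nunique_pct = round(nunique / total, 4)
        null_count = total - len(present)
        stats.append({
            "feature_name": name,
            "count": total,
            "nunique": nunique,
            "null": null_count,
            "null_pct": round(null_count / total, 4),
            "nunique_pct": nunique_pct,
            "feature_warning": (
                "EXCLUDE, GT 90% UNIQUE" if nunique_pct > unique_threshold else "NO WARNING"
            ),
        })
    return stats
```

A nearly unique column, such as a phone number, carries little signal for a model, which is why a profiler might recommend excluding it.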


In [10]:
import sys  
import s3fs   # This is required to read CSV data directly from S3 into Pandas dataframe

# Import profiler function
sys.path.insert(0, './')
from data_profiler import summary_stats

<div class="alert alert-info"><strong>💡 Note: </strong>

If you make changes to the data_profiler.py script after you execute the code cell above, please make sure to restart the Kernel (Kernel > Restart Kernel) and run the notebook again.

</div>

In [11]:
# Load the Training data set in a dataframe
df = pd.read_csv(S3_FILE_LOC)
df.describe()

# ------
# Alternate: If the code above fails to execute then comment the above two lines 
# and uncomment the lines below and execute this cell again

# fs = s3fs.S3FileSystem(anon=False)
# with fs.open(S3_FILE_LOC) as f:
#     df = pd.read_csv(f)

# -----

df_stats, trainingDataSchema, eventVariables, eventLabels = summary_stats(df)
%store trainingDataSchema
%store eventVariables

| | feature_name | dtype | count | nunique | null | not_null | null_pct | nunique_pct | feature_type | feature_warning |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EVENT_TIMESTAMP | object | 100133 | 90634 | 0 | 100133 | 0.0 | 0.9051 | EVENT_TIMESTAMP | NO WARNING |
| 1 | EVENT_LABEL | object | 100133 | 2 | 0 | 100133 | 0.0 | 0.0 | TARGET | NO WARNING |
| 2 | ip_address | object | 100133 | 3801 | 0 | 100133 | 0.0 | 0.038 | IP_ADDRESS | NO WARNING |
| 3 | email_address | object | 100133 | 3296 | 0 | 100133 | 0.0 | 0.0329 | EMAIL_ADDRESS | NO WARNING |
| 4 | user_agent | object | 100133 | 2867 | 0 | 100133 | 0.0 | 0.0286 | CATEGORY | NO WARNING |
| 5 | customer_name | object | 100133 | 71178 | 0 | 100133 | 0.0 | 0.7108 | CATEGORY | NO WARNING |
| 6 | phone_number | object | 100133 | 99371 | 0 | 100133 | 0.0 | 0.9924 | CATEGORY | EXCLUDE, GT 90% UNIQUE |
| 7 | customer_city | object | 100133 | 3430 | 0 | 100133 | 0.0 | 0.0343 | CATEGORY | NO WARNING |
| 8 | customer_postal | float64 | 100133 | 1993 | 0 | 100133 | 0.0 | 0.0199 | NUMERIC | NO WARNING |
| 9 | customer_state | object | 100133 | 51 | 0 | 100133 | 0.0 | 0.0005 | CATEGORY | NO WARNING |


<IPython.core.display.JSON object>

<IPython.core.display.JSON object>

<IPython.core.display.JSON object>

Stored 'trainingDataSchema' (dict)
Stored 'eventVariables' (list)


### 5. Create Labels, Variables, Entity and Event Types <a id="labels"></a>
-----
<a href="#overview">overview</a>

1. **Events and Event Types**

   An event is a business activity that is evaluated for fraud risk. With Amazon Fraud Detector, you generate fraud predictions for events. An event type defines the structure for an event sent to Amazon Fraud Detector. This includes the variables sent as part of the event, the entity performing the event (such as a customer), and the labels that classify the event. Example event types include online payment transactions, account registrations, and authentication.

2. **Entity and Entity Type**

   An entity represents who is performing the event. As part of a fraud prediction, you can pass the entity ID to indicate the specific entity who performed the event. An entity type classifies the entity. Example classifications include customer, merchant, or account.
   
Before we can create Event and Entity types, we must create Labels and Variables.

3. **Label**

   A label classifies an event as fraudulent or legitimate. Labels are used to train supervised machine learning models in Amazon Fraud Detector.
   
4. **Variable**

   A variable represents a data element associated with an event that you want to use in a fraud prediction. Variables can either be sent with an event as part of a fraud prediction or derived, such as the output of an Amazon Fraud Detector model or Amazon SageMaker model. In this case we will create variables based on the input features in our training dataset and their corresponding datatypes.

For more information, refer to the [documentation](https://docs.aws.amazon.com/frauddetector/latest/ug/frauddetector-ml-concepts.html). 
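
To make these concepts concrete, the sketch below takes one made-up training record that uses column names from the profile table and separates the event metadata (timestamp and label) from the variables. The sample values and the helper `split_event_record` are hypothetical; variable values are converted to strings because the prediction API takes event variables as a string-to-string map.

```python
# Columns that describe the event itself rather than its variables.
METADATA_FIELDS = ("EVENT_TIMESTAMP", "EVENT_LABEL")

# Made-up sample record; the column names come from the training data profile.
SAMPLE_RECORD = {
    "EVENT_TIMESTAMP": "2021-05-13T19:30:19Z",
    "EVENT_LABEL": "legit",
    "ip_address": "192.0.2.10",
    "email_address": "user@example.com",
    "customer_postal": 43215.0,
}


def split_event_record(record):
    """Separate event metadata from the variables sent with the event."""
    metadata = {k: record[k] for k in METADATA_FIELDS if k in record}
    variables = {k: str(v) for k, v in record.items() if k not in METADATA_FIELDS}
    return metadata, variables


metadata, variables = split_event_record(SAMPLE_RECORD)
print(metadata)
print(variables)
```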


#### 5.1 Create Label and Variables
---

We are going to use the [PutLabel](https://docs.aws.amazon.com/frauddetector/latest/api/API_PutLabel.html) API to create labels for the Fraud Detector model. A label classifies an event as fraudulent or legitimate. Labels are associated with event types and used to train supervised machine learning models in Amazon Fraud Detector. 

In [25]:
try:
    fraud_lbl = client.put_label(
                name = "fraud",
                description = 'fraud')
    
    legit_lbl = client.put_label(
                name = "legit",
                description = 'legit')
    
    print(f"Labels have been created")
    display(JSON(fraud_lbl))
    display(JSON(legit_lbl))
except Exception as e:
    print(e)

Labels have been created


<IPython.core.display.JSON object>

<IPython.core.display.JSON object>

We have a small helper function that looks through our dataset stats and creates the variables required for the AFD model. This function uses the [CreateVariable](https://docs.aws.amazon.com/frauddetector/latest/api/API_CreateVariable.html) API.

In [35]:
def create_variables(df_stats, MODEL_NAME):
    """
    Builds the list of model input variables from the profiler output, checks
    whether each variable exists, and, if not, adds it to Fraud Detector 
    
    Arguments: 
    df_stats   -- summary DataFrame returned by summary_stats(), with
                  'feature_name' and 'feature_type' columns
    MODEL_NAME -- name of the model (currently unused by this function)
    
    Returns:
    variable_list -- a list of variable dictionaries 
    
    """
    # orient="records" is the full spelling; newer pandas versions reject the abbreviation "record"
    enrichment_features = df_stats.loc[(df_stats['feature_type'].isin(['IP_ADDRESS', 'EMAIL_ADDRESS']))].to_dict(orient="records")
    numeric_features = df_stats.loc[(df_stats['feature_type'].isin(['NUMERIC']))]['feature_name'].to_dict()
    categorical_features = df_stats.loc[(df_stats['feature_type'].isin(['CATEGORY']))]['feature_name'].to_dict()
    
    variable_list = []
    # -- first do the enrichment features
    for feature in enrichment_features: 
        variable_list.append( {'name' : feature['feature_name']})
        try:
            resp = client.get_variables(name=feature['feature_name'])
        except Exception:  # variable lookup failed, so create the variable
            print("Creating variable: {0}".format(feature['feature_name']))
            resp = client.create_variable(
                    name = feature['feature_name'],
                    dataType = 'STRING',
                    dataSource ='EVENT',
                    defaultValue = '<unknown>', 
                    description = feature['feature_name'],
                    variableType = feature['feature_type'] )
                
               
    # -- check and update the numeric features 
    for feature in numeric_features: 
        variable_list.append( {'name' : numeric_features[feature]})
        try:
            resp = client.get_variables(name=numeric_features[feature])
        except Exception:  # variable lookup failed, so create the variable
            print("Creating variable: {0}".format(numeric_features[feature]))
            resp = client.create_variable(
                    name = numeric_features[feature],
                    dataType = 'FLOAT',
                    dataSource ='EVENT',
                    defaultValue = '0.0', 
                    description = numeric_features[feature],
                    variableType = 'NUMERIC' )
             
    # -- check and update the categorical features 
    for feature in categorical_features: 
        variable_list.append( {'name' : categorical_features[feature]})
        try:
            resp = client.get_variables(name=categorical_features[feature])
        except Exception:  # variable lookup failed, so create the variable
            print("Creating variable: {0}".format(categorical_features[feature]))
            resp = client.create_variable(
                    name = categorical_features[feature],
                    dataType = 'STRING',
                    dataSource ='EVENT',
                    defaultValue = '<unknown>', 
                    description = categorical_features[feature],
                    variableType = 'CATEGORICAL' )
    
    return variable_list

Call the function to create the variables.

In [36]:
# Call the create_variables function
model_variables = create_variables(df_stats, MODEL_NAME)

# Display output
display(HTML("<h4>Model variable dict</h4>"))
display(JSON(model_variables))

<IPython.core.display.JSON object>
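
The three loops in `create_variables` differ only in the data type, default value, and variable type they pass. One possible way to make that mapping explicit is a lookup table keyed by the profiler's `feature_type`, as sketched below. The names `_TYPE_SETTINGS`, `variable_request`, and `variable_requests` are hypothetical; the values mirror those used in the function above.

```python
# feature_type from the profiler -> (dataType, defaultValue, variableType)
_TYPE_SETTINGS = {
    "IP_ADDRESS": ("STRING", "<unknown>", "IP_ADDRESS"),
    "EMAIL_ADDRESS": ("STRING", "<unknown>", "EMAIL_ADDRESS"),
    "NUMERIC": ("FLOAT", "0.0", "NUMERIC"),
    "CATEGORY": ("STRING", "<unknown>", "CATEGORICAL"),
}


def variable_request(feature_name, feature_type):
    """Build the keyword arguments for one create_variable call."""
    data_type, default_value, variable_type = _TYPE_SETTINGS[feature_type]
    return {
        "name": feature_name,
        "dataType": data_type,
        "dataSource": "EVENT",
        "defaultValue": default_value,
        "description": feature_name,
        "variableType": variable_type,
    }


def variable_requests(profile_rows):
    """Build requests for every row whose feature_type maps to a variable.

    Rows such as EVENT_TIMESTAMP and TARGET are skipped because they are
    event metadata, not model input variables.
    """
    return [
        variable_request(row["feature_name"], row["feature_type"])
        for row in profile_rows
        if row["feature_type"] in _TYPE_SETTINGS
    ]
```

Each resulting dictionary could then be passed along as `client.create_variable(**request)` once you have confirmed the variable does not already exist.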

#### 5.2 Create Entity and Event Types
---

We will use the [PutEntityType](https://docs.aws.amazon.com/frauddetector/latest/api/API_PutEntityType.html) API to create the entity type. The code checks whether the entity type exists and, if not, creates it.

In [43]:
try:
    response = client.get_entity_types( name = ENTITY_TYPE )
    
    display(HTML("<h4>Entity already exists</h4>"))
    display(JSON(response))
    
except Exception as e:
    print(f"Entity {ENTITY_TYPE} does not exist" )
    response = client.put_entity_type(
        name        = ENTITY_TYPE,
        description = ENTITY_DESC
    )
    display(HTML("<h4>Created entity</h4>"))
    display(JSON(response))

<IPython.core.display.JSON object>

We will then use the [PutEventType](https://docs.aws.amazon.com/frauddetector/latest/api/API_PutEventType.html) API to create the event type. The code checks whether the event type exists and, if not, creates it.

In [44]:
try:
    response = client.get_event_types( name = EVENT_TYPE )
    
    display(HTML("<h4>Event type already exists</h4>"))
    display(JSON(response))
    
except Exception as e:
    print(f"Event {EVENT_TYPE} does not exist" )
    response = client.put_event_type (
        name           = EVENT_TYPE,
        eventVariables = eventVariables,
        labels         = eventLabels,
        entityTypes    = [ENTITY_TYPE])
    display(HTML("<h4>Created event type</h4>"))
    display(JSON(response))

<IPython.core.display.JSON object>
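
The entity type and event type cells above share the same "look it up, create it if the lookup fails" pattern. A minimal sketch of that pattern as a reusable function follows. `get_or_create` is a hypothetical helper; by default it catches any exception, as the cells above do, but you would likely want to pass a narrower exception type for the "not found" case.

```python
def get_or_create(get_fn, create_fn, not_found=(Exception,)):
    """Return (response, created): try get_fn first, fall back to create_fn.

    not_found is the tuple of exception types that mean "resource is missing";
    any other exception propagates to the caller.
    """
    try:
        return get_fn(), False
    except not_found:
        return create_fn(), True


# Example with the notebook's client and names (requires AWS credentials):
# response, created = get_or_create(
#     lambda: client.get_entity_types(name=ENTITY_TYPE),
#     lambda: client.put_entity_type(name=ENTITY_TYPE, description=ENTITY_DESC),
# )
```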

### 6. Conclusion <a id="conclusion"></a>
---
<a href="#overview">overview</a>

So far, we have created labels, variables, an entity type, and an event type. In the next notebook, we will use these resources to create and train an Amazon Fraud Detector model, deploy it, and run predictions with it.