{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Setting up an Amazon Fraud Detector model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment the last line to install s3fs, which is required to read CSV files from S3 directly into a Pandas dataframe\n", "# Once installed, please restart the Notebook Kernel (Kernel > Restart Kernel) before proceeding\n", "\n", "#%pip install s3fs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview \n", "\n", "* [Notebook 1: Data Preparation, Process, and Store Features](./1-data-analysis-prep.ipynb)\n", "* **[Notebook 2: Amazon Fraud Detector Model Setup](./2-afd-model-setup.ipynb)**\n", "  * **[Introduction](#intro)**\n", "  * **[Setup Notebook](#setup)**\n", "  * **[Set AFD Entity type, event type, and Detector names](#entity)**\n", "  * **[Profile Your Dataset](#profile)**\n", "  * **[Create Labels, Variables, Entity and Event Types](#labels)**\n", "  * **[Conclusion](#conclusion)**\n", "* [Notebook 3: Model training, deployment, real-time and batch inference](./3-afd-model-train-deploy.ipynb)\n", "* [Notebook 4: Create an end-to-end pipeline](./4-afd-pipeline.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Introduction \n", "___\n", "overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Amazon Fraud Detector is a fully managed service that makes it easy to identify potentially fraudulent online activities such as online payment fraud and the creation of fake accounts. Fraud Detector capitalizes on the latest advances in machine learning (ML) and 20 years of fraud detection expertise from AWS and Amazon.com to automatically identify potentially fraudulent activity so you can catch more fraud faster.\n", "\n", "In this notebook, we'll use the Amazon Fraud Detector API to define an entity and event of interest and use CSV data stored in S3 to train a model. 
Next, we'll derive some rules and create a \"detector\" by combining our entity, event, model, and rules into a single endpoint. Finally, we'll apply the detector to a sample of our data to identify potentially fraudulent events.\n", "\n", "After running this notebook you should be able to:\n", "\n", "* Define an Entity and Event\n", "* Create a Detector\n", "* Train a Machine Learning (ML) Model\n", "* Author Rules to identify potential fraud based on the model's score\n", "* Apply the Detector's \"predict\" function, to generate a model score and rule outcomes on data\n", "\n", "If you would like to know more, please check out [Fraud Detector's Documentation](https://docs.aws.amazon.com/frauddetector/latest/ug/what-is-frauddetector.html).\n", "\n", "To create models within Amazon Fraud Detector, you must provide data for training. This data has input features (defined by variables) and output labels (defined by labels in the Amazon Fraud Detector service). Additionally, you define events based on the type of entities sending the data for predictions. The following diagram shows the sequence of component creation followed in this tutorial.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### IAM Permissions\n", "---\n", "\n", "To use Amazon Fraud Detector, you have to set up permissions that allow access to the Amazon Fraud Detector console and API operations. You also have to allow Amazon Fraud Detector to perform tasks on your behalf and to access resources that you own. 
\n", "\n", "The following policies provide the required permission to use Amazon Fraud Detector:\n", "\n", "* `AmazonFraudDetectorFullAccessPolicy`\n", " Allows you to perform the following actions:\n", " - Access all Amazon Fraud Detector resources \n", " - List and describe all model endpoints in Amazon SageMaker \n", " - List all IAM roles in the account \n", " - List all Amazon S3 buckets \n", " - Allow IAM Pass Role to pass a role to Amazon Fraud Detector \n", "\n", "\n", "* `AmazonS3FullAccess`\n", " Allows full access to Amazon S3. This is required to upload training files to S3.\n", "\n", "In this case we will assign `AmazonFraudDetectorFullAccessPolicy` and `AmazonS3FullAccess` policies to the SageMaker Execution Role." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plan\n", "\n", "#### Plan a Fraud Detector\n", "---\n", "\n", "A Detector contains the event, model(s) and rule(s) detection logic for a particular type of fraud that you want to detect. We'll use the following 7 step process to plan a Fraud Detector:\n", "\n", "\n", "\n", "* Setup your notebook\n", " - Name the major components `entity`, `entity type`, `model`, `detector`\n", " - Get IAM role ARN\n", " - S3 Bucket with your training data CSV File\n", "* Read and Profile your Data\n", " - This will give you an idea of what your dataset contains\n", " - This will also identify the variables and labels that will need to be created to define your event\n", "* Create event variables and labels\n", " - This will create the variables and labels in fraud detector\n", "* Define your Entity and Event Type\n", " - What is the activity that you are detecting? That's likely your Event Type (e.g., account_registration)\n", " - Who is performing this activity? 
That's likely your Entity (e.g., customer)\n", "* Create and Train your Model\n", " - Model training takes roughly 45-60 minutes\n", " - Promote your model once training is complete\n", "* Create Detector, generate Rules and assemble your Detector\n", " - Create your detector\n", " - Create rules based on your model scores\n", " - Define outcomes (e.g., fraud, investigate and approve)\n", " - Assemble your detector by adding your model and rules to it\n", "* Test your Detector\n", " - Interactively call predict on a handful of records\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Set up your Notebook \n", "---\n", "overview\n", "\n", "1. Name the major components of Fraud Detector\n", "2. Get IAM role ARN \n", "3. S3 Bucket with your training data CSV File\n", "\n", "You can then interactively execute the code cells in the notebook; apart from the S3 prefix placeholder, there is no need to change anything unless you want to. \n", "\n", "

💡 Fraud Detector Components

\n", "EVENT_TYPE is a business activity that you want evaluated for fraud risk. ENTITY_TYPE represents the \"what or who\" that is performing the event you want to evaluate. MODEL_NAME is the name of your supervised machine learning model that Fraud Detector trains on your behalf. DETECTOR_NAME is the name of the detector that contains the detection logic (model and rules) that you apply to events that you want to evaluate for fraud.\n", "\n", "
\n", "\n", "We will import some necessary libraries that will be used throughout this notebook." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# display and HTML are imported from IPython.display; importing them from IPython.core.display is deprecated\n", "from IPython.display import display, HTML, clear_output, JSON\n", "\n", "display(HTML(\"\"))\n", "# ------------------------------------------------------------------\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import os\n", "import sys\n", "import time\n", "import json\n", "import uuid \n", "from datetime import datetime\n", "import boto3\n", "import sagemaker\n", "\n", "pd.set_option('display.max_rows', 500)\n", "pd.set_option('display.max_columns', 500)\n", "pd.set_option('display.width', 1000)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set region, boto3 and SageMaker SDK variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will initialize Fraud Detector, S3, and SageMaker Boto3 client objects." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using AWS Region: us-east-2\n" ] } ], "source": [ "#You can change this to a region of your choice\n", "region = sagemaker.Session().boto_region_name\n", "print(\"Using AWS Region: {}\".format(region))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "boto3.setup_default_session(region_name=region)\n", "\n", "boto_session = boto3.Session(region_name=region)\n", "\n", "# -- initialize S3 Client\n", "s3_client = boto3.client('s3', region_name=region)\n", "\n", "# -- initialize the AFD client \n", "client = boto3.client('frauddetector')\n", "\n", "sagemaker_boto_client = boto_session.client('sagemaker')\n", "\n", "sagemaker_session = sagemaker.session.Session(\n", " boto_session=boto_session,\n", " sagemaker_client=sagemaker_boto_client)\n", "\n", "# -- suffix is appended to detector and model name for uniqueness \n", "sufx = datetime.now().strftime(\"%Y%m%d\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will get the SageMaker Execution Role " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SageMaker Role: AmazonSageMaker-ExecutionRole-20201030T135016\n", "Stored 'ARN_ROLE' (str)\n" ] } ], "source": [ "print('SageMaker Role:', sagemaker.get_execution_role().split('/')[-1])\n", "\n", "ARN_ROLE = sagemaker.get_execution_role()\n", "%store ARN_ROLE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set S3 training data file location\n", "\n", "We will now initialize a variable with the file path of our training data. If you have stepped through and executed the [1-data-analysis-prep.ipynb](./1-data-analysis-prep.ipynb) notebook, you should have your final training data set CSV uploaded into a location in S3. 
If not, you may use the training dataset `afd_training_data.csv` that is included in the `data/` directory. \n", "\n", "Executing the subsequent code cells will initialize the S3-related variables and, if you have executed the previous notebook ([1-data-analysis-prep.ipynb](./1-data-analysis-prep.ipynb)), pull the variables it stored in Jupyter's local cache. Once you replace `YOUR_PREFIX_GOES_HERE` with your S3 prefix, the code checks whether the file exists at that S3 path and, if it does not, uploads the provided training data to that path in the default S3 bucket." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "S3_FILE = \"afd_training_data.csv\"\n", "training_prefix = \"training_data\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
✅ Replace YOUR_PREFIX_GOES_HERE\n", "with your S3 bucket prefix where your training data CSV file resides in the code cell below.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2021-05-13 19:30:19.717478: Using default bucket: sagemaker-us-east-2-965425568475... Initialized folder sagemaker-us-east-2-965425568475/amazon-fraud-detector/training_data\n" ] } ], "source": [ "from datetime import datetime\n", "current_time = datetime.now()\n", "\n", "if 'afd_bucket' in globals():\n", " %store -r afd_bucket\n", " %store -r afd_prefix\n", " S3_BUCKET = afd_bucket\n", " print(f'{current_time}: Using default bucket: {S3_BUCKET}... Initialized folder {S3_BUCKET}/{afd_prefix}/{training_prefix}') \n", "else:\n", " print(f'{current_time}: Bucket name not in local cache initializing')\n", " # initialize with sagemaker default bucket and a prefix where your training data is located\n", " afd_bucket = sagemaker_session.default_bucket()\n", " afd_prefix = YOUR_PREFIX_GOES_HERE # ---> Add your prefix here\n", " %store afd_bucket\n", " %store afd_prefix\n", " S3_BUCKET = afd_bucket\n", " print(f'{current_time}: Bucket {S3_BUCKET}... Initialized folder {afd_prefix}/{training_prefix}')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2021-05-13 19:30:26.497863: File amazon-fraud-detector/training_data/afd_training_data.csv found\n", "Stored 'S3_FILE_LOC' (str)\n", "2021-05-13 19:30:26.497863: S3 Location initalized ... 
s3://sagemaker-us-east-2-965425568475/amazon-fraud-detector/training_data/afd_training_data.csv\n" ] } ], "source": [ "current_time = datetime.now()\n", "\n", "try:\n", "    # Check if the file exists in the said S3 bucket/prefix location\n", "    objects_in_bucket = s3_client.list_objects(Bucket=S3_BUCKET, Prefix=f\"{afd_prefix}/{training_prefix}/{S3_FILE}\")\n", "    # list_objects omits 'Contents' when nothing matches, so this lookup raises KeyError if the file is missing\n", "    print(f\"{current_time}: File {objects_in_bucket['Contents'][0]['Key']} found\")\n", "    S3_FILE_LOC = f\"s3://{S3_BUCKET}/{afd_prefix}/{training_prefix}/{S3_FILE}\"\n", "    %store S3_FILE_LOC\n", "    print(f\"{current_time}: S3 Location initialized ... s3://{S3_BUCKET}/{afd_prefix}/{training_prefix}/{S3_FILE}\")\n", "\n", "except Exception as e:\n", "    print(f\"{current_time}: File {afd_prefix}/{training_prefix}/{S3_FILE} not found; uploading from local...\") \n", "    print(f\"{current_time}: Uploading File {afd_prefix}/{training_prefix}/{S3_FILE} to s3://{S3_BUCKET} ...\") \n", "    # Upload the training data from local to the S3 bucket\n", "    s3_client.upload_file(Filename=f'data/{S3_FILE}', Bucket=S3_BUCKET, Key=f'{afd_prefix}/{training_prefix}/{S3_FILE}')\n", "    S3_FILE_LOC = f\"s3://{S3_BUCKET}/{afd_prefix}/{training_prefix}/{S3_FILE}\"\n", "    %store S3_FILE_LOC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. 
Set AFD Entity type, event type, and Detector names \n", "---\n", "overview" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stored 'ENTITY_TYPE' (str)\n", "Stored 'ENTITY_DESC' (str)\n", "Stored 'EVENT_TYPE' (str)\n", "Stored 'EVENT_DESC' (str)\n", "Stored 'MODEL_NAME' (str)\n", "Stored 'MODEL_DESC' (str)\n", "Stored 'DETECTOR_NAME' (str)\n", "Stored 'DETECTOR_DESC' (str)\n" ] } ], "source": [ "\n", "ENTITY_TYPE = \"afd_demo_entity_{0}\".format(sufx) \n", "ENTITY_DESC = \"AFD Entity: {0}\".format(sufx) \n", "\n", "EVENT_TYPE = \"afd_demo_event_{0}\".format(sufx) \n", "EVENT_DESC = \"AFD Event Type: {0}\".format(sufx) \n", "\n", "MODEL_NAME = \"afd_demo_model_{0}\".format(sufx) \n", "MODEL_DESC = \"AFD model trained on: {0}\".format(sufx) \n", "\n", "DETECTOR_NAME = \"afd_detector_{0}\".format(sufx) \n", "DETECTOR_DESC = \"Detects synthetic fraud events created: {0}\".format(sufx) \n", "\n", "# store name in cache\n", "%store ENTITY_TYPE\n", "%store ENTITY_DESC\n", "%store EVENT_TYPE\n", "%store EVENT_DESC\n", "%store MODEL_NAME\n", "%store MODEL_DESC\n", "%store DETECTOR_NAME\n", "%store DETECTOR_DESC\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Profile Your Dataset \n", "-----\n", "overview\n", "\n", "A small profiler utility function `summary_stats()` is defined in the `data_profiler.py` file. 
The function will: \n", "* Profile your data, creating descriptive statistics \n", "* Perform basic data quality checks (nulls, unique variables, etc.), and \n", "* return summary statistics and the EVENT and MODEL schemas used to define your EVENT_TYPE and TRAIN your MODEL.\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "import sys \n", "import s3fs # This is required to read CSV data directly from S3 into Pandas dataframe\n", "\n", "# Import profiler function\n", "sys.path.insert(0, './')\n", "from data_profiler import summary_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
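The data quality checks listed above (null counts and the share of unique values per column) can be approximated in a few lines of plain Python. This is a minimal sketch, not the code in `data_profiler.py`: the function name `profile_columns`, the 0.9 threshold, the warning strings, and the toy rows are all assumptions made for illustration, and the real profiler may treat special columns such as the event timestamp differently.

```python
def profile_columns(rows, unique_threshold=0.9):
    # rows: list of dicts, one per event; returns summary stats per column.
    stats = {}
    count = len(rows)
    for column in rows[0]:
        values = [row[column] for row in rows]
        non_null = [v for v in values if v is not None]
        nunique = len(set(non_null))
        nunique_pct = round(nunique / count, 4)
        stats[column] = {
            'count': count,
            'nunique': nunique,
            'null': count - len(non_null),
            'nunique_pct': nunique_pct,
            # Assumed rule: flag columns whose values are almost all distinct.
            'warning': 'EXCLUDE' if nunique_pct > unique_threshold else 'NO WARNING',
        }
    return stats


# Toy data, invented for this sketch.
rows = [
    {'customer_state': 'OH', 'phone_number': '555-0100', 'email_address': 'a@example.com'},
    {'customer_state': 'OH', 'phone_number': '555-0101', 'email_address': None},
    {'customer_state': 'TX', 'phone_number': '555-0102', 'email_address': 'a@example.com'},
    {'customer_state': 'OH', 'phone_number': '555-0103', 'email_address': 'b@example.com'},
]
summary = profile_columns(rows)
for name, column_stats in summary.items():
    print(name, column_stats)
```

In this toy data every phone number is distinct, so that column is the only one flagged, while the email column reports one null.
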
💡 Note: \n", "\n", "If you make changes to the data_profiler.py script after you execute the code cell above, please make sure to restart the Kernel (Kernel > Restart Kernel) and run the notebook again.\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

Summary Stats

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feature_namedtypecountnuniquenullnot_nullnull_pctnunique_pctfeature_typefeature_warning
0EVENT_TIMESTAMPobject1001339063401001330.00.9051EVENT_TIMESTAMPNO WARNING
1EVENT_LABELobject100133201001330.00.0000TARGETNO WARNING
2ip_addressobject100133380101001330.00.0380IP_ADDRESSNO WARNING
3email_addressobject100133329601001330.00.0329EMAIL_ADDRESSNO WARNING
4user_agentobject100133286701001330.00.0286CATEGORYNO WARNING
5customer_nameobject1001337117801001330.00.7108CATEGORYNO WARNING
6phone_numberobject1001339937101001330.00.9924CATEGORYEXCLUDE, GT 90% UNIQUE
7customer_cityobject100133343001001330.00.0343CATEGORYNO WARNING
8customer_postalfloat64100133199301001330.00.0199NUMERICNO WARNING
9customer_stateobject1001335101001330.00.0005CATEGORYNO WARNING
10customer_addressobject1001339998501001330.00.9985CATEGORYEXCLUDE, GT 90% UNIQUE
\n", "
" ], "text/plain": [ " feature_name dtype count nunique null not_null null_pct nunique_pct feature_type feature_warning\n", "0 EVENT_TIMESTAMP object 100133 90634 0 100133 0.0 0.9051 EVENT_TIMESTAMP NO WARNING\n", "1 EVENT_LABEL object 100133 2 0 100133 0.0 0.0000 TARGET NO WARNING\n", "2 ip_address object 100133 3801 0 100133 0.0 0.0380 IP_ADDRESS NO WARNING\n", "3 email_address object 100133 3296 0 100133 0.0 0.0329 EMAIL_ADDRESS NO WARNING\n", "4 user_agent object 100133 2867 0 100133 0.0 0.0286 CATEGORY NO WARNING\n", "5 customer_name object 100133 71178 0 100133 0.0 0.7108 CATEGORY NO WARNING\n", "6 phone_number object 100133 99371 0 100133 0.0 0.9924 CATEGORY EXCLUDE, GT 90% UNIQUE\n", "7 customer_city object 100133 3430 0 100133 0.0 0.0343 CATEGORY NO WARNING\n", "8 customer_postal float64 100133 1993 0 100133 0.0 0.0199 NUMERIC NO WARNING\n", "9 customer_state object 100133 51 0 100133 0.0 0.0005 CATEGORY NO WARNING\n", "10 customer_address object 100133 99985 0 100133 0.0 0.9985 CATEGORY EXCLUDE, GT 90% UNIQUE" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Event Variables

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

These are the available features in the data set for the AFD model training

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/json": [ "ip_address", "email_address", "user_agent", "customer_name", "phone_number", "customer_city", "customer_postal", "customer_state", "customer_address" ], "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" }, { "data": { "text/html": [ "

Event Labels

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

We have two types of events - Fraud events and legitimate events

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/json": [ "legit", "fraud" ], "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" }, { "data": { "text/html": [ "

Training Data Schema

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

Training data schema is required for creating and training the model. Refer to documentation

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/json": { "labelSchema": { "labelMapper": { "FRAUD": [ "fraud" ], "LEGIT": [ "legit" ] } }, "modelVariables": [ "ip_address", "email_address", "user_agent", "customer_name", "phone_number", "customer_city", "customer_postal", "customer_state", "customer_address" ] }, "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Stored 'trainingDataSchema' (dict)\n", "Stored 'eventVariables' (list)\n" ] } ], "source": [ "# Load the Training data set in a dataframe\n", "df = pd.read_csv(S3_FILE_LOC)\n", "df.describe()\n", "\n", "# ------\n", "# Alternate: If the code above fails to execute then comment the above two lines \n", "# and uncomment the lines below and execute this cell again\n", "\n", "# fs = s3fs.S3FileSystem(anon=False)\n", "# with fs.open(S3_FILE_LOC) as f:\n", "# df = pd.read_csv(f)\n", "\n", "# -----\n", "\n", "df_stats, trainingDataSchema, eventVariables, eventLabels = summary_stats(df)\n", "%store trainingDataSchema\n", "%store eventVariables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Create Labels, Variables, Entity and Event Types \n", "-----\n", "overview\n", "\n", "1. **Events and Event Types**\n", "\n", " An event is a business activity that is evaluated for fraud risk. With Amazon Fraud Detector, you generate fraud predictions for events. An event type defines the structure for an event sent to Amazon Fraud Detector. This includes the variables sent as part of the event, the entity performing the event (such as a customer), and the labels that classify the event. Example event types include online payment transactions, account registrations, and authentication.\n", "\n", "2. **Entity and Entity Type**\n", "\n", " An entity represents who is performing the event. 
As part of a fraud prediction, you can pass the entity ID to indicate the specific entity who performed the event. An entity type classifies the entity. Example classifications include customer, merchant, or account.\n", " \n", "Before we can create Event and Entity types, we must create Labels and Variables.\n", "\n", "3. **Label**\n", "\n", " A label classifies an event as fraudulent or legitimate. Labels are used to train supervised machine learning models in Amazon Fraud Detector.\n", " \n", "4. **Variable**\n", "\n", " A variable represents a data element associated with an event that you want to use in a fraud prediction. Variables can either be sent with an event as part of a fraud prediction or derived, such as the output of an Amazon Fraud Detector model or Amazon SageMaker model. In this case we will create variables based on the input features in our training dataset and their corresponding datatypes.\n", "\n", "For more information, refer to the [documentation](https://docs.aws.amazon.com/frauddetector/latest/ug/frauddetector-ml-concepts.html). \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.1 Create Label and Variables\n", "---\n", "\n", "We are going to use the [PutLabel](https://docs.aws.amazon.com/frauddetector/latest/api/API_PutLabel.html) API to create labels for the Fraud Detector model. A label classifies an event as fraudulent or legitimate. Labels are associated with event types and used to train supervised machine learning models in Amazon Fraud Detector. 
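
The label step can be sketched as a small helper that accepts any client exposing `put_label(name=..., description=...)`, which is how the boto3 `frauddetector` client is called in this notebook. The helper name `ensure_labels` and the `RecordingClient` stand-in are assumptions made for illustration so that the sketch runs without AWS credentials; neither is part of the Fraud Detector API.

```python
def ensure_labels(client, label_names):
    # PutLabel creates or updates a label, so calling this repeatedly is safe.
    responses = {}
    for name in label_names:
        responses[name] = client.put_label(name=name, description=name)
    return responses


class RecordingClient:
    # Hypothetical stand-in for boto3.client('frauddetector'); it only records calls.
    def __init__(self):
        self.calls = []

    def put_label(self, name, description):
        self.calls.append((name, description))
        return {'ResponseMetadata': {'HTTPStatusCode': 200}}


stub = RecordingClient()
result = ensure_labels(stub, ['fraud', 'legit'])
print(stub.calls)
```

With real credentials, passing the notebook's `client` object instead of the stub would issue the same two `put_label` calls.
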
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Labels have been created\n" ] }, { "data": { "application/json": { "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "2", "content-type": "application/x-amz-json-1.1", "date": "Thu, 13 May 2021 00:16:15 GMT", "x-amzn-requestid": "8551fb50-0018-4be3-a712-c74032858678" }, "HTTPStatusCode": 200, "RequestId": "8551fb50-0018-4be3-a712-c74032858678", "RetryAttempts": 0 } }, "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" }, { "data": { "application/json": { "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "2", "content-type": "application/x-amz-json-1.1", "date": "Thu, 13 May 2021 00:16:15 GMT", "x-amzn-requestid": "64ccabe1-c7ad-4733-a224-f9ad6f569630" }, "HTTPStatusCode": 200, "RequestId": "64ccabe1-c7ad-4733-a224-f9ad6f569630", "RetryAttempts": 0 } }, "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" } ], "source": [ "try:\n", " fraud_lbl = client.put_label(\n", " name = \"fraud\",\n", " description = 'fraud')\n", " \n", " legit_lbl = client.put_label(\n", " name = \"legit\",\n", " description = 'legit')\n", " \n", " print(f\"Labels have been created\")\n", " display(JSON(fraud_lbl))\n", " display(JSON(legit_lbl))\n", "except Exception as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a small helper function which will look through our data set stats and create the variables required for AFD Model. This function uses the [CreateVariable](https://docs.aws.amazon.com/frauddetector/latest/api/API_CreateVariable.html) API." 
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "def create_variables(df_stats, MODEL_NAME):\n", "    \"\"\"\n", "    Checks whether each model input variable exists in Fraud Detector and,\n", "    if not, creates it. Returns the list of model input variables.\n", "    \n", "    Arguments: \n", "    df_stats   -- summary statistics dataframe from summary_stats(), with feature_name and feature_type columns\n", "    MODEL_NAME -- name of the model (not used inside this function)\n", "    \n", "    Returns:\n", "    variable_list -- a list of variable dictionaries \n", "    \n", "    \"\"\"\n", "    # orient must be spelled out as \"records\"; the abbreviation \"record\" is rejected by newer pandas releases\n", "    enrichment_features = df_stats.loc[(df_stats['feature_type'].isin(['IP_ADDRESS', 'EMAIL_ADDRESS']))].to_dict(orient=\"records\")\n", "    numeric_features = df_stats.loc[(df_stats['feature_type'].isin(['NUMERIC']))]['feature_name'].to_dict()\n", "    categorical_features = df_stats.loc[(df_stats['feature_type'].isin(['CATEGORY']))]['feature_name'].to_dict()\n", "    \n", "    variable_list = []\n", "    # -- first do the enrichment features\n", "    for feature in enrichment_features: \n", "        variable_list.append( {'name' : feature['feature_name']})\n", "        try:\n", "            resp = client.get_variables(name=feature['feature_name'])\n", "        except Exception:\n", "            # the lookup failed, so treat the variable as missing and create it\n", "            print(\"Creating variable: {0}\".format(feature['feature_name']))\n", "            resp = client.create_variable(\n", "                name = feature['feature_name'],\n", "                dataType = 'STRING',\n", "                dataSource ='EVENT',\n", "                defaultValue = '', \n", "                description = feature['feature_name'],\n", "                variableType = feature['feature_type'] )\n", "    \n", "    \n", "    # -- check and update the numeric features \n", "    for feature in numeric_features: \n", "        variable_list.append( {'name' : numeric_features[feature]})\n", "        try:\n", "            resp = client.get_variables(name=numeric_features[feature])\n", "        except Exception:\n", "            print(\"Creating variable: {0}\".format(numeric_features[feature]))\n", "            resp = client.create_variable(\n", "                name = numeric_features[feature],\n", "                dataType = 'FLOAT',\n", "                dataSource ='EVENT',\n", "                defaultValue = '0.0', \n", "                description = numeric_features[feature],\n", "                variableType = 'NUMERIC' )\n", "    \n", "    # -- check and update the categorical features \n", "    for feature in categorical_features: \n", "        variable_list.append( {'name' : categorical_features[feature]})\n", "        try:\n", "            resp = client.get_variables(name=categorical_features[feature])\n", "        except Exception:\n", "            print(\"Creating variable: {0}\".format(categorical_features[feature]))\n", "            resp = client.create_variable(\n", "                name = categorical_features[feature],\n", "                dataType = 'STRING',\n", "                dataSource ='EVENT',\n", "                defaultValue = '', \n", "                description = categorical_features[feature],\n", "                variableType = 'CATEGORICAL' )\n", "    \n", "    return variable_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Call the function to create the variables." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

Model variable dict

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/json": [ { "name": "ip_address" }, { "name": "email_address" }, { "name": "customer_postal" }, { "name": "user_agent" }, { "name": "customer_name" }, { "name": "phone_number" }, { "name": "customer_city" }, { "name": "customer_state" }, { "name": "customer_address" } ], "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" } ], "source": [ "# Call the create_variables function\n", "model_variables = create_variables(df_stats, MODEL_NAME)\n", "\n", "# Display output\n", "display(HTML(\"

Model variable dict

\"))\n", "display(JSON(model_variables))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.2 Create Entity and Event Types\n", "---\n", "\n", "We will use the [PutEntityType](https://docs.aws.amazon.com/frauddetector/latest/api/API_PutEntityType.html) API to create the entity type. The code first checks whether the entity type exists and, if it does not, creates it." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

Entity already exists

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/json": { "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "266", "content-type": "application/x-amz-json-1.1", "date": "Thu, 13 May 2021 00:47:11 GMT", "x-amzn-requestid": "2a4ce9c5-a055-4b22-a54c-e3a65d1a64e6" }, "HTTPStatusCode": 200, "RequestId": "2a4ce9c5-a055-4b22-a54c-e3a65d1a64e6", "RetryAttempts": 0 }, "entityTypes": [ { "arn": "arn:aws:frauddetector:us-east-2:965425568475:entity-type/afd_demo_entity_20210512", "createdTime": "2021-05-13T00:40:48.794Z", "description": "AFD Entity: 20210512", "lastUpdatedTime": "2021-05-13T00:46:50.186Z", "name": "afd_demo_entity_20210512" } ] }, "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" } ], "source": [ "try:\n", " response = client.get_entity_types( name = ENTITY_TYPE )\n", " \n", " display(HTML(\"

Entity already exists

\"))\n", " display(JSON(response))\n", " \n", "except Exception as e:\n", " print(f\"Entity {ENTITY_TYPE} does not exist\" )\n", " response = client.put_entity_type(\n", " name = ENTITY_TYPE,\n", " description = ENTITY_DESC\n", " )\n", " display(HTML(\"

Created entity

\"))\n", " display(JSON(response))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will use the [PutEventType](https://docs.aws.amazon.com/frauddetector/latest/api/API_PutEventType.html) API to create the event type. The code first checks whether the event type exists and, if it does not, creates it." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

Event type already exists

" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/json": { "ResponseMetadata": { "HTTPHeaders": { "connection": "keep-alive", "content-length": "485", "content-type": "application/x-amz-json-1.1", "date": "Thu, 13 May 2021 00:48:33 GMT", "x-amzn-requestid": "8e64029a-8a00-459d-ab15-b626160ff32d" }, "HTTPStatusCode": 200, "RequestId": "8e64029a-8a00-459d-ab15-b626160ff32d", "RetryAttempts": 0 }, "eventTypes": [ { "arn": "arn:aws:frauddetector:us-east-2:965425568475:event-type/afd_demo_event_20210512", "createdTime": "2021-05-13T00:40:51.736Z", "entityTypes": [ "afd_demo_entity_20210512" ], "eventVariables": [ "ip_address", "email_address", "user_agent", "customer_name", "phone_number", "customer_city", "customer_postal", "customer_state", "customer_address" ], "labels": [ "legit", "fraud" ], "lastUpdatedTime": "2021-05-13T00:40:51.736Z", "name": "afd_demo_event_20210512" } ] }, "text/plain": [ "" ] }, "metadata": { "application/json": { "expanded": false, "root": "root" } }, "output_type": "display_data" } ], "source": [ "try:\n", " response = client.get_event_types( name = EVENT_TYPE )\n", " \n", " display(HTML(\"

Event type already exists

\"))\n", " display(JSON(response))\n", " \n", "except Exception as e:\n", " print(f\"Event {EVENT_TYPE} does not exist\" )\n", " response = client.put_event_type (\n", " name = EVENT_TYPE,\n", " eventVariables = eventVariables,\n", " labels = eventLabels,\n", " entityTypes = [ENTITY_TYPE])\n", " display(HTML(\"

Created event type

\"))\n", " display(JSON(response))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Conclusion \n", "---\n", "overview\n", "\n", "So far, we have created the labels, variables, entity type, and event type. In the next notebook, we will use these resources to create and train an Amazon Fraud Detector model, deploy it, and run predictions with it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.t3.medium", "kernelspec": { "display_name": "Python 3 (Data Science)", "language": "python", "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/datascience-1.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }