{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon Fraud Detector from End to End for Account Takeover Insights \n", "### Un-supervised fraud detection \n", "-------\n", "- [Introduction](#Introduction)\n", "- [Data preparation](#Data_preparation)\n", "- [Set up AWS credentials & permissions](#Set_up_AWS_credentials_permissions)\n", "- [Plan](#Plan)\n", "\n", "\n", "## Introduction\n", "-------\n", "\n", "Amazon Fraud Detector (AFD) is a fully managed service that makes it easy to identify potentially fraudulent online activities such as online payment fraud, creation of fake accounts, and account takeovers. Fraud Detector capitalizes on the latest advances in machine learning (ML) and 20 years of fraud detection expertise from AWS and Amazon.com to automatically identify potentially fraudulent activities so you can catch more fraud faster. \n", "\n", "In this notebook, we'll use the Amazon Fraud Detector API to define an entity and an event of interest and send events to AFD from a CSV file stored in S3. Next, we'll train a model with the Account Takeover Insights template using the events we have sent. After that, we'll derive some rules and create a \"detector\" by combining our entity, event, model, and rules into a single endpoint. Finally, we'll apply the detector to a sample of our data to identify potentially fraudulent events.\n", "\n", "After running this notebook you should be able to: \n", "\n", "- Define an Entity and an Event\n", "- Send Events to AFD\n", "- Train a Machine Learning (ML) Model\n", "- Author Rules to identify potential fraud based on the model's score\n", "- Create a Fraud Detector\n", "- Apply the Detector's \"predict\" function to generate a model score and rule outcomes on data \n", "\n", "If you would like to know more, please check out [Fraud Detector's Documentation](https://docs.aws.amazon.com/frauddetector/). 
\n", "\n", "\n", "## Data preparation \n", "------\n", "\n", "Before you train a Transaction Fraud Insights model, ensure that you have stored at least 10,000 records in your training dataset within Amazon Fraud Detector. We recommend that you collect at least 3-6 weeks of historic data with at least 1,500 unique entities. The Transaction Fraud Insights model calculates aggregates such as account age and transaction counts from each account’s history. Providing the full history of the accounts therefore helps the model calculate these aggregates correctly and capture fraud patterns.\n", "\n", "In addition to the event variables, the training dataset must contain the following headers:\n", "* ENTITY_TYPE - Who is performing the activity, e.g. customer. Currently, AFD supports only one ENTITY_TYPE per EVENT_TYPE, so this column should contain a single value. \n", "* ENTITY_ID - An identifier for who is performing the activity, e.g. customer id\n", "* EVENT_ID - An identifier for the event, e.g. order id\n", "* EVENT_TIMESTAMP - The timestamp of when the event occurred. The timestamp must be in ISO 8601 format in UTC.\n", "* EVENT_LABEL - Classifies the event as fraudulent or legitimate. The values in the column must correspond to the values defined in the event type.\n", "* LABEL_TIMESTAMP - Timestamp of when the label was last updated. LABEL_TIMESTAMP is required if EVENT_LABEL is included. If you do not have that data, you can duplicate the EVENT_TIMESTAMP column and rename it as LABEL_TIMESTAMP.\n", "\n", "\n", "\n", "## Set up AWS credentials & permissions \n", "----\n", "\n", "https://docs.aws.amazon.com/frauddetector/latest/ug/set-up.html\n", "\n", "To use Amazon Fraud Detector, you have to set up permissions that allow access to the Amazon Fraud Detector console and API operations. You also have to allow Amazon Fraud Detector to perform tasks on your behalf and to access resources that you own. 
We recommend creating an AWS Identity and Access Management (IAM) user with access restricted to Amazon Fraud Detector operations and required permissions. You can add other permissions as needed.\n", "\n", "The following policies provide the required permissions to use Amazon Fraud Detector. If you are using a SageMaker notebook instance, add the following two policies to the instance's IAM role and restart your kernel:\n", "\n", "- *AmazonFraudDetectorFullAccessPolicy* \n", " Allows you to perform the following actions: \n", " - Access all Amazon Fraud Detector resources \n", " - List and describe all model endpoints in Amazon SageMaker \n", " - List all IAM roles in the account \n", " - List all Amazon S3 buckets \n", " - Allow IAM Pass Role to pass a role to Amazon Fraud Detector \n", "\n", "- *AmazonS3FullAccess* \n", " Allows full access to Amazon S3. This is required to upload training files to S3.\n", " \n", " \n", " \n", "## Plan\n", "------\n", "A *Detector* contains the event, model(s) and rule(s) for a particular type of fraud that you want to detect. We'll use the following 8 steps to plan a Fraud Detector:\n", "\n", "1. [Setup notebook](#setup_notebook)
\n", " a. Name the major components: Event, Entity, Model, Detector
\n", " b. Specify the MODEL_TYPE you want to train
\n", " c. Plug in your S3 Bucket and CSV File\n", " \n", "2. [Load and profile your dataset](#load_and_profile_your_data)
\n", " a. This will give you an idea of what your dataset contains
\n", " b. This will also identify the variables and labels that will need to be created to define your event\n", "\n", "3. [Create event variables and labels](#create_event_variables_and_labels)
\n", " a. This will create the variables and labels in fraud detector \n", "\n", "4. [Define your Entity and Event Type](#define_your_entity_and_event_type)
\n", " a. What is the activity that you are detecting? That's likely your Event Type (e.g. transaction)
\n", " b. Who is performing this activity? That's likely your Entity (e.g. customer)\n", "\n", "5. [Send events](#send_events)
\n", " a. This will ingest your events into AFD \n", " \n", "6. [Create and train your model](#create_and_train_your_model)\t\n", " a. Model training takes roughly 45-60 minutes. Once it completes, you need to promote your model
\n", " b. Promote your model\n", "\n", "7. [Create a Fraud Detector, generate Rules and assemble your Detector](#create_detector)
\n", " a. Create your detector
\n", " b. Define outcomes, e.g. fraud, investigate and approve
\n", " c. Create rules based on your model scores
\n", " d. Assemble your detector: combine your rule(s) and model into a \"detector\"\n", "\n", "8. [Make predictions](#make_predictions)
\n", " a. Interactively call predict API on a handful of records " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Current boto3 version: 1.24.36\n" ] } ], "source": [
"# -- check boto3 version; if it is lower than 1.24.36, update it and restart the kernel --\n",
"import boto3\n",
"import os\n",
"\n",
"current_boto3_version = boto3.__version__\n",
"print('Current boto3 version:', current_boto3_version)\n",
"\n",
"# -- compare version numbers numerically; comparing the raw strings is lexicographic --\n",
"if tuple(int(p) for p in current_boto3_version.split('.')[:3]) < (1, 24, 36):\n",
"    print('update boto3...')\n",
"    %pip install 'boto3>=1.24.36'\n",
"    os._exit(0) " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"# -- import packages --\n",
"import boto3\n",
"import time\n",
"import logging\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from multiprocessing import Pool\n",
"from datetime import datetime, date\n",
"from dateutil.relativedelta import relativedelta\n",
"import csv, codecs\n",
"from sklearn.metrics import roc_curve, roc_auc_score, auc\n",
"\n",
"# -- for display --\n",
"from IPython.display import display, HTML\n",
"from IPython.display import clear_output\n",
"display(HTML(\"\"))\n",
"pd.set_option('display.max_rows', 500)\n",
"pd.set_option('display.max_columns', 500)\n",
"pd.set_option('display.width', 1000)\n",
"%matplotlib inline " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [
"# -- initialize the AFD client --\n",
"client = boto3.client('frauddetector')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Setup notebook \n", "-----\n", "\n", "***To get started***\n", "\n", "1. Name the major components of Fraud Detector\n", "2. Specify the MODEL_TYPE you want to train: ACCOUNT_TAKEOVER_INSIGHTS. \n", "3. 
Plug in your S3 Bucket and CSV file path\n", "\n", "Then you can interactively execute the code cells in the notebook; there is no need to change anything unless you want to. \n", "\n", "\n", "
Major Fraud Detector Components \n", " \n", "- **EVENT_TYPE** is a business activity that you want evaluated for fraud risk \n", "- **ENTITY_TYPE** represents the \"what or who\" that is performing the event you want to evaluate\n", "- **MODEL_NAME** is the name of your supervised machine learning model that Fraud Detector trains on your behalf\n", "- **DETECTOR_NAME** is the name of the detector that contains the detection logic (model and rules) that you apply to events that you want to evaluate for fraud\n", "\n", "
\n", "\n", "\n", "Identify the following assets:\n", "\n", "
Bucket, File and ARN Role\n", "\n", "- **S3_BUCKET** is the name of the bucket where your file lives\n", "- **S3_FILE** is the path (object key) of your file within that bucket\n", "- **ARN_ROLE** is the ARN of the IAM role that Fraud Detector uses to access your data in the S3 bucket\n", "\n", "
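
As a rough illustration of how these three values fit together, the sketch below joins the bucket name and object key into the `s3://` form used later for the batch import, and runs a loose shape check on the role ARN. The helper names `build_s3_uri` and `looks_like_role_arn` are hypothetical (they are not part of boto3 or Amazon Fraud Detector), and the ARN check is only a simple pattern test, not real validation of the role.

```python
import re

def build_s3_uri(bucket, key):
    # Join a bucket name and an object key into an s3:// URI.
    return 's3://{}/{}'.format(bucket.strip('/'), key.lstrip('/'))

def looks_like_role_arn(arn):
    # Loose shape check only: arn:<partition>:iam::<12-digit account>:role/<name>
    pattern = r'^arn:aws[a-z-]*:iam::[0-9]{12}:role/.+$'
    return re.match(pattern, arn) is not None

# Placeholder values for illustration only.
print(build_s3_uri('my-bucket', 'data/logins.csv'))
print(looks_like_role_arn('arn:aws:iam::123456789012:role/afd-data-access'))
```

A check like this only catches obviously unfilled placeholders such as `your-arn-role`; whether the role actually grants access to the bucket can only be confirmed by AWS.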
\n", "\n", "\n", "_**Note**: To use Amazon Fraud Detector, you have to set up permissions that allow access to the Amazon Fraud Detector console and API operations. You also have to allow Amazon Fraud Detector to perform tasks on your behalf and to access resources that you own. We recommend creating an AWS Identity and Access Management (IAM) user with access restricted to Amazon Fraud Detector operations and required permissions. You can add other permissions as needed. See \"Create an IAM User and Assign Required Permissions\" in the user's guide: https://docs.aws.amazon.com/frauddetector/latest/ug/set-up.html_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Update the following section: define the event, model, and detector names and the data location\n", "
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# -- this is all you need to fill out, once complete simply interactively run each code cell -- \n", "EVENT_TYPE = \"your_event_name\"\n", "EVENT_DESC = \"your event description\"\n", "\n", "MODEL_NAME = \"your_model_name\"\n", "MODEL_DESC = \"your model description\"\n", "\n", "DETECTOR_NAME = \"your_detector_name\" \n", "DETECTOR_DESC = \"your detector description\"\n", "\n", "MODEL_TYPE = \"ACCOUNT_TAKEOVER_INSIGHTS\" \n", "\n", "S3_BUCKET = \"your-s3-bucket-with-data\" \n", "S3_FILE = \"path-to-your-data-file\" \n", "ARN_ROLE = \"your-arn-role\"\n", "\n", "# -- percentage of data used in model training (by default: 80%). \n", "TRAINING_PERC = 0.8 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Header names in your CSV files" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EVENT_ID\n", "ENTITY_ID\n", "ENTITY_TYPE\n", "EVENT_TIMESTAMP\n", "EVENT_LABEL\n", "LABEL_TIMESTAMP\n", "ip\n", "useragent\n", "fp\n", "session_id\n", "are_credentials_valid\n" ] } ], "source": [ "s3 = boto3.resource('s3')\n", "obj = s3.Object(S3_BUCKET, S3_FILE)\n", "body = obj.get()['Body']\n", "reader = csv.DictReader(codecs.getreader(\"utf-8\")(body))\n", "header_names = reader.fieldnames\n", "for item in header_names:\n", " print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Map header names to mandatory/optional variable types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Update the following section: fill in the mapping with the CSV header names printed above. \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# -- map your data to the variable types --\n",
"# Except for the fixed-name columns (EVENT_ID, ENTITY_ID, ENTITY_TYPE, EVENT_TIMESTAMP), use the VARIABLES_MAP dictionary \n",
"# to map each variable type to the column name in your data.\n",
"# Mandatory variable types include: IP_ADDRESS, USERAGENT, ARE_CREDENTIALS_VALID\n",
"# Optional variable types: FINGERPRINT, SESSION_ID\n",
"\n",
"VARIABLES_MAP = {\n",
"    # Mandatory variables\n",
"    \"IP_ADDRESS\": \"ip\", # e.g. ip\n",
"    \"USERAGENT\": \"useragent\", # e.g. user_agent\n",
"    \"ARE_CREDENTIALS_VALID\": \"are_credentials_valid\", # e.g. are_credentials_valid\n",
"\n",
"    # Optional variables\n",
"    \"FINGERPRINT\": \"fp\", # e.g. fingerprint\n",
"    \"SESSION_ID\": \"session_id\" # e.g. session_id\n",
"}\n",
"\n",
"# -- fail early if a mapped column is missing from the CSV header --\n",
"for c in VARIABLES_MAP.keys():\n",
"    if VARIABLES_MAP[c]!=\"\" and VARIABLES_MAP[c] not in header_names:\n",
"        raise ValueError(f'Variable {VARIABLES_MAP[c]} (type {c}) not in CSV header!')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Load and profile your dataset \n", "-----\n", "\n", "The functions below will: 1) profile your data, creating descriptive statistics, 2) perform basic data quality checks (nulls, unique variables, etc.), and 3) return summary statistics and the EVENT and MODEL schemas used to define your EVENT_TYPE and train your model. \n", "\n", "\n", "_**Important Note**: The functions below provide only a best-guess mapping for the fraud/legit labels and variables. Please review the summary stats, event variables, event labels and training data schema and make sure they are aligned with how you want to use the data. You can always manually modify them if needed._\n", "\n", "
💡 summary stats, event variables, event labels and training data schema \n", "\n", "- summary stats: data quality and summary statistics of the data; used to create variables of the specific feature types\n", "- event variables: variables associated with the specific event type; used when creating event type and sending events\n", "- event labels: labels associated with the event type; used when creating event type\n", "- training data schema: define the variables to build the model, labels to be used as fraud/legit, and how to treat the unlabeled events; By default, we identify the rare event as fraud, and the rest as not-fraud. If you have more than 2 labels in the data or want to map them in a different way, you can manually modify the training data schema \n", "\n", "
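
To make the default label mapping concrete, here is a minimal sketch of the rare-label-as-fraud rule in plain Python rather than pandas. The helper name `build_label_mapper` is hypothetical, missing labels are assumed to be represented as `None`, and ties are broken by sorted order, which may differ from how the profiling code in this notebook resolves them.

```python
from collections import Counter

def build_label_mapper(labels):
    # Count the non-missing labels, compared as strings.
    counts = Counter(str(x) for x in labels if x is not None)
    if not counts:
        return None
    # Treat the least frequent label as FRAUD; ties are broken by sorted order.
    rare = min(sorted(counts), key=lambda k: counts[k])
    legit = [k for k in sorted(counts) if k != rare]
    return {'FRAUD': [rare], 'LEGIT': legit}

mapper = build_label_mapper(['0', '0', '0', '1', None, '0'])
print(mapper)
```

If the rare label in your data is not actually the fraud label, or you have more than two labels, override the resulting mapping by hand before training.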
\n", "\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [
"# --- no changes; just run this code block ---\n",
"def summary_stats(df, variables_map):\n",
"    \"\"\"\n",
"    Generate summary statistics for a pandas data frame \n",
"    \"\"\"\n",
"    rowcnt = len(df)\n",
"\n",
"    # -- calculating data statistics and data types -- \n",
"    df_s1 = df.agg(['count', 'nunique']).transpose().reset_index().rename(columns={\"index\":\"feature_name\"})\n",
"    df_s1[\"null\"] = (rowcnt - df_s1[\"count\"]).astype('int64')\n",
"    df_s1[\"not_null\"] = rowcnt - df_s1[\"null\"]\n",
"    df_s1[\"null_pct\"] = df_s1[\"null\"] / rowcnt\n",
"    df_s1[\"nunique_pct\"] = df_s1['nunique']/ rowcnt\n",
"\n",
"    dt = pd.DataFrame(df.dtypes).reset_index().rename(columns={\"index\":\"feature_name\", 0:\"dtype\"})\n",
"    df_stats = pd.merge(dt, df_s1, on='feature_name', how='inner').round(4)\n",
"    df_stats['nunique'] = df_stats['nunique'].astype('int64')\n",
"    df_stats['count'] = df_stats['count'].astype('int64')\n",
"\n",
"    # -- variable type mapper: map mandatory variables and variables_map -- \n",
"    flatten_var_maps = []\n",
"    for vartype in variables_map.keys():\n",
"        if isinstance(variables_map[vartype], list):\n",
"            for var in variables_map[vartype]:\n",
"                flatten_var_maps.append([vartype, var])\n",
"        else:\n",
"            flatten_var_maps.append([vartype, variables_map[vartype]])\n",
"\n",
"    for vartype in ['ENTITY_TYPE','ENTITY_ID','EVENT_ID','EVENT_TIMESTAMP']:\n",
"        flatten_var_maps.append([vartype, vartype])\n",
"    for vartype in ['EVENT_LABEL','LABEL_TIMESTAMP']:\n",
"        if vartype in df.columns:\n",
"            flatten_var_maps.append([vartype, vartype])\n",
"\n",
"    df_schema = pd.DataFrame(flatten_var_maps, columns = ['feature_type', 'feature_name'])\n",
"    df_stats = pd.merge(df_stats, df_schema, how = 'left', on = 'feature_name')\n",
"\n",
"    # -- variable type mapper: map the rest types based on data type -- \n",
"    df_stats.loc[(df_stats['feature_type'].isna())&(df_stats[\"dtype\"] == object), 'feature_type'] = \"CATEGORICAL\"\n",
"    df_stats.loc[(df_stats['feature_type'].isna())&((df_stats[\"dtype\"] == \"int64\") | (df_stats[\"dtype\"] == \"float64\")), 'feature_type'] = \"NUMERIC\"\n",
"\n",
"    # -- variable validation -- \n",
"    df_stats['feature_warning'] = \"NO WARNING\"\n",
"    df_stats.loc[(df_stats[\"nunique\"] != 2) & (df_stats[\"feature_name\"] == \"EVENT_LABEL\"),'feature_warning' ] = \"LABEL WARNING, NON-BINARY EVENT LABEL\"\n",
"    df_stats.loc[(df_stats[\"nunique_pct\"] > 0.9) & (df_stats['feature_type'] == \"CATEGORICAL\") ,'feature_warning' ] = \"EXCLUDE, GT 90% UNIQUE\"\n",
"    df_stats.loc[(df_stats[\"null_pct\"] > 0.2) & (df_stats[\"null_pct\"] <= 0.75), 'feature_warning' ] = \"NULL WARNING, GT 20% MISSING\"\n",
"    df_stats.loc[df_stats[\"null_pct\"] > 0.75,'feature_warning' ] = \"EXCLUDE, GT 75% MISSING\"\n",
"    df_stats.loc[((df_stats['dtype'] == \"int64\" ) | (df_stats['dtype'] == \"float64\" ) ) & (df_stats['nunique'] < 0.2), 'feature_warning' ] = \"LIKELY CATEGORICAL, NUMERIC w. LOW CARDINALITY\"\n",
"    return df_stats[['feature_name', 'feature_type', 'dtype', 'count', 'null', 'null_pct', 'nunique', 'nunique_pct', 'feature_warning']]\n",
"\n",
"\n",
"def prepare_schema(df, df_stats, variables_map):\n",
"    \"\"\"\n",
"    Prepare schema for following steps\n",
"    \"\"\"\n",
"    # -- prepare event variables --\n",
"    exclude_list = ['ENTITY_TYPE','ENTITY_ID','EVENT_ID','EVENT_TIMESTAMP','EVENT_LABEL','LABEL_TIMESTAMP','UNKNOWN']\n",
"    event_variables = df_stats.loc[(~df_stats['feature_type'].isin(exclude_list))]['feature_name'].to_list()\n",
"\n",
"    # -- define training_data_schema; stored events need to specify unlabeledEventsTreatment --\n",
"    training_data_schema = {\n",
"        'modelVariables' : df_stats.loc[~(df_stats['feature_type'].isin(exclude_list))]['feature_name'].to_list(),\n",
"    }\n",
"\n",
"    if 'EVENT_LABEL' in df.columns:\n",
"        # -- target -- \n",
"        label_value_count = df['EVENT_LABEL'].dropna().astype('str', errors='ignore').value_counts()\n",
"        event_labels = label_value_count.index.unique().tolist()\n",
"        training_data_schema['labelSchema'] = {\n",
"            # we assume the rare event is fraud, and the rest is not-fraud.\n",
"            # if you have more than 2 labels in the data or want to map them in a different way, you can manually modify the training data schema\n",
"            'labelMapper' : {\n",
"                'FRAUD' : [str(label_value_count.idxmin())],\n",
"                'LEGIT' : [i for i in event_labels if i not in [str(label_value_count.idxmin())]]\n",
"            },\n",
"            # how unlabeled events are treated during training; here they are treated as LEGIT\n",
"            'unlabeledEventsTreatment': 'LEGIT'\n",
"        }\n",
"    else:\n",
"        event_labels = None\n",
"\n",
"    return training_data_schema, event_variables, event_labels\n",
"\n",
"\n",
"def profiling(df, variables_map):\n",
"    \"\"\"\n",
"    Profile the input pandas data frame and prepare the schema for the following steps \n",
"\n",
"    Arguments:\n",
"        df (DataFrame) - pandas DataFrame to create summary statistics for\n",
"        variables_map (dictionary) - variables map dictionary - key is the variable type and value is the variable name (or a list of names)\n",
"\n",
"    Returns:\n",
"        DataFrame of summary statistics, training data schema, event variables and event labels \n",
"    \"\"\"\n",
"    df = df.copy()\n",
"\n",
"    # -- check required variables (use the variables_map argument, not the global) --\n",
"    required_var_type = ['ENTITY_TYPE','ENTITY_ID','EVENT_ID','EVENT_TIMESTAMP', 'IP_ADDRESS', 'USERAGENT', 'ARE_CREDENTIALS_VALID']\n",
"    required_var_name = []\n",
"    for item in required_var_type:\n",
"        if variables_map.get(item):\n",
"            required_var_name.append(variables_map.get(item))\n",
"        else:\n",
"            required_var_name.append(item)\n",
"\n",
"    missing_required_vars = [i for i in required_var_name if i not in set(df.columns)]\n",
"    if len(missing_required_vars) != 0:\n",
"        raise ValueError(f'Required columns {missing_required_vars} are not included in the training data.')\n",
"\n",
"    # -- check that ENTITY_TYPE contains only one value --\n",
"    entity_types = list(df['ENTITY_TYPE'].unique())\n",
"    if len(entity_types)> 1:\n",
"        raise ValueError('Currently, Amazon Fraud Detector only supports one ENTITY_TYPE per EVENT_TYPE.')\n",
"\n",
"    # -- get data summary --\n",
"    df_stats = summary_stats(df, variables_map)\n",
"\n",
"    # -- prepare schema for following steps -- \n",
"    training_data_schema, event_variables, event_labels = prepare_schema(df, df_stats, variables_map)\n",
"\n",
"    print(\"--- summary stats ---\")\n",
"    print(df_stats)\n",
"    print(\"\\n\")\n",
"    print(\"--- event variables ---\")\n",
"    print(event_variables)\n",
"    print(\"\\n\")\n",
"    print(\"--- event labels ---\")\n",
"    print(event_labels)\n",
"    print(\"\\n\")\n",
"    print(\"--- training data schema ---\")\n",
"    print(training_data_schema)\n",
"\n",
"    return df_stats, training_data_schema, event_variables, event_labels\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--- event timestamp ---\n", "earliest event: 2022-01-01T00:00:31Z , latest event: 2022-06-30T23:59:49Z\n", "split training and test set at: 2022-05-27T21:56:51Z\n", "\n", "\n", "--- summary stats ---\n", " feature_name feature_type dtype count null null_pct nunique nunique_pct feature_warning\n", "0 EVENT_ID EVENT_ID object 735683 0 0.0000 735683 1.0000 NO WARNING\n", "1 ENTITY_ID ENTITY_ID object 735683 0 0.0000 100000 0.1359 NO WARNING\n", "2 ENTITY_TYPE ENTITY_TYPE object 735683 0 0.0000 1 0.0000 NO WARNING\n", "3 EVENT_TIMESTAMP EVENT_TIMESTAMP object 735683 0 0.0000 718441 0.9766 NO WARNING\n", "4 EVENT_LABEL EVENT_LABEL object 529 735154 0.9993 2 0.0000 EXCLUDE, GT 75% MISSING\n", "5 LABEL_TIMESTAMP LABEL_TIMESTAMP object 529 735154 0.9993 529 0.0007 EXCLUDE, GT 75% MISSING\n", "6 ip IP_ADDRESS object 735683 0 0.0000 190546 0.2590 NO WARNING\n", "7 useragent USERAGENT object 735683 0 0.0000 15715 0.0214 NO WARNING\n", "8 fp FINGERPRINT object 735683 0 0.0000 270396 0.3675 NO WARNING\n", "9 session_id SESSION_ID object 735683 0 0.0000 426005 0.5791 NO WARNING\n", "10 are_credentials_valid ARE_CREDENTIALS_VALID bool 735683 0 0.0000 2 0.0000 NO WARNING\n", "\n", "\n", "--- event variables ---\n", 
"['ip', 'useragent', 'fp', 'session_id', 'are_credentials_valid']\n", "\n", "\n", "--- event labels ---\n", "['0', '1']\n", "\n", "\n", "--- training data schema ---\n", "{'modelVariables': ['ip', 'useragent', 'fp', 'session_id', 'are_credentials_valid'], 'labelSchema': {'labelMapper': {'FRAUD': ['1'], 'LEGIT': ['0']}, 'unlabeledEventsTreatment': 'LEGIT'}}\n" ] } ], "source": [ "# -- connect to S3, snag file, and convert to a panda's dataframe --\n", "s3 = boto3.resource('s3')\n", "obj = s3.Object(S3_BUCKET, S3_FILE)\n", "body = obj.get()['Body']\n", "df = pd.read_csv(body, dtype={'EVENT_LABEL': object})\n", " \n", "# -- by default, we split the data into training (80%) and test set (20%) --\n", "earliest_event = df['EVENT_TIMESTAMP'].min()\n", "latest_event = df['EVENT_TIMESTAMP'].max()\n", "if TRAINING_PERC > 1 or TRAINING_PERC <= 0:\n", " raise ValueError(\"TRAINING_PERC should be in (0,1]\")\n", "else:\n", " test_split_row = int(df.shape[0]*TRAINING_PERC)\n", " test_split_time = df.sort_values(by = 'EVENT_TIMESTAMP').iloc[test_split_row]['EVENT_TIMESTAMP']\n", " test_split_time = pd.to_datetime(test_split_time).strftime(\"%Y-%m-%dT%H:%M:%SZ\")\n", "\n", "print(\"--- event timestamp ---\")\n", "print(\"earliest event:\", earliest_event, \", latest event:\", latest_event) \n", "print(\"split training and test set at:\", test_split_time) \n", "print(\"\\n\")\n", "\n", "# -- call profiling function -- \n", "df_stats, training_data_schema, event_variables, event_labels = profiling(df, VARIABLES_MAP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Create event variables and labels \n", "-----\n", "\n", "The following section will automatically create your modeling input variables for you. \n", "\n", "\n", "\n", "
💡 APIs for Creating/Deleting Variables and Labels \n", " \n", "- **create_variable**: Creates a variable in Fraud Detector\n", "- **get_variables**: Gets all of the variables or a specific variable if name is provided\n", "- **delete_variable**: Deletes a variable. If you have events, models or detectors created using the variable, you need to delete the associated resources first\n", "- **put_label**: Creates a label\n", "- **get_labels**: Gets all labels or a specific label if name is provided\n", "- **delete_label**: Deletes a label \n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " --- model variable dict --\n", "ip has been defined, data type: STRING\n", "useragent has been defined, data type: STRING\n", "fp has been defined, data type: STRING\n", "session_id has been defined, data type: STRING\n", "Creating variable: are_credentials_valid\n", "\n", "\n", "{'ip': 'STRING', 'useragent': 'STRING', 'fp': 'STRING', 'session_id': 'STRING', 'are_credentials_valid': 'ARE_CREDENTIALS_VALID'}\n" ] } ], "source": [
"# -- function to create all your variables --- \n",
"def create_variables(features_dict):\n",
"    \"\"\"\n",
"    Check if variables exist; if not, add the variable to Fraud Detector \n",
"\n",
"    Arguments: \n",
"        features_dict - a dictionary that maps your variables to variable types\n",
"    \"\"\"\n",
"    for feature in features_dict.keys():\n",
"        # -- pick the data type and default value from the variable type --\n",
"        if features_dict[feature] in ['NUMERIC','PRICE']:\n",
"            DATA_TYPE = 'FLOAT'\n",
"            DEFAULT_VALUE = '0.0'\n",
"        elif features_dict[feature]=='ARE_CREDENTIALS_VALID':\n",
"            DATA_TYPE = 'BOOLEAN'\n",
"            DEFAULT_VALUE = 'false'\n",
"        else:\n",
"            DATA_TYPE = 'STRING'\n",
"            DEFAULT_VALUE = ''\n",
"\n",
"        try:\n",
"            # -- variable already exists: record its data type --\n",
"            resp = client.get_variables(name = feature)\n",
"            features_dict[feature] = resp['variables'][0]['dataType']\n",
"            print(\"{0} has been defined, data type: {1}\".format(feature, features_dict[feature]))\n",
"        except Exception:\n",
"            # -- lookup failed: create the variable --\n",
"            print(\"Creating variable: {0}\".format(feature))\n",
"            resp = client.create_variable(\n",
"                name = feature,\n",
"                dataType = DATA_TYPE,\n",
"                dataSource ='EVENT',\n",
"                defaultValue = DEFAULT_VALUE,\n",
"                description = feature,\n",
"                variableType = features_dict[feature])\n",
"    return features_dict\n",
"\n",
"exclude_list = ['ENTITY_TYPE','ENTITY_ID','EVENT_ID','EVENT_TIMESTAMP','EVENT_LABEL','LABEL_TIMESTAMP','UNKNOWN']\n",
"features_dict = df_stats.loc[(~df_stats['feature_type'].isin(exclude_list))].set_index('feature_name')['feature_type'].to_dict()\n",
"print(\"\\n --- model variable dict --\")\n",
"features_dict = create_variables(features_dict)\n",
"print(\"\\n\")\n",
"print(features_dict)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Create labels (Optional)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " --- model label schema dict --\n", "{'FRAUD': ['1'], 'LEGIT': ['0']}\n" ] } ], "source": [
"# -- function to create all your labels --- \n",
"def create_label(label_mapper):\n",
"    \"\"\"\n",
"    Add labels to Fraud Detector\n",
"\n",
"    Arguments:\n",
"        label_mapper - a dictionary that maps Fraud/Legit to the labels in your data\n",
"    \"\"\"\n",
"    for label in label_mapper['FRAUD']:\n",
"        response = client.put_label(\n",
"            name = label,\n",
"            description = \"FRAUD\")\n",
"\n",
"    for label in label_mapper['LEGIT']:\n",
"        response = client.put_label(\n",
"            name = label,\n",
"            description = \"LEGIT\")\n",
"\n",
"if 'labelSchema' in training_data_schema:\n",
"    label_mapper = training_data_schema['labelSchema']['labelMapper']\n",
"    print(\"\\n --- model label schema dict --\")\n",
"    print(label_mapper)\n",
"    create_label(label_mapper)\n",
"else:\n",
"    print('Labels not defined, skip.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Define your Entity and Event Types \n", "-----\n", " \n", "The following code block will automatically create your entity and event types for you.\n", "\n", "
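
An event type ties together the event variables, the optional labels, and the entity type. The sketch below assembles the keyword arguments for `put_event_type` as a plain dictionary, using the same argument names as this notebook, and leaves out `labels` when the dataset has none (passing `None` there is likely to be rejected by boto3 parameter validation). The helper name `build_event_type_request` and the sample values are hypothetical.

```python
def build_event_type_request(name, event_variables, entity_type, event_labels=None):
    # Keyword arguments for put_event_type, mirroring the call used in this notebook.
    request = {
        'name': name,
        'eventVariables': list(event_variables),
        'eventIngestion': 'ENABLED',
        'entityTypes': [entity_type],
    }
    # Only include labels when the dataset actually has them.
    if event_labels:
        request['labels'] = list(event_labels)
    return request

request = build_event_type_request('sample_login', ['ip', 'useragent'], 'user')
print(sorted(request))
```

With a real client, the dictionary could then be passed as `client.put_event_type(**request)`.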
💡 APIs for Entity and Event Types \n", "\n", "- **put_entity_type**: Creates or updates an entity type. An entity represents who is performing the event. An entity type classifies the entity. Example classifications include customer, merchant, or account\n", "- **get_entity_types**: Gets all entity types or a specific entity type if a name is specified\n", "- **delete_entity_type**: Deletes an entity type. If you have an event type associated with the entity type, you need to delete that event type first \n", "- **put_event_type**: Creates or updates an event type. An event is a business activity that is evaluated for fraud risk. Example event types include online payment transactions, account registrations, and authentications\n", "- **get_event_types**: Gets all event types or a specific event type if name is provided\n", "- **delete_event_type**: Deletes an event type \n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# --- no changes just run this code block ---\n",
"\n",
"# -- create the entity type if it does not exist --\n",
"entity_type = list(df['ENTITY_TYPE'].unique())[0]\n",
"\n",
"try:\n",
"    response = client.get_entity_types(name = entity_type)\n",
"    print(\"-- entity type exists --\")\n",
"    print(response)\n",
"except Exception:\n",
"    response = client.put_entity_type(\n",
"        name = entity_type,\n",
"        description = entity_type\n",
"    )\n",
"    print(\"-- create entity type --\")\n",
"    print(response)\n",
"\n",
"\n",
"# -- create the event type if it does not exist --\n",
"try:\n",
"    response = client.get_event_types(name = EVENT_TYPE)\n",
"    print(\"\\n-- event type exists --\")\n",
"    print(response)\n",
"except Exception:\n",
"    response = client.put_event_type (\n",
"        name = EVENT_TYPE,\n",
"        eventVariables = event_variables,\n",
"        labels = event_labels,\n",
"        eventIngestion = 'ENABLED',\n",
"        entityTypes = [entity_type])\n",
"    print(\"\\n-- create event type --\")\n",
"    print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Send Events \n", "\n", "***Send Events Using Batch Import***\n", "\n", "The following section will automatically import data from your S3 bucket into AFD using the Batch Import function. After sending all your events, you can also go to the console and click \"Refresh events data\" to check the number of events, the earliest and the latest event timestamp, etc.\n", "\n", "
💡 APIs for Sending, Getting and Deleting Events \n", "\n", "- **send_event**: Sends one event to AFD\n", "- **get_event**: Gets the specified event by its eventId\n", "- **delete_events_by_event_type**: Deletes all events associated with the event type\n", "- **delete_event**: Deletes the specified event\n", "- **create_batch_import_job**: Batch imports all events in a CSV file in an S3 bucket to AFD\n", "- **get_batch_import_jobs**: Checks the status of a batch import job \n", "\n", "
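
Because a batch import runs asynchronously, its status has to be polled until it leaves the in-progress states. The sketch below shows that polling pattern in isolation, with the status lookup injected as a function so it can be tried without an AWS call. The helper name `wait_for_job`, its parameters, and the fake status sequence are assumptions for illustration; only the two in-progress status values are taken from this notebook.

```python
import time

def wait_for_job(get_status,
                 in_progress=('IN_PROGRESS', 'IN_PROGRESS_INITIALIZING'),
                 poll_seconds=60, max_polls=120, sleep=time.sleep):
    # Poll until the status leaves the in-progress set, or give up after max_polls.
    for _ in range(max_polls):
        status = get_status()
        if status not in in_progress:
            return status
        sleep(poll_seconds)
    raise TimeoutError('job still in progress after {} polls'.format(max_polls))

# Demo with a fake status source instead of a real AFD call; no real sleeping.
fake_statuses = iter(['IN_PROGRESS_INITIALIZING', 'IN_PROGRESS', 'COMPLETE'])
print(wait_for_job(lambda: next(fake_statuses), sleep=lambda s: None))
```

With the real client, `get_status` could be a small function that calls `client.get_batch_import_jobs(jobId = ...)` and returns `response['batchImports'][0]['status']`, as the polling cell in this notebook does. Bounding the number of polls avoids a loop that never ends if the job gets stuck.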
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"## -- create batch import job --\n",
"client.create_batch_import_job(\n",
"    jobId = f'batch_import_{EVENT_TYPE}_v2',\n",
"    inputPath = f's3://{S3_BUCKET}/{S3_FILE}',\n",
"    outputPath = f's3://{S3_BUCKET}',\n",
"    eventTypeName = EVENT_TYPE,\n",
"    iamRoleArn = ARN_ROLE\n",
")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"# -- it takes some time to import all data into AFD -- \n",
"print(\"-- wait for batch import to complete --\")\n",
"stime = time.time()\n",
"while True:\n",
"    clear_output(wait = True)\n",
"    response = client.get_batch_import_jobs(jobId = f'batch_import_{EVENT_TYPE}_v2')\n",
"    status = response['batchImports'][0]['status']\n",
"    if status in ['IN_PROGRESS', 'IN_PROGRESS_INITIALIZING']:\n",
"        print(f\"current progress: {(time.time() - stime)/60:3.3} minutes\")\n",
"        time.sleep(60) # -- sleep for 60 seconds \n",
"    else:\n",
"        print(f\"Batch import status : {status}\")\n",
"        break\n",
"etime = time.time()\n",
"\n",
"# -- summarize --\n",
"print(\"\\n --- batch import complete --\")\n",
"print(\"Elapsed time : %s\" % (etime - stime) + \" seconds \\n\" )\n",
"print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----\n", "***(Optional) Update Label***\n", "\n", "After importing events, you can update their labels at any time, for example after an investigation.\n", "\n", "
💡 API for Updating Event Labels \n", "\n", "- **update_event_label**: Update one event's label\n", "\n", "
 \n", "\n", "For example, to change the label of an ingested event from '0' to '1', run:\n", "\n", "```python\n", "client.update_event_label(\n", "    eventId = '3200002a7ee5e7d0dc80ad7906d792373b',\n", "    eventTypeName = EVENT_TYPE,\n", "    assignedLabel = '1',\n", "    labelTimestamp = '2021-05-01T17:01:00Z'\n", ")\n", "```\n", "\n", "Check the updated event:\n", "```python\n", "client.get_event(eventId = '3200002a7ee5e7d0dc80ad7906d792373b', eventTypeName = EVENT_TYPE)\n", "```\n", "\n", "The `currentLabel` is now '1':\n", "\n", "```python\n", "{\n", "    'event': {\n", "        'eventId': '3200002a7ee5e7d0dc80ad7906d792373b',\n", "        'eventTypeName': 'accounttakeover',\n", "        'eventTimestamp': '2020-11-28T14:59:50Z',\n", "        'eventVariables': {\n", "            'ip': '1.1.1.1',\n", "            'user_agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows 95; Trident/3.0)'\n", "        },\n", "        'currentLabel': '1',\n", "        'labelTimestamp': '2021-05-01T17:01:00Z',\n", "        'entities': [{'entityType': 'user', 'entityId': '153-04-1621'}]\n", "    },\n", "    'ResponseMetadata': ...\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Create and train your model \n", "-----\n", "\n", "After sending all your events, we recommend waiting 10-20 minutes to ensure that they are fully ingested by the system. Then you can create and train your model. The following section trains and activates your model for you. By default, all available features are used; to exclude features from training, review and modify the `training_data_schema`.\n", " \n", "
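To make the feature-exclusion step concrete, here is a minimal sketch of building a schema that leaves some variables out. It assumes an unlabeled Account Takeover Insights setup in which the schema carries only `modelVariables`; if the `training_data_schema` defined earlier in this notebook has other keys (for example a `labelSchema`), keep them. `build_training_data_schema` and the variable list are hypothetical.

```python
def build_training_data_schema(all_variables, exclude=()):
    '''Hypothetical helper: return a schema dict that omits the excluded variables.'''
    excluded = set(exclude)
    unknown = excluded - set(all_variables)
    if unknown:
        raise ValueError('cannot exclude unknown variables: {0}'.format(sorted(unknown)))
    kept = [name for name in all_variables if name not in excluded]
    if not kept:
        raise ValueError('at least one model variable must remain')
    # Assumption: only modelVariables is needed for an unlabeled training set.
    return {'modelVariables': kept}


# Variable names taken from the sample predictions shown later in this notebook.
example_variables = ['ip', 'useragent', 'fp', 'session_id', 'are_credentials_valid']
example_schema = build_training_data_schema(example_variables, exclude=['fp'])
print(example_schema)
```

Rejecting unknown names up front catches a typo in the exclusion list before a training job is started.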
💡 APIs for Creating and Training Model \n", "\n", "- **create_model**: Creates a model using the specified model type. Available model types include ONLINE_FRAUD_INSIGHTS, TRANSACTION_FRAUD_INSIGHTS, and ACCOUNT_TAKEOVER_INSIGHTS (the Account Takeover Insights template this notebook is built around)\n", "- **update_model**: Updates a model. You can update the description attribute using this action\n", "- **get_models**: Gets one or more models. Gets all models for the AWS account if no model type and no model id are provided\n", "- **create_model_version**: Creates a version of the model using the specified model type and model id\n", "- **update_model_version**: Updates a model version. This retrains an existing model version on updated training data and produces a new minor version of the model, for example version 1.01, 1.02, 1.03\n", "- **describe_model_versions**: Gets all of the model versions for the specified model type or for the specified model type and model ID. You can also get details for a single, specified model version\n", "- **get_model_version**: Gets the details of the specified model version\n", "- **put_external_model**: Creates or updates an Amazon SageMaker model endpoint. You can also use this action to update the configuration of the model endpoint, including the IAM role and/or the mapped variables\n", "- **get_external_models**: Gets the details for one or more Amazon SageMaker models that have been imported into the service\n", "- **update_model_version_status**: Updates the status of a model version. You can 1) change the TRAINING_COMPLETE status to ACTIVE, 2) change ACTIVE to INACTIVE\n", "\n", "
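The cells in this section wait on `get_model_version` in an open-ended `while True` loop. The sketch below shows the same polling pattern with a timeout, so a stuck job cannot block the notebook forever. `wait_for_status` is a hypothetical helper, and the status source is faked so the example runs offline.

```python
import time


def wait_for_status(get_status, pending, poll_seconds=60, timeout_seconds=4 * 60 * 60):
    '''Hypothetical helper: call get_status() until it returns a status outside pending.'''
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in pending:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError('still {0} after {1} seconds'.format(status, timeout_seconds))
        time.sleep(poll_seconds)


# Stand-in for the service so the sketch runs without AWS access.
fake_statuses = iter(['TRAINING_IN_PROGRESS', 'TRAINING_IN_PROGRESS', 'TRAINING_COMPLETE'])
final_status = wait_for_status(lambda: next(fake_statuses),
                               pending={'TRAINING_IN_PROGRESS'},
                               poll_seconds=0)
print(final_status)
```

In this notebook the status callable could be `lambda: client.get_model_version(modelId=MODEL_NAME, modelType=MODEL_TYPE, modelVersionNumber=model_version)['status']`, with the default 60-second poll interval.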
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# --- no changes; just run this code block ---\n", "# -- create our model --\n", "try:\n", " response = client.create_model(\n", " description = MODEL_DESC,\n", " eventTypeName = EVENT_TYPE,\n", " modelId = MODEL_NAME,\n", " modelType = MODEL_TYPE)\n", " print(\"-- initalize model --\")\n", " print(response)\n", "except Exception:\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# --- no changes; just run this code block ---\n", "# -- initalizes the model, it's now ready to train --\n", "response = client.create_model_version(\n", " modelId = MODEL_NAME,\n", " modelType = MODEL_TYPE,\n", " trainingDataSource = 'INGESTED_EVENTS',\n", " trainingDataSchema = training_data_schema,\n", " ingestedEventsDetail={\n", " 'ingestedEventsTimeWindow': {\n", " 'startTime': df['EVENT_TIMESTAMP'].min(),\n", " 'endTime': test_split_time\n", " }\n", " }\n", ")\n", "model_version = response['modelVersionNumber']\n", "print(\"-- model training --\")\n", "print(response)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -- model training can take a long time, we'll loop until it's complete -- \n", "print(\"-- wait for model training to complete --\") \n", "stime = time.time()\n", "while True:\n", " clear_output(wait = True)\n", " response = client.get_model_version(modelId = MODEL_NAME, modelType = MODEL_TYPE, modelVersionNumber = model_version)\n", " if response['status'] == 'TRAINING_IN_PROGRESS':\n", " print(f\"current progress: {(time.time() - stime)/60:{3}.{3}} minutes\")\n", " time.sleep(60) # -- sleep for 60 seconds \n", " if response['status'] != 'TRAINING_IN_PROGRESS':\n", " print(\"Model status : \" + response['status'])\n", " break\n", "etime = time.time()\n", "\n", "# -- summarize --\n", "print(\"\\n --- model training complete --\")\n", "print(f\"Elapsed time: 
{(etime - stime)/60:{3}.{3}} minutes \\n\" )\n", "print(response)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -- activate the model version --\n", "response = client.update_model_version_status (\n", " modelId = MODEL_NAME,\n", " modelType = MODEL_TYPE,\n", " modelVersionNumber = model_version,\n", " status = 'ACTIVE'\n", ")\n", "print(\"-- activating model --\")\n", "print(response)\n", "\n", "# -- wait until model is active --\n", "print(\"--- waiting until model status is active \")\n", "stime = time.time()\n", "while True:\n", " clear_output(wait=True)\n", " response = client.get_model_version(modelId=MODEL_NAME, modelType = MODEL_TYPE, modelVersionNumber = model_version)\n", " if response['status'] != 'ACTIVE':\n", " print(f\"current progress: {(time.time() - stime)/60:{3}.{3}} minutes\")\n", " time.sleep(60) # sleep for 1 minute \n", " if response['status'] == 'ACTIVE':\n", " print(\"Model status : \" + response['status'])\n", " break\n", " \n", "etime = time.time()\n", "print(f\"Elapsed time: {(etime - stime)/60:{3}.{3}} minutes \\n\" )\n", "print(response)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "trainingMetrics = client.describe_model_versions(\n", " modelId = MODEL_NAME,\n", " modelVersionNumber = model_version,\n", " modelType = MODEL_TYPE,\n", " maxResults = 10\n", ")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# -- model performance summary -- \n", "trainingMetrics = client.describe_model_versions(\n", " modelId = MODEL_NAME,\n", " modelVersionNumber = model_version,\n", " modelType = MODEL_TYPE,\n", " maxResults = 10\n", ")['modelVersionDetails'][0]['trainingResultV2']['trainingMetricsV2']['ati']\n", "\n", "perf_asi = trainingMetrics['modelPerformance']['asi']\n", "df_model = pd.DataFrame(trainingMetrics['metricDataPoints'])\n", "\n", "# -- ROC Chart -- \n", "plt.figure(figsize = (8,8))\n", "plt.plot(df_model[\"cr\"], df_model[\"adr\"], color='darkorange', lw=2, label='ADR')\n", "plt.plot(df_model[\"cr\"], df_model[\"atodr\"], color='darkgreen', lw=2, label='ATO_DR')\n", "plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')\n", "plt.xlabel('Challenge Rate')\n", "plt.ylabel('Discovery Rate')\n", "plt.title(f'{MODEL_NAME}, ASI: {perf_asi:.3f}')\n", "plt.legend(loc=\"lower right\", fontsize=12)\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# -- variable importance summary -- \n", "varImpMetrics = client.describe_model_versions(\n", " modelId = MODEL_NAME,\n", " modelVersionNumber = model_version,\n", " modelType = MODEL_TYPE,\n", " maxResults = 10\n", ")['modelVersionDetails'][0]['trainingResultV2']['aggregatedVariablesImportanceMetrics'] \n", "\n", "df_var_imp = pd.DataFrame(varImpMetrics['logOddsMetrics']).sort_values(by='aggregatedVariablesImportance')\n", "\n", "# -- Variable importance Chart -- \n", "df_var_imp.plot.barh(x='variableNames',y='aggregatedVariablesImportance',figsize=(10,int(df_var_imp.shape[0])))\n", "plt.xlabel('Variable Importance (logOdds)')\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Create a Fraud Detector, generate Rules and assemble your Detector \n", "-----\n", "The following section will automatically generate a number of fraud, investigate and approve rules based on the false positive rate and score thresholds of your model. These are just example rules that you could create, it is recommended that you fine tune your rules specifically to your business use case. \n", " \n", "
💡 Key APIs for Generating Rules, Creating and Publishing a Detector \n", " \n", "- **put_detector**: Creates or updates a detector\n", "- **put_outcome**: Creates or updates an outcome\n", "- **create_rule**: Creates a rule for use with the specified detector\n", "- **update_rule_version**: Updates a rule version resulting in a new rule version (version 1, 2, 3 ...)\n", "- **create_detector_version**: Creates a detector version. The detector version starts in a DRAFT status\n", "- **update_detector_version**: Updates a detector version. The detector version attributes that you can update include models, external model endpoints, rules, rule execution mode, and description. You can only update a DRAFT detector version\n", "- **update_detector_version_status**: Updates the detector version’s status. You can perform the following promotions or demotions using UpdateDetectorVersionStatus: DRAFT to ACTIVE, ACTIVE to INACTIVE, and INACTIVE to ACTIVE\n", "- **describe_detector**: Gets all versions for a specified detector\n", "
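To show how score cuts map to rule expressions before any API call is made, here is a small offline sketch. It follows the `$<model>_insightscore` naming used by the rule-creation cell in this section; `build_rule_expressions` and the model name `ato_model` are hypothetical.

```python
def build_rule_expressions(model_name, score_cuts, outcomes):
    '''Hypothetical helper: pair each outcome with a rule expression string.'''
    if not score_cuts:
        raise ValueError('at least one score cut is required')
    if len(outcomes) != len(score_cuts) + 1:
        raise ValueError('expected exactly one more outcome than score cuts')
    if list(score_cuts) != sorted(score_cuts, reverse=True):
        raise ValueError('score cuts must be in descending order')
    score_variable = '${0}_insightscore'.format(model_name)
    rules = []
    # One greater-than rule per cut, highest cut first.
    for cut, outcome in zip(score_cuts, outcomes):
        rules.append(('{0} > {1}'.format(score_variable, cut), outcome))
    # Everything at or below the lowest cut falls through to the last outcome.
    rules.append(('{0} <= {1}'.format(score_variable, score_cuts[-1]), outcomes[-1]))
    return rules


example_rules = build_rule_expressions('ato_model', [950, 750],
                                       ['investigate', 'challenge', 'approve'])
for expression, outcome in example_rules:
    print('IF {0} THEN {1}'.format(expression, outcome))
```

The order matters when the detector version runs in `FIRST_MATCHED` mode: a score of 980 satisfies both greater-than rules, and only the first one listed decides the outcome.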
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# -- put detector, initalizes your detector -- \n", "response = client.put_detector(\n", " detectorId = DETECTOR_NAME, \n", " description = DETECTOR_DESC,\n", " eventTypeName = EVENT_TYPE )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " --- score thresholds 1% to 6% --- \n", " cr adr atodr threshold\n", "0 0.01 0.86 0.84 750.0\n", "1 0.02 0.90 0.92 670.0\n", "2 0.03 0.92 0.92 620.0\n", "3 0.04 0.95 0.92 575.0\n", "4 0.05 0.96 0.97 540.0\n", "5 0.06 0.96 0.97 515.0\n" ] } ], "source": [ "# -- check the score thresholds with FPR from 1% to 6% --\n", "model_stat = df_model.sort_values(by='cr')\n", "model_stat['cr_bin']=np.ceil(model_stat['cr']*100)*0.01\n", "m = model_stat.loc[model_stat.groupby([\"cr_bin\"])[\"threshold\"].idxmin()] \n", "m = m.round(decimals=2)[['cr','adr','atodr','threshold']]\n", "print (\" --- score thresholds 1% to 6% --- \")\n", "print(m.loc[(m['cr'] > 0.0 ) & (m['cr'] <= 0.06)].reset_index(drop=True))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -- decide what threshold and corresponding outcome you want to add -- \n", "# here, we create three simple rules by cutting the score at [950,750], \n", "# and create three outcome ['investigate', 'challenge', 'approve'] \n", "# it will create 3 rules:\n", "# score > 950: investigate\n", "# score > 750: challenge \n", "# score <= 750: approve\n", "\n", "score_cuts = [950,750] # Note: recommend to fine tune this based on your business use case\n", "outcomes = ['investigate', 'challenge', 'approve'] # Note: recommend to define this based on your business use case\n", "\n", "def create_outcomes(outcomes):\n", " \"\"\" \n", " Create Fraud Detector Outcomes \n", " \"\"\" \n", " for outcome in outcomes:\n", " print(\"creating outcome variable: {0} \".format(outcome))\n", 
" response = client.put_outcome(name = outcome, description = outcome)\n", "\n", "def create_rules(score_cuts, outcomes):\n", " \"\"\"\n", " Creating rules \n", " \n", " Arguments:\n", " score_cuts - list of score cuts to create rules\n", " outcomes - list of outcomes associated with the rules\n", " \n", " Returns:\n", " a rule list to used when creating detector\n", " \"\"\"\n", " \n", " if len(score_cuts)+1 != len(outcomes):\n", " logging.error('Your socre cuts and outcomes are not matched.')\n", " \n", " rule_list = []\n", " for i in range(len(outcomes)):\n", " # rule expression\n", " if i < (len(outcomes)-1):\n", " rule = \"${0}_insightscore > {1}\".format(MODEL_NAME,score_cuts[i])\n", " else:\n", " rule = \"${0}_insightscore <= {1}\".format(MODEL_NAME,score_cuts[i-1])\n", " \n", " # append to rule_list (used when create detector)\n", " rule_id = \"rules{0}_{1}\".format(i, MODEL_NAME)\n", " \n", " rule_list.append({\n", " \"ruleId\": rule_id, \n", " \"ruleVersion\" : '1',\n", " \"detectorId\" : DETECTOR_NAME\n", " })\n", " \n", " # create rules\n", " print(\"creating rule: {0}: IF {1} THEN {2}\".format(rule_id, rule, outcomes[i]))\n", " try:\n", " response = client.create_rule(\n", " ruleId = rule_id,\n", " detectorId = DETECTOR_NAME,\n", " expression = rule,\n", " language = 'DETECTORPL',\n", " outcomes = [outcomes[i]]\n", " )\n", " except:\n", " print(\"this rule already exists in this detector\")\n", " \n", " return rule_list\n", " \n", "# -- create outcomes -- \n", "print(\" -- create outcomes --\")\n", "create_outcomes(outcomes)\n", "\n", "# -- create rules --\n", "print(\" -- create rules --\")\n", "rule_list = create_rules(score_cuts, outcomes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -- create detector version --\n", "client.create_detector_version(\n", " detectorId = DETECTOR_NAME,\n", " rules = rule_list,\n", " modelVersions = [{\"modelId\": MODEL_NAME, \n", " \"modelType\": MODEL_TYPE,\n", " 
\"modelVersionNumber\": model_version}],\n", " # there are 2 options for ruleExecutionMode:\n", " # 'ALL_MATCHED' - return all matched rules' outcome\n", " # 'FIRST_MATCHED' - return first matched rule's outcome\n", " ruleExecutionMode = 'FIRST_MATCHED'\n", ")\n", "\n", "print(\"\\n -- detector created -- \")\n", "print(response) " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Latest Detector Version: 1\n", "\n", " -- detector activated -- \n", "{'ResponseMetadata': {'RequestId': 'a1df5bee-a034-4459-886a-91d1c2c5d9da', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Fri, 22 Jul 2022 23:21:28 GMT', 'content-type': 'application/x-amz-json-1.1', 'content-length': '2', 'connection': 'keep-alive', 'x-amzn-requestid': 'a1df5bee-a034-4459-886a-91d1c2c5d9da'}, 'RetryAttempts': 0}}\n" ] } ], "source": [ "# -- activate the latest detector version --\n", "detector_version_summaries = client.describe_detector(detectorId=DETECTOR_NAME)['detectorVersionSummaries']\n", "latest_detector_version = max([det['detectorVersionId'] for det in detector_version_summaries])\n", "print('Latest Detector Version:', latest_detector_version)\n", "\n", "response = client.update_detector_version_status(\n", " detectorId = DETECTOR_NAME,\n", " detectorVersionId = latest_detector_version,\n", " status = 'ACTIVE'\n", ")\n", "print(\"\\n -- detector activated -- \")\n", "print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 8. Make predictions \n", "-----\n", "\n", "The following section will apply your detector to the latest 20% of your data to check the model performance. \n", "\n", "
💡 API for Making Predictions \n", "\n", "- **get_event_prediction**: Evaluates an event against a detector version. If a version ID is not provided, the detector’s (ACTIVE) version is used. \n", "\n", "\n", "
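The prediction cell below indexes straight into the response (`pred['modelScores'][0]...`), which raises if a key is absent. A more defensive reading of the same structure might look like the sketch here; `parse_prediction`, the sample response and the model name `ato_model` are hypothetical, and the response layout is assumed to match what the prediction cell reads.

```python
def parse_prediction(pred, model_name):
    '''Hypothetical helper: pull the insight score and all rule outcomes out of a response dict.'''
    score_key = '{0}_insightscore'.format(model_name)
    score = None
    for model_score in pred.get('modelScores', []):
        scores = model_score.get('scores', {})
        if score_key in scores:
            score = scores[score_key]
            break
    outcomes = []
    for rule_result in pred.get('ruleResults', []):
        outcomes.extend(rule_result.get('outcomes', []))
    return score, outcomes


# Hand-written sample shaped like the fields the prediction cell uses.
sample_response = {
    'modelScores': [{'scores': {'ato_model_insightscore': 831.0}}],
    'ruleResults': [{'ruleId': 'rules1_ato_model', 'outcomes': ['challenge']}],
}
print(parse_prediction(sample_response, 'ato_model'))
print(parse_prediction({}, 'ato_model'))
```

Returning `None` and an empty list for a missing score or outcome keeps failed lookups distinguishable from real scores, instead of mixing a sentinel such as -999 into the score column.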
\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "N_pred = 1000" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 193 ms, sys: 107 ms, total: 301 ms\n", "Wall time: 1min 59s\n" ] } ], "source": [ "%%time\n", "def _predict(record):\n", " \"\"\"\n", " Get prediction on one event\n", " \"\"\"\n", " event_id = str(record[0])\n", " entity_id = str(record[1])\n", " event_timestamp = str(record[2])\n", " label_timestamp = str(record[4])\n", " \n", " try:\n", " rec_content = {event_variables[i]: str(record[5:][i]) for i in range(len(event_variables)) if pd.isnull(record[5+i])==False}\n", " pred = client.get_event_prediction(\n", " detectorId = DETECTOR_NAME,\n", " detectorVersionId = latest_detector_version,\n", " eventId = event_id,\n", " eventTypeName = EVENT_TYPE,\n", " eventTimestamp = event_timestamp, \n", " entities = [{\n", " 'entityType': entity_type, \n", " 'entityId': entity_id\n", " }],\n", " eventVariables = rec_content) \n", " record.append(pred['modelScores'][0]['scores'][\"{0}_insightscore\".format(MODEL_NAME)])\n", " record.append(pred['ruleResults'][0]['outcomes'])\n", " except:\n", " record.append(\"-999\")\n", " record.append([\"error\"])\n", " \n", " return record\n", "\n", "# -- get predictions in parallel --\n", "if TRAINING_PERC < 1:\n", " df_test = df[df['EVENT_TIMESTAMP'] > test_split_time]\n", "else: \n", " # used all data to train the model, GEP on the last 100 events to demonstrate the API\n", " df_test = df.tail(100)\n", "\n", "df_test = df_test.head(N_pred)\n", "cols_keep = ['EVENT_ID', 'ENTITY_ID', 'EVENT_TIMESTAMP', 'EVENT_LABEL', 'LABEL_TIMESTAMP'] + event_variables\n", "df_list = df_test[cols_keep].values.tolist()\n", "with Pool(processes = 5) as p:\n", " result = p.map(_predict, df_list)\n", "predictions = pd.DataFrame(result, columns = cols_keep + ['score', 'outcomes'])" ] }, { 
"cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " EVENT_ID ENTITY_ID EVENT_TIMESTAMP EVENT_LABEL LABEL_TIMESTAMP ip useragent fp session_id are_credentials_valid score outcomes\n", "0 ev7fz3ebsvgs97 A3403324411 2022-05-27T21:56:54Z NaN NaN 142.135.19.20 Mozilla/5.0 (X11; Linux x86_64; rv:1.9.7.20) G... FP-e156e30473ee46df SID-98c30b09877d6031 True 432.0 [approve]\n", "1 evhfmud6ke2enc A7468806085 2022-05-27T21:56:59Z NaN NaN 177.186.170.126 Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_8_1... FP-f54a5390ef306380 SID-274478ebb420a1fc True 34.0 [approve]\n", "2 ev8ea5sbd0r5c9 A0232763213 2022-05-27T21:57:32Z NaN NaN 27.238.226.94 Opera/9.85.(X11; Linux i686; ps-AF) Presto/2.9... FP-c9a75b1f6054c108 SID-3b36e5cb3f499881 True 53.0 [approve]\n", "3 evnqe2cutauoqx A6099269191 2022-05-27T21:57:41Z NaN NaN 144.67.153.38 Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like... FP-b4e68a1de7a0e00a SID-41efa10b3354f9f2 True 831.0 [challenge]\n", "4 evpl1oijpwktuv A7415207154 2022-05-27T21:57:44Z NaN NaN 169.106.206.11 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0)... FP-ed13b8f486bc0b9f SID-6bbfb3b371ddd3b3 True 178.0 [approve]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# -- check the first 5 rows --\n", "predictions.head()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# -- check the distribution by labels --\n", "plt.figure(figsize = (20,8))\n", "np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)\n", "plt.hist([predictions[predictions['EVENT_LABEL'].isin(label_mapper['LEGIT'])]['score'], \n", " predictions[predictions['EVENT_LABEL'].isin(label_mapper['FRAUD'])]['score'], \n", " predictions[predictions['EVENT_LABEL'].isna()]['score']], bins = 50)\n", "plt.legend([\"Legit\", \"Fraud\", \"Unlabeled\"], fontsize=12)\n", "plt.title(\"Predicted Score Distribution By Label\")\n", "plt.xlabel(\"Predicted Score\")\n", "plt.ylabel(\"Frequency\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# -- check AUC --\n", "predictions['event_label_int'] = np.nan\n", "predictions.loc[predictions['EVENT_LABEL'].isna(), 'event_label_int'] = 0\n", "predictions.loc[predictions['EVENT_LABEL'].isin(label_mapper['LEGIT']), 'event_label_int'] = 0\n", "predictions.loc[predictions['EVENT_LABEL'].isin(label_mapper['FRAUD']), 'event_label_int'] = 1\n", " \n", "fpr, tpr, threshold = roc_curve(predictions['event_label_int'], predictions['score'])\n", "test_auc = auc(fpr,tpr)\n", "\n", "plt.figure(figsize=(8,8))\n", "plt.plot(fpr, tpr, color='darkorange', lw=2, label=f\"AUC: {test_auc:.2f}\") \n", "plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')\n", "plt.title(MODEL_NAME+\" ROC Curve (Test)\")\n", "plt.xlabel('False Positive Rate (FPR)')\n", "plt.ylabel('True Positive Rate (FPR)')\n", "plt.legend(loc=\"lower right\", fontsize=12)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 9. 
Get reason codes for GEP calls\n" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EVENT_ID: ev7fz3ebsvgs97\n", " eventVariableNames relativeImpact logOddsImpact\n", "0 [are_credentials_valid, user, useragent] 5 2.132089\n", "1 [are_credentials_valid, user] 0 0.007980\n", "2 [are_credentials_valid, fp, user] 0 -0.011876\n", "3 [user] 0 -0.151291\n", "4 [are_credentials_valid, ip, user] 5 3.638705\n" ] } ], "source": [ "for eid in predictions['EVENT_ID'].head(1):\n", " # Retrieve event_prediction_time\n", " response = client.list_event_predictions(\n", " eventId={\n", " 'value': eid\n", " },\n", " eventType={\n", " 'value': EVENT_TYPE\n", " },\n", " detectorId={\n", " 'value': DETECTOR_NAME\n", " },\n", " detectorVersionId={\n", " 'value': latest_detector_version\n", " },\n", " predictionTimeRange={\n", " 'startTime': '2022-01-01T00:00:00Z',\n", " 'endTime': '2024-01-01T00:00:00Z'\n", " }\n", " )\n", " last_prediction_timestamp = response['eventPredictionSummaries'][-1]['predictionTimestamp']\n", " \n", " # Get event prediction explanations\n", " response = client.get_event_prediction_metadata(\n", " eventId=eid,\n", " eventTypeName=EVENT_TYPE,\n", " detectorId=DETECTOR_NAME,\n", " detectorVersionId=latest_detector_version,\n", " predictionTimestamp=last_prediction_timestamp\n", " )\n", " print('EVENT_ID: ', eid)\n", " print(pd.DataFrame(response['evaluatedModelVersions'][0]['evaluations'][0]['predictionExplanations']['aggregatedVariablesImpactExplanations']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) Write Predictions to File\n", "\n", "
💡 Write Predictions \n", "\n", "- You can write your prediction dataset to a CSV to manually review predictions\n", "- Simply add a cell below and copy the code below\n", "\n", "
\n", "\n", "\n", "\n", "```python\n", "\n", "# -- optionally write predictions to a CSV file -- \n", "predictions.to_csv(MODEL_NAME + \".csv\", index=False)\n", "# -- or to a XLS file \n", "predictions.to_excel(MODEL_NAME + \".xlsx\", index=False)\n", "\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# predictions.to_csv(MODEL_NAME + \".csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### (Optional) Clean up resources" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CLEAN_UP_RESOURCES = False" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if CLEAN_UP_RESOURCES:\n", " \n", " # Deactivate detector\n", " response = client.update_detector_version_status(\n", " detectorId=DETECTOR_NAME,\n", " detectorVersionId=latest_detector_version,\n", " status='INACTIVE'\n", " )\n", "\n", " # Delete detector version\n", " response = client.get_detector_version(\n", " detectorId=DETECTOR_NAME,\n", " detectorVersionId=latest_detector_version\n", " )\n", " print(f'De-activating detector {DETECTOR_NAME}')\n", " while response['status']=='ACTIVE':\n", " time.sleep(60)\n", " response = client.get_detector_version(\n", " detectorId=DETECTOR_NAME,\n", " detectorVersionId=latest_detector_version\n", " )\n", " response = client.delete_detector_version(\n", " detectorId=DETECTOR_NAME,\n", " detectorVersionId=latest_detector_version\n", " )\n", " print(f'Deleted detector {DETECTOR_NAME}, version {latest_detector_version}')\n", "\n", " # Delete rules\n", " rules = client.get_rules(\n", " detectorId=DETECTOR_NAME\n", " )\n", " for rule in rules['ruleDetails']:\n", " response = client.delete_rule(\n", " rule={\n", " 'detectorId': DETECTOR_NAME,\n", " 'ruleId': rule['ruleId'],\n", " 'ruleVersion': rule['ruleVersion']\n", " }\n", " )\n", " print(f'Deleted rules of detector {DETECTOR_NAME}')\n", "\n", " # Delete 
Detector\n", " response = client.delete_detector(\n", " detectorId=DETECTOR_NAME\n", " )\n", " print(f'Deleted detector {DETECTOR_NAME}')\n", "\n", " # De-activate model version\n", " response = client.update_model_version_status(\n", " modelId=MODEL_NAME,\n", " modelType=MODEL_TYPE,\n", " modelVersionNumber=model_version,\n", " status='INACTIVE'\n", " )\n", " response = client.get_model_version(\n", " modelId=MODEL_NAME,\n", " modelType=MODEL_TYPE,\n", " modelVersionNumber=model_version\n", " )\n", " print(f'De-activating model {MODEL_NAME}')\n", "\n", " while response['status']!='TRAINING_COMPLETE':\n", " time.sleep(60)\n", " response = client.get_model_version(\n", " modelId=MODEL_NAME,\n", " modelType=MODEL_TYPE,\n", " modelVersionNumber=model_version\n", " )\n", "\n", " # Delete model version\n", " response = client.delete_model_version(\n", " modelId=MODEL_NAME,\n", " modelType=MODEL_TYPE,\n", " modelVersionNumber=model_version\n", " )\n", "\n", " # Delete model\n", " response = client.delete_model(\n", " modelId=MODEL_NAME,\n", " modelType=MODEL_TYPE\n", " )\n", " print(f'Deleted model {MODEL_NAME}')\n", "\n", " # Delete stored events\n", " response = client.delete_events_by_event_type(\n", " eventTypeName=EVENT_TYPE\n", " )\n", " response = client.get_delete_events_by_event_type_status(\n", " eventTypeName=EVENT_TYPE\n", " )\n", " while 'IN_PROGRESS' in response['eventsDeletionStatus']:\n", " time.sleep(60)\n", " print(f'Deleting event type: {EVENT_TYPE}')\n", " response = client.get_delete_events_by_event_type_status(\n", " eventTypeName=EVENT_TYPE\n", " )\n", "\n", " # Delete event type\n", " response = client.delete_event_type(\n", " name=EVENT_TYPE\n", " )\n", " print(f'Deleted event type {EVENT_TYPE}')" ] } ], "metadata": { "kernelspec": { "display_name": "conda_mxnet_latest_p37", "language": "python", "name": "conda_mxnet_latest_p37" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }