{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Amazon Fraud Detector - Data Profiler Notebook \n", "\n", "\n", "### Dataset Guidance\n", "-------\n", "\n", "AWS Fraud Detector models support a flexible schema, enabling you to train models to your specific data and business need. This notebook was developed to help you profile your data and identify potenital issues before you train an AFD model. The following summarizes the minimimum CSV File requirements:\n", "\n", "* The files are in CSV UTF-8 (comma delimited) format (*.csv).\n", "* The file should contain at least 10k rows and the following __two__ required fields: \n", "\n", " * Event timestamp \n", " * Fraud label \n", " \n", "* The maximum file size is 10 gigabytes (GB). \n", "\n", "* The following dates and datetime formats are supported:\n", " * Dates: YYYY-MM-DD (eg. 2019-03-21)\n", " * Datetime: YYYY-MM-DD HH:mm:ss (eg. 2019-03-21 12:01:32) \n", " * ISO 8601 Datetime: YYYY-MM-DDTHH:mm:ss+/-HH:mm (eg. 2019-03-21T20:58:41+07:00)\n", "\n", "* The decimal precision is up to four decimal places.\n", "* Numeric data should not contain commas and currency symbols. \n", "* Columns with values that could contain commas, such as address or custom text should be enclosed in double quotes. \n", "\n", "\n", "\n", "### Getting Started with Data \n", "-------\n", "The following general guidance is provided to get the most out of your AWS Fraud Detector Model. \n", "\n", "* Gathering Data - The AFD model requires a minimum of 10k records. We recommend that a minimum of 6 weeks of historic data is collected, though 3 - 6 months of data is preferable. As part of the process the AFD model partitions your data based on the Event Timestamp such that performance metrics are calculated on the out of sample (latest) data, thus the format of the event timestamp is important. \n", "\n", " \n", "* Data & Label Maturity: As part of the data gathering process we want to insure that records have had sufficient time to “mature”, i.e. that enough time has passed to insure “non-fraud\" and “fraud” records have been correctly identified. It often takes 30 - 45 days (or more) to correctly identify fraudulent events, because of this it is important to insure that the latest records are at least 30 days old or older. \n", "\n", " \n", "* Sampling: The AFD training process will sample and partition historic based on event timestamp. There is no need to manually sample the data and doing so may negatively influence your model’s results. \n", "\n", " \n", "* Fraud Labels: The AFD model requires that a minimum of 500 observations are identified and labeled as “fraud”. As noted above, fraud label maturity is important. Insure that extracted data has sufficiently matured to insure that fraudulent events have been reliably found. \n", " \n", " \n", "* Custom Fields: the AFD model requires 2 fields: event timestamp and fraud label. The more custom fields you provide the better the AFD model can differentiate between fraud and not fraud. \n", " \n", " \n", "* Nulls and Missing Values: AFD model handles null and missing values, however the percentage of nulls in key fields should be limited. Especially timestamp and fraud label columns should not contain any missing values. \n", "\n", " \n", "If you would like to know more, please check out the [Fraud Detector's Documentation](https://docs.aws.amazon.com/frauddetector/). \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.core.display import display, HTML\n", "from IPython.display import clear_output\n", "display(HTML(\"\"))\n", "from IPython.display import IFrame\n", "# ------------------------------------------------------------------\n", "import numpy as np\n", "import pandas as pd\n", "pd.set_option('display.max_rows', 500)\n", "pd.set_option('display.max_columns', 500)\n", "pd.set_option('display.width', 1000)\n", "pd.options.display.float_format = '{:.4f}'.format\n", "\n", "# -- AWS stuff -- \n", "import boto3\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check required packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib3\n", "import numpy as np\n", "import pandas as pd\n", "import awswrangler as wr\n", "import boto3\n", "import os\n", "import json\n", "import re\n", "import sys\n", "from jinja2 import Environment, PackageLoader, FileSystemLoader\n", "import itertools\n", "from pandas import DataFrame, Series\n", "import scipy.stats as ss\n", "import logging" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### If you see errors of missing modules, use pip to install those modules" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !pip install awswrangler\n", "# !pip install Jinja2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Amazon Fraud Detector Profiling \n", "-----\n", "\n", "from github download and copy the afd_profile.py python program and template directory to your notebook \n", "\n", "
afd_profile.py \n", "\n", "- afd_profile.py - is the python package which will generate your profile report. \n", "\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# -- get this package from github -- \n", "import afd_profile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### File & Field Mapping\n", "-----\n", "Simply map your file and field names to the required config values. \n", "\n", "
Map the Required fields \n", "\n", "- CSVFilePath: S3 path to your CSV file. E.g. s3://mybucket/myfile.csv\n", "- EventTimestampColumn: Column name of event timestamp in your CSV file. This is a mandatory column for AFD.\n", "- LabelColumn: Column name of label in your CSV file. This is a mandatory column for AFD. \n", " **note: the profiler will identify the \"rare\" case and assume that it is fraud**\n", "- FileDelimiter: S3 path to your CSV file. E.g. s3://mybucket/myfile.csv\n", "- ProfileCSV: Do you want to profile your data? This will generate an HTML report of your data statistics and potential AFD validation errors.\n", "- FeatureCorr: Do you want to show pair-wise feature correlation in report? The correlation shows that for each pair of features how much one feature depends on the other. This calculation may take some time.\n", "- ReportSuffix: (Optional) Suffix name of profiling report. The report will be named as report_.html.\n", "- FraudLabels: (Optional) Specify label values to be mapped to FRAUD, separated by comma. E.g. suspicious, fraud. If you don't want to map labels, leave this option blank and reprot will show distribution of original label values.\n", "\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "config = {\n", " \"CSVFilePath\": \"s3://\",\n", " \"FileDelimiter\": \",\",\n", " \"EventTimestampColumn\": \"EVENT_TIMESTAMP\",\n", " \"LabelColumn\": \"EVENT_LABEL\",\n", " \"ProfileCSV\": \"Yes\",\n", " \"FraudLabels\": \" \",\n", " \"FeatureCorr\": \"Yes\",\n", " \"ReportSuffix\": \"\"\n", "} " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Run Profiler\n", "-----\n", "The profiler will read your file and produce an HTML file as a result which will be displayed inline within this notebook. \n", " \n", "Note: you can also open **report.html** in a separate browser tab. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# -- generate the report object --\n", "report = afd_profile.generate_report(config)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }