{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fraud Detection for Automobile Claims: Data Exploration"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. \n",
    "\n",
    "![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Background\n",
    "\n",
    "This notebook is the first part of a series of notebooks that will demonstrate how to prepare, train, and deploy a model that detects fradulent autoclaims. In this notebook, we will focusing on data exploration. You can choose to run this notebook by itself or in sequence with the other notebooks listed below. Please see the [README.md](README.md) for more information about this use case implemented by this series of notebooks. \n",
    "\n",
    "\n",
    "1. **[Fraud Detection for Automobile Claims: Data Exploration](./0-AutoClaimFraudDetection.ipynb)**\n",
    "1. [Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features](./1-data-prep-e2e.ipynb)\n",
    "1. [Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model](./2-lineage-train-assess-bias-tune-registry-e2e.ipynb)\n",
    "1. [Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model](./3-mitigate-bias-train-model2-registry-e2e.ipynb)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Datasets and Exploratory Visualizations\n",
    "\n",
    "The dataset is synthetically generated and consists of customers and claims datasets.\n",
    "Here we will load them and do some exploratory visualizations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install seaborn==0.11.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Importing required libraries.\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns  # visualisation\n",
    "import matplotlib.pyplot as plt  # visualisation\n",
    "\n",
    "%matplotlib inline\n",
    "sns.set(color_codes=True)\n",
    "\n",
    "df_claims = pd.read_csv(\"./data/claims_preprocessed.csv\", index_col=0)\n",
    "df_customers = pd.read_csv(\"./data/customers_preprocessed.csv\", index_col=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df_claims.isnull().sum().sum())\n",
    "print(df_customers.isnull().sum().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This should return no null values in both of the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the bar graph customer gender\n",
    "df_customers.customer_gender_female.value_counts(normalize=True).plot.bar()\n",
    "plt.xticks([0, 1], [\"Male\", \"Female\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset is heavily weighted towards male customers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the bar graph of fraudulent claims\n",
    "df_claims.fraud.value_counts(normalize=True).plot.bar()\n",
    "plt.xticks([0, 1], [\"Not Fraud\", \"Fraud\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The overwhemling majority of claims are legitimate (i.e. not fraudulent)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the education categories\n",
    "educ = df_customers.customer_education.value_counts(normalize=True, sort=False)\n",
    "plt.bar(educ.index, educ.values)\n",
    "plt.xlabel(\"Customer Education Level\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the total claim amounts\n",
    "plt.hist(df_claims.total_claim_amount, bins=30)\n",
    "plt.xlabel(\"Total Claim Amount\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Majority of the total claim amounts are under $25,000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the number of claims filed in the past year\n",
    "df_customers.num_claims_past_year.hist(density=True)\n",
    "plt.suptitle(\"Number of Claims in the Past Year\")\n",
    "plt.xlabel(\"Number of claims per year\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most customers did not file any claims in the previous year, but some filed as many as 7 claims."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(\n",
    "    data=df_customers, vars=[\"num_insurers_past_5_years\", \"months_as_customer\", \"customer_age\"]\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Understandably, the `months_as_customer` and `customer_age` are correlated with each other. A younger person have been driving for a smaller amount of time and therefore have a smaller potential for how long they might have been a customer.\n",
    "\n",
    "We can also see that the `num_insurers_past_5_years` is negatively correlated with `months_as_customer`. If someone frequently jumped around to different insurers, then they probably spent less time as a customer of this insurer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_combined = df_customers.join(df_claims)\n",
    "sns.lineplot(x=\"num_insurers_past_5_years\", y=\"fraud\", data=df_combined);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fraud is positively correlated with having a greater number of insurers over the past 5 years. Customers who switched insurers more frequently also had more prevelance of fraud."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(x=df_customers[\"months_as_customer\"]);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(x=df_customers[\"customer_age\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our customers range from 18 to 75 years old. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_combined.groupby(\"customer_gender_female\").mean()[\"fraud\"].plot.bar()\n",
    "plt.xticks([0, 1], [\"Male\", \"Female\"])\n",
    "plt.suptitle(\"Fraud by Gender\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fraudulent claims come disproportionately from male customers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating a correlation matrix of fraud, gender, months as customer, and number of different insurers\n",
    "cols = [\n",
    "    \"fraud\",\n",
    "    \"customer_gender_male\",\n",
    "    \"customer_gender_female\",\n",
    "    \"months_as_customer\",\n",
    "    \"num_insurers_past_5_years\",\n",
    "]\n",
    "corr = df_combined[cols].corr()\n",
    "\n",
    "# plot the correlation matrix\n",
    "sns.heatmap(corr, annot=True, cmap=\"Reds\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Fraud is correlated with having more insurers in the past 5 years, and negatively correlated with being a customer for a longer period of time. These go hand in hand and mean that long time customers are less likely to commit fraud."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combined DataSets\n",
    "\n",
    "We have been looking at the indivudual datasets, now let's look at their combined view (join)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df_combined = pd.read_csv(\"./data/claims_customer.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_combined = df_combined.loc[:, ~df_combined.columns.str.contains(\"^Unnamed: 0\")]\n",
    "# get rid of an unwanted column\n",
    "df_combined.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_combined.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's explore any unique, missing, or large percentage category in the combined dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combined_stats = []\n",
    "\n",
    "\n",
    "for col in df_combined.columns:\n",
    "    combined_stats.append(\n",
    "        (\n",
    "            col,\n",
    "            df_combined[col].nunique(),\n",
    "            df_combined[col].isnull().sum() * 100 / df_combined.shape[0],\n",
    "            df_combined[col].value_counts(normalize=True, dropna=False).values[0] * 100,\n",
    "            df_combined[col].dtype,\n",
    "        )\n",
    "    )\n",
    "\n",
    "stats_df = pd.DataFrame(\n",
    "    combined_stats,\n",
    "    columns=[\"feature\", \"unique_values\", \"percent_missing\", \"percent_largest_category\", \"datatype\"],\n",
    ")\n",
    "stats_df.sort_values(\"percent_largest_category\", ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "sns.set_style(\"white\")\n",
    "\n",
    "corr_list = [\n",
    "    \"customer_age\",\n",
    "    \"months_as_customer\",\n",
    "    \"total_claim_amount\",\n",
    "    \"injury_claim\",\n",
    "    \"vehicle_claim\",\n",
    "    \"incident_severity\",\n",
    "    \"fraud\",\n",
    "]\n",
    "\n",
    "corr_df = df_combined[corr_list]\n",
    "corr = round(corr_df.corr(), 2)\n",
    "\n",
    "fix, ax = plt.subplots(figsize=(15, 15))\n",
    "\n",
    "mask = np.zeros_like(corr, dtype=np.bool)\n",
    "mask[np.triu_indices_from(mask)] = True\n",
    "\n",
    "ax = sns.heatmap(corr, mask=mask, ax=ax, annot=True, cmap=\"OrRd\")\n",
    "\n",
    "ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=10, ha=\"right\", rotation=45)\n",
    "ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=10, va=\"center\", rotation=0)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notebook CI Test Results\n",
    "\n",
    "This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.\n",
    "\n",
    "![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n",
    "\n",
    "![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/end_to_end|fraud_detection|0-AutoClaimFraudDetection.ipynb)\n"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science 3.0)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-1:081325390199:image/sagemaker-data-science-310-v1"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}