{ "cells": [ { "cell_type": "markdown", "id": "f5c1a587", "metadata": {}, "source": [ "# Notebook - Metrics for Evaluating an Identity Verification Solution \n", "\n", "This notebook provides a walkthrough of the steps mentioned in the blog \"Metrics for Evaluating an Identity Verification Solution\"" ] }, { "cell_type": "markdown", "id": "79de9505", "metadata": {}, "source": [ "## Step 1 - Data Prep \n", "\n", "1) Upload a dataset of images to an S3 bucket in a single folder. This should include all the images to be tested (both Genuine and Imposter)\n", "\n", "2) In the same bucket, upload an image pair mapping file in CSV format. Ensure that the column headers are exactly the same as is described in Section 1 of the blog.\n", "\n", "The resulting folder structure of your S3 bucket should look like the following:\n", "```\n", "S3 bucket\n", " │ \n", " │ -image_pair_mapping_file.csv\n", " │\n", " └── dataset/\n", " │ -file011.jpg\n", " │ -file012.jpg\n", " │ .\n", " │ .\n", "\n", "```" ] }, { "cell_type": "markdown", "id": "4e86bc8d", "metadata": {}, "source": [ "### Install python dependencies if necessary" ] }, { "cell_type": "code", "execution_count": null, "id": "2599ab76", "metadata": {}, "outputs": [], "source": [ "!pip install boto3 pandas seaborn matplotlib sklearn " ] }, { "cell_type": "markdown", "id": "a8a339c8", "metadata": {}, "source": [ "### Import libraries" ] }, { "cell_type": "code", "execution_count": null, "id": "ac140c42", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import io\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import confusion_matrix" ] }, { "cell_type": "code", "execution_count": null, "id": "bca0f5e8", "metadata": {}, "outputs": [], "source": [ "# Initialize clients\n", "rekognition = boto3.client('rekognition')\n", "s3 = boto3.client('s3')" ] }, { "cell_type": "code", "execution_count": null, "id": "c326ed4a", "metadata": {}, "outputs": [], "source": [ "# Enter the bucket name where you uploaded the dataset in Step 1\n", "bucket_name = \"ENTER_BUCKET_NAME\" # example: \"my_bucket_name\"\n", "dataset_folder = \"ENTER_FOLDER_NAME\" # example: \"dataset/\"" ] }, { "cell_type": "code", "execution_count": null, "id": "bbb0c538", "metadata": {}, "outputs": [], "source": [ "# Enter the image pair mapping (CSV) file name that you uploaded in Step 1\n", "csv_file = \"ENTER_FILE_NAME\" # example: \"dataset_v1.csv\"" ] }, { "cell_type": "markdown", "id": "3ad2d9dd", "metadata": {}, "source": [ "### Load the image pair mapping file - Genuine examples" ] }, { "cell_type": "code", "execution_count": null, "id": "e08c37c5", "metadata": {}, "outputs": [], "source": [ "obj = s3.get_object(Bucket= bucket_name , Key = csv_file)\n", "df = pd.read_csv(io.BytesIO(obj['Body'].read()), encoding='utf8')" ] }, { "cell_type": "code", "execution_count": null, "id": "c4f7446d", "metadata": {}, "outputs": [], "source": [ "df.loc[df['TEST'] == \"Genuine\"].head()" ] }, { "cell_type": "markdown", "id": "2ea8f2f4", "metadata": {}, "source": [ "### Load the image pair mapping file - Imposter examples" ] }, { "cell_type": "code", "execution_count": null, "id": "8b576549", "metadata": {}, "outputs": [], "source": [ "df.loc[df['TEST'] == \"Imposter\"].head()" ] }, { "cell_type": "markdown", "id": "2397d23f", "metadata": {}, "source": [ "## Step 2 - Probing the \"genuine\" and \"imposter image pair sets" ] }, { "cell_type": "markdown", "id": "e7d937e1", "metadata": {}, "source": [ "### Loop through all 
 { "cell_type": "markdown", "id": "e7d937e1", "metadata": {}, "source": [ "### Loop through all image pairs in the dataset and get similarity scores" ] },
 { "cell_type": "code", "execution_count": null, "id": "31a3dd68", "metadata": {}, "outputs": [], "source": [ "# Apply the Rekognition CompareFaces API\n", "def compare_faces(source_file, target_file, threshold=0):\n", "    response = rekognition.compare_faces(SimilarityThreshold=threshold,\n", "                                         SourceImage={'S3Object': {\n", "                                             'Bucket': bucket_name,\n", "                                             'Name': source_file}},\n", "                                         TargetImage={'S3Object': {\n", "                                             'Bucket': bucket_name,\n", "                                             'Name': target_file}})\n", "\n", "    # This notebook assumes 1 face in an image\n", "    if len(response['FaceMatches']) == 1:\n", "        return float(response['FaceMatches'][0]['Similarity'])\n", "    elif len(response['FaceMatches']) == 0:\n", "        return 0.0\n", "    else:\n", "        raise ValueError(\"Images should only contain 1 face for this notebook\")" ] },
 { "cell_type": "code", "execution_count": null, "id": "20a61c92", "metadata": {}, "outputs": [], "source": [ "df_similarity = df.copy()\n", "df_similarity[\"SIMILARITY\"] = None\n", "\n", "for index, row in df.iterrows():\n", "    source_file = dataset_folder + row[\"SOURCE\"]\n", "    target_file = dataset_folder + row[\"TARGET\"]\n", "    response_score = compare_faces(source_file, target_file)\n", "    df_similarity.at[index, \"SIMILARITY\"] = response_score\n", "\n", "# Ensure a numeric dtype for the plots and statistics below\n", "df_similarity[\"SIMILARITY\"] = df_similarity[\"SIMILARITY\"].astype(float)\n", "\n", "df_similarity.head()" ] },
 { "cell_type": "markdown", "id": "57bee2fb", "metadata": {}, "source": [ "## Step 3 - Plot the distribution of similarity scores by test set" ] },
 { "cell_type": "code", "execution_count": null, "id": "dfbb487b", "metadata": {}, "outputs": [], "source": [ "sns.boxplot(data=df_similarity,\n", "            x=\"SIMILARITY\",\n", "            y=\"TEST\").set(xlabel='Similarity Score',\n", "                          ylabel=None,\n", "                          title=\"Similarity Score Distribution\")\n", "plt.show()" ] },
 { "cell_type": "markdown", "id": "2c27f27d", "metadata": {}, "source": [ "### Descriptive Statistics" ] },
 { "cell_type": "code", "execution_count": null, "id": "cf1d27e6", "metadata": {}, "outputs": [], "source": [ "# Summary statistics of the similarity scores for each test set\n", "stats_rows = []\n", "tests = [\"Genuine\", \"Imposter\"]\n", "\n", "for test in tests:\n", "    scores = df_similarity.loc[df_similarity['TEST'] == test, 'SIMILARITY']\n", "    stats_rows.append({'test': test,\n", "                       'count': scores.count(),\n", "                       'min': scores.min(),\n", "                       'max': scores.max(),\n", "                       'mean': scores.mean(),\n", "                       'median': scores.median(),\n", "                       'std': scores.std()})\n", "\n", "df_descriptive_stats = pd.DataFrame(stats_rows, columns=['test', 'count', 'min', 'max', 'mean', 'median', 'std'])\n", "df_descriptive_stats" ] },
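 { "cell_type": "markdown", "id": "f3c62a43", "metadata": {}, "source": [ "### Optional - Histogram of similarity scores by test set\n", "\n", "A complementary view of the same scores (not part of the original blog walkthrough): overlapping histograms make it easier to see where the Genuine and Imposter distributions overlap, which is the region that produces false matches and false non-matches." ] },
 { "cell_type": "code", "execution_count": null, "id": "f4d73b54", "metadata": {}, "outputs": [], "source": [ "# Overlapping histograms of the similarity scores for the two test sets\n", "sns.histplot(data=df_similarity, x=\"SIMILARITY\", hue=\"TEST\", bins=20, element=\"step\").set(\n", "    xlabel='Similarity Score', title='Similarity Score Histogram by Test Set')\n", "plt.show()" ] },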
 { "cell_type": "markdown", "id": "29267ef7", "metadata": {}, "source": [ "## Step 4 - Calculate False Match and Non-Match Rates at different similarity threshold levels" ] },
 { "cell_type": "code", "execution_count": null, "id": "753f6e42", "metadata": {}, "outputs": [], "source": [ "similarity_thresholds = [80, 85, 90, 95, 96, 97, 98, 99]\n", "\n", "# create output columns\n", "df_cols = ['Similarity Threshold', 'TN', 'FN', 'TP', 'FP', 'FNMR (%)', 'FMR (%)']\n", "comparison_rows = []\n", "\n", "# create y_actual from the ground-truth labels, 1 == match, 0 == no match\n", "df_analysis = df_similarity.copy()\n", "df_analysis[\"y_actual\"] = (df_analysis[\"TEST\"] == \"Genuine\").astype(int)\n", "\n", "for threshold in similarity_thresholds:\n", "    # y_pred: predicted match if the similarity score meets the threshold\n", "    df_analysis[\"y_pred\"] = (df_analysis[\"SIMILARITY\"] >= threshold).astype(int)\n", "\n", "    tn, fp, fn, tp = confusion_matrix(df_analysis['y_actual'], df_analysis['y_pred']).ravel()\n", "    FNMR = 100 * fn / (tp + fn)  # false non-matches as a percentage of genuine attempts\n", "    FMR = 100 * fp / (fp + tn)   # false matches as a percentage of imposter attempts\n", "\n", "    comparison_rows.append({'Similarity Threshold': threshold,\n", "                            'TN': tn,\n", "                            'FN': fn,\n", "                            'TP': tp,\n", "                            'FP': fp,\n", "                            'FNMR (%)': FNMR,\n", "                            'FMR (%)': FMR})\n", "\n", "comparison_df = pd.DataFrame(comparison_rows, columns=df_cols)\n", "comparison_df" ] }
], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }