{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## Compute Requirements\n",
    "Make sure you use an instance with at least 32G of memory and 100G of storage.\n",
    "\n",
    "To run evaluations we used `ml.r5.12xlarge` instance with 48 CPUs and 384G memory.\n",
    "A smaller instance can be used to run the same evaluations, for example, `ml.m5.4xlarge` with 16 CPUs and 64G memory.\n",
    "\n",
    "Please use notebook kernel with pytorch already installed. Using `conda_pytorch_p38` or `conda_pytorch_p36` will work. \n",
    "Install dependencies after selecting the kernel."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Install dependencies"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "%pip install -qU -r requirements.txt"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Download and unzip Kaggle dataset\n",
    "We use [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/data) dataset in our experiments. Make sure you download API token and place it in `~/.kaggle/kaggle.json` before downloading the dataset. Please refer to the [Kaggle API documentation](https://github.com/Kaggle/kaggle-api#api-credentials) for more details. You also need to accept [the competition rules](https://www.kaggle.com/competitions/ieee-fraud-detection/rules) before downloading the data."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!kaggle competitions download -c ieee-fraud-detection -p ./data/ieee-fraud-detection/ --force"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "!unzip ./data/ieee-fraud-detection/ieee-fraud-detection.zip -d ./data/ieee-fraud-detection/"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Create training and test splits\n",
    "Fraud labels are only available for competition's training data. We sort transactions by timestamp (TransactionDT) column, and use first 80% of the competition's training data to train our models, and retain the last 20% of transactions for testing. We join transaction and identity tables into a single dataframe using TransactionID column. Note that not all of the transactions have identity information, so we are left with a total of 144,233 transactions. And, 115,386 transactions will be used to training, and 28,847 transactions will be used for testing."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "df_identity = pd.read_csv('./data/ieee-fraud-detection/train_identity.csv')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "df_transaction = pd.read_csv('./data/ieee-fraud-detection/train_transaction.csv')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "df=pd.merge(df_identity, df_transaction, on='TransactionID', how='inner')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "df.sort_values(by='TransactionDT', ascending=True, inplace=True)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "n_total = len(df)\n",
    "n_train = int(n_total*0.8)\n",
    "n_test  = n_total - n_train"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(f\"Total transactions: {n_total}, training transactions: {n_train}, testing transaction: {n_test}\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "df_train = df.head(n_train)\n",
    "df_test  = df.tail(n_test)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "df_train.to_parquet(\"./data/train.parquet\", index=False)\n",
    "df_test.to_parquet(\"./data/test.parquet\", index=False)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p38",
   "language": "python",
   "name": "conda_pytorch_p38"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}