{ "cells": [ { "cell_type": "markdown", "source": [ "## Compute Requirements\n", "Make sure you use an instance with at least 32G of memory and 100G of storage.\n", "\n", "To run evaluations we used `ml.r5.12xlarge` instance with 48 CPUs and 384G memory.\n", "A smaller instance can be used to run the same evaluations, for example, `ml.m5.4xlarge` with 16 CPUs and 64G memory.\n", "\n", "Please use notebook kernel with pytorch already installed. Using `conda_pytorch_p38` or `conda_pytorch_p36` will work. \n", "Install dependencies after selecting the kernel." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "markdown", "source": [ "# Install dependencies" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "%pip install -qU -r requirements.txt" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "# Download and unzip Kaggle dataset\n", "We use [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/data) dataset in our experiments. Make sure you download API token and place it in `~/.kaggle/kaggle.json` before downloading the dataset. Please refer to the [Kaggle API documentation](https://github.com/Kaggle/kaggle-api#api-credentials) for more details. You also need to accept [the competition rules](https://www.kaggle.com/competitions/ieee-fraud-detection/rules) before downloading the data." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "!kaggle competitions download -c ieee-fraud-detection -p ./data/ieee-fraud-detection/ --force" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "!unzip ./data/ieee-fraud-detection/ieee-fraud-detection.zip -d ./data/ieee-fraud-detection/" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "# Create training and test splits\n", "Fraud labels are only available for competition's training data. We sort transactions by timestamp (TransactionDT) column, and use first 80% of the competition's training data to train our models, and retain the last 20% of transactions for testing. We join transaction and identity tables into a single dataframe using TransactionID column. Note that not all of the transactions have identity information, so we are left with a total of 144,233 transactions. And, 115,386 transactions will be used to training, and 28,847 transactions will be used for testing." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "df_identity = pd.read_csv('./data/ieee-fraud-detection/train_identity.csv')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "df_transaction = pd.read_csv('./data/ieee-fraud-detection/train_transaction.csv')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "df=pd.merge(df_identity, df_transaction, on='TransactionID', how='inner')" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "df.sort_values(by='TransactionDT', ascending=True, inplace=True)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "n_total = len(df)\n", "n_train = int(n_total*0.8)\n", "n_test = n_total - n_train" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "print(f\"Total transactions: {n_total}, training transactions: {n_train}, testing transaction: {n_test}\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "df_train = df.head(n_train)\n", "df_test = df.tail(n_test)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "df_train.to_parquet(\"./data/train.parquet\", index=False)\n", "df_test.to_parquet(\"./data/test.parquet\", index=False)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p38", "language": "python", "name": "conda_pytorch_p38" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 5 }