## Compute Requirements
Make sure you use an instance with at least 32G of memory and 100G of storage.

To run the evaluations, we used an `ml.r5.12xlarge` instance with 48 vCPUs and 384G of memory.
A smaller instance, such as `ml.m5.4xlarge` with 16 vCPUs and 64G of memory, can run the same evaluations.
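
As a quick sanity check before running the notebook, the sketch below reports the total memory and free disk space of the current machine and compares them with the minimums stated above. The helper name `check_resources` is hypothetical. The sketch assumes a Linux host (for the `os.sysconf` names) and reads "100G of storage" as free space in the working directory. The operating system usually reports slightly less memory than an instance's nominal size, so treat a near miss as a warning rather than a failure.

```python
import os
import shutil

GIB = 1024 ** 3

def check_resources(min_mem_gib=32, min_disk_gib=100, path="."):
    """Return (total memory in GiB, free disk in GiB, whether both minimums are met)."""
    # Total physical memory; these sysconf names are available on Linux.
    mem_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    # Free space on the filesystem that contains `path`.
    free_gib = shutil.disk_usage(path).free / GIB
    ok = mem_gib >= min_mem_gib and free_gib >= min_disk_gib
    return mem_gib, free_gib, ok

mem_gib, free_gib, ok = check_resources()
print(f"memory: {mem_gib:.0f} GiB, free disk: {free_gib:.0f} GiB, sufficient: {ok}")
```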

Please use a notebook kernel with PyTorch already installed, such as `conda_pytorch_p38` or `conda_pytorch_p36`.
Install the dependencies after selecting the kernel.

# Install dependencies

In [None]:
%pip install -qU -r requirements.txt

# Download and unzip Kaggle dataset
We use the [IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection/data) dataset in our experiments. Before downloading the dataset, make sure you have downloaded your Kaggle API token and placed it in `~/.kaggle/kaggle.json`. Please refer to the [Kaggle API documentation](https://github.com/Kaggle/kaggle-api#api-credentials) for more details. You also need to accept [the competition rules](https://www.kaggle.com/competitions/ieee-fraud-detection/rules) before downloading the data.
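
If the download fails, a missing or overly permissive token file is a common cause. The sketch below checks that `~/.kaggle/kaggle.json` exists and is not accessible by other users. The helper name `check_kaggle_token` is hypothetical and not part of the Kaggle API. The permission check assumes a POSIX filesystem.

```python
import os
import stat
from pathlib import Path

def check_kaggle_token(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return a list of problems found with the Kaggle API token file."""
    path = Path(path)
    problems = []
    if not path.is_file():
        problems.append(f"{path} not found")
        return problems
    # Flag any group or other permission bits; the token is a secret.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{path} is accessible by other users; consider chmod 600")
    return problems

print(check_kaggle_token() or "Kaggle token looks OK")
```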

In [None]:
!kaggle competitions download -c ieee-fraud-detection -p ./data/ieee-fraud-detection/ --force

In [None]:
!unzip ./data/ieee-fraud-detection/ieee-fraud-detection.zip -d ./data/ieee-fraud-detection/

# Create training and test splits
Fraud labels are available only for the competition's training data, so we build both splits from it. We join the transaction and identity tables into a single dataframe on the `TransactionID` column using an inner join. Not all transactions have identity information, so the join leaves a total of 144,233 transactions. We then sort the transactions by the timestamp column (`TransactionDT`), use the first 80% (115,386 transactions) for training, and retain the last 20% (28,847 transactions) for testing.
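
To illustrate the procedure on a small scale, the sketch below applies the same join, sort, and 80/20 split to two toy tables. The table contents and the helper name `chronological_split` are made up for illustration. Only the column names `TransactionID` and `TransactionDT` come from the dataset.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (made-up values).
transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5, 6, 7],
    "TransactionDT": [50, 10, 40, 20, 30, 70, 60],
    "isFraud": [0, 0, 1, 0, 0, 1, 0],
})
identity = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5],
    "DeviceType": ["mobile", "desktop", "mobile", "desktop", "mobile"],
})

def chronological_split(identity, transactions, train_frac=0.8):
    """Inner-join the tables, sort by time, and split into (train, test)."""
    # The inner join drops transactions without identity rows (IDs 6 and 7 here).
    merged = pd.merge(identity, transactions, on="TransactionID", how="inner")
    merged = merged.sort_values("TransactionDT")
    n_train = int(len(merged) * train_frac)
    return merged.iloc[:n_train], merged.iloc[n_train:]

train, test = chronological_split(identity, transactions)
print(len(train), len(test))

# Every training transaction precedes every test transaction in time.
print(train["TransactionDT"].max() <= test["TransactionDT"].min())

# The counts for the full dataset follow from the same truncation rule.
n_total = 144_233
n_train_full = int(n_total * 0.8)
print(n_train_full, n_total - n_train_full)
```

Splitting by time rather than at random keeps later transactions out of the training set, so the test set resembles scoring future transactions.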

In [None]:
import numpy as np
import pandas as pd

In [None]:
df_identity = pd.read_csv('./data/ieee-fraud-detection/train_identity.csv')

In [None]:
df_transaction = pd.read_csv('./data/ieee-fraud-detection/train_transaction.csv')

In [None]:
df = pd.merge(df_identity, df_transaction, on='TransactionID', how='inner')

In [None]:
df.sort_values(by='TransactionDT', ascending=True, inplace=True)

In [None]:
n_total = len(df)
n_train = int(n_total*0.8)
n_test = n_total - n_train

In [None]:
print(f"Total transactions: {n_total}, training transactions: {n_train}, testing transactions: {n_test}")

In [None]:
df_train = df.head(n_train)
df_test = df.tail(n_test)

In [None]:
df_train.to_parquet("./data/train.parquet", index=False)
df_test.to_parquet("./data/test.parquet", index=False)