{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ISO 20022 PACS.008 Synthetic Data Generation Notebook\n", "\n", "This notebook demonstrates generation of synthetic raw data and synthetic raw labeled data for ISO20022 pacs.008 serial payment message. The raw labeled data is used to generate training dataset for training a ML model for predicting whether a pacs.008 Cross-Border payment compliant (per CBPR+ specification) XML message will be accepted or rejected by a FI receiving the pacs.008 message. The raw datasets are uploaded to an Amazon S3 bucket that you specify in the notebook. The process of generating raw synthetic data is as follows:\n", "\n", "1. Generate pacs.008 XML messages using `rapide`, an open source ISO20022 message generator available at https://github.com/aws-samples/iso20022-message-generator.\n", "1. Upload the zip file to `data` directory in the same location as this notebook. If `data` directory doesn't exist then create it.\n", "1. Modify the location of zip file(s) in the notebook.\n", "\n", "\n", "**Dataset in this notebook:** For this sampe notebook, `rapide` was used to generate pacs.008 messages using BIC and LEI datasets generated using the accompanying Python notebook [iso20022_lei_bic_dataset.ipynb](./iso20022_lei_bic_datasets.ipynb). This notebook generates:\n", "* Fake BIC codes for following countries:\n", " * CA - Canada\n", " * GB - Greate Britain\n", " * IE - Ireland\n", " * IN - India\n", " * MX - Mexico\n", " * TH - Thailand\n", " * US - United States\n", "* A subset of LEI entities as described in the notebook\n", "\n", "This sample notebook demonstrates generation of raw and labeled raw datasets, it uses a tar gzipped files containing pacs.008 XML messages generated by rapide:\n", "* `iso20022-data/iso20022-raw-messages.tar.gz`: Contains thousands of pacs.008 XML messages \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Environment Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install packages" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before running the code, please make sure you have this libraries and that the pandas version is the right one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install xmltodict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "String generators based on regex:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install rstr\n", "!pip install exrex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Authentication and Authorization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import boto3\n", "import sagemaker\n", "from sagemaker import get_execution_role\n", "\n", "sm_client = boto3.Session().client('sagemaker')\n", "sm_session = sagemaker.Session()\n", "region = boto3.session.Session().region_name\n", "\n", "role = get_execution_role()\n", "print (\"Notebook is running with assumed role {}\".format (role))\n", "print(\"Working with AWS services in the {} region\".format(region))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Provide S3 bucket name\n", "Create a S3 bucket where synthentic raw and labeled data will be stored and provide the bucket name here." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Working directory for the notebook\n", "WORKDIR = os.getcwd()\n", "BASENAME = os.path.dirname(WORKDIR)\n", "\n", "# Create a directory for storing generated synthetic data\n", "synthetic_data_local_path = 'synthetic-data'\n", "if not os.path.exists(synthetic_data_local_path):\n", " os.makedirs(iso20022_data_path)\n", "\n", "# Store all prototype assets in this bucket\n", "# Update this variable with the S3 bucket name in the region where you are running this notebook in your account\n", "# Note that S3 bucket names are unique in the region\n", "s3_bucket_name = 'iso20022-prototype-t3'\n", "s3_bucket_uri = 's3://' + s3_bucket_name\n", "\n", "# Prefix for all files in this prototype\n", "prefix = 'iso20022'\n", "\n", "pacs008_prefix = prefix + '/pacs008'\n", "raw_pacs008_messages_prefix = pacs008_prefix + '/raw-messages'\n", "raw_data_prefix = pacs008_prefix + '/raw-data'\n", "labeled_data_prefix = pacs008_prefix + '/labeled-data'\n", "\n", "raw_data_location = s3_bucket_uri + '/' + raw_data_prefix\n", "labeled_data_location = s3_bucket_uri + '/' + labeled_data_prefix\n", "\n", "print(\"Downloading raw pacs008 messages from {}\".format (s3_bucket_uri + '/' + raw_pacs008_messages_prefix))\n", "print(f\"Raw synthetic data will be uploaded to {raw_data_location}\")\n", "print(f\"Labeled raw synthetic data will be uploaded to {labeled_data_location}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read Raw XML Messages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import xmltodict\n", "from collections import OrderedDict\n", "import re\n", "import zipfile\n", "import tarfile\n", "import glob\n", "import random\n", "import string\n", "\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler, MinMaxScaler\n", "from sklearn.exceptions import DataConversionWarning\n", "from sklearn.compose import make_column_transformer\n", "\n", "# Flatten .xml files and turn them into a Dictionary\n", "def flatten_dict(d):\n", " def items():\n", " for key, value in d.items():\n", " if isinstance(value, dict):\n", " for subkey, subvalue in flatten_dict(value).items():\n", " yield key + \".\" + subkey, subvalue\n", " else:\n", " yield key, value\n", "\n", " return OrderedDict(items()) \n", "\n", "# Parsing dictionary and transforming it into a data frame.\n", "# Each record is a data frame\n", "def parse_dict_df(file):\n", " with open(file) as fi:\n", " xml=xmltodict.parse(fi.read())\n", " flat= flatten_dict(xml)\n", " data = pd.DataFrame(flat, columns=flat.keys(), index=[0])\n", " return data\n", " \n", "# tar/gzipped archive\n", "raw_data_targz_path = 'iso20022-data/iso20022-raw-messages.tar.gz'\n", "extract_to_dir = './iso20022-raw-messages'\n", "print(f\"Extracting files from {raw_data_targz_path} to {extract_to_dir}\")\n", "with tarfile.open(raw_data_targz_path, 'r:gz') as gzip_ref:\n", " gzip_ref.extractall(extract_to_dir)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Raw Dataset From PACS 008 XML Messages" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "local_path ='./iso20022-raw-messages/messages/'\n", "print('Reading raw messages input data from local path {}'.format(local_path))\n", "files = glob.glob(os.path.join(local_path, \"*.xml\"))\n", "print(f\"No. 
"print(f\"No. of pacs008 XML message files: {len(files)}\")\n", "\n", "print(\"Reading XML messages and creating data frame...\")\n", "frames = [parse_dict_df(f) for f in files]\n", "df = pd.concat(frames, ignore_index=True)\n", "\n", "# Shuffle the rows\n", "df = df.sample(frac=1)\n", "print(f'Data frame shape after reading raw pacs008 XML messages: {df.shape}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transform Column Names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Trim column names\n", "for col in list(df.columns):\n", " new_col = col.replace('RequestPayload.h:','').replace('RequestPayload.Doc:','').replace('.h:','.').replace('.Doc:','.')\n", " new_col = new_col.replace('.', '_').replace(':', '_').replace('#','').replace('@','')\n", " df = df.rename(columns={col: new_col})\n", "print(f\"Columns: {df.columns}\")\n", "\n", "# Change data types.\n", "# All features are read with the 'object' data type and should be converted to their respective data types.\n", "date_fts = ['AppHdr_CreDt', 'Document_FIToFICstmrCdtTrf_GrpHdr_CreDtTm', \n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_IntrBkSttlmDt',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Dt'\n", " ]\n", "\n", "numeric_fts = ['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_IntrBkSttlmAmt_text', \n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstdAmt_text', \n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Amt_text',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_ChrgsInf_Amt_text',\n", " 'Document_FIToFICstmrCdtTrf_GrpHdr_NbOfTxs']\n", "\n", "for col in date_fts:\n", " df = df.astype({col: 'datetime64[ns]'})\n", "\n", "for col in numeric_fts:\n", " df = df.astype({col: 'float64'})\n", "\n", "raw_data_output_path = 'synthetic-data/raw_data.csv'\n", "print(f'Saving synthetic raw data with headers to {raw_data_output_path}')\n", "df.to_csv(raw_data_output_path, index=False)\n", "print(\"Completed synthetic raw dataset preparation.\")\n", "\n", "print(f\"Shape: {df.shape}\")\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload Raw Dataset to S3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_data_upload_location = sm_session.upload_data(\n", " path=\"synthetic-data/raw_data.csv\",\n", " bucket=s3_bucket_name,\n", " key_prefix=raw_data_prefix\n", ")\n", "\n", "print(f\"Uploaded raw data to: {raw_data_upload_location}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Experiments with data sizes**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "def get_country_code_from_bic(bic):\n", " return bic[4:6]\n", "\n", "# Get Country Code from BIC\n", "print(get_country_code_from_bic('SBININBB'))\n", "\n", "print(f\"Starting rows: {df.shape[0]}\")\n", "\n", "cdtr_ctry_col = 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry'\n", "dbtr_ctry_col = 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry'\n", "\n", "# Count rows in which each country appears as either creditor or debtor country.\n", "# A message is counted under both its creditor and its debtor country, so the\n", "# per-country counts can add up to more than the number of rows.\n", "country_names = {'US': 'US', 'CA': 'Canada', 'IN': 'India', 'GB': 'GB', 'IE': 'Ireland', 'MX': 'Mexico', 'TH': 'Thailand'}\n", "country_dfs = {}\n", "for code, name in country_names.items():\n", " country_dfs[code] = df[(df[cdtr_ctry_col]==code) | (df[dbtr_ctry_col]==code)]\n", " print(f\"Total {name} rows: {country_dfs[code].shape[0]}\")\n", "\n", "us_df = country_dfs['US']\n", "us_duplicated_df = us_df[us_df.duplicated()]\n", "print(f\"Duplicated US df shape {us_duplicated_df.shape}\")\n", "removed_us_dups_df = us_df.drop_duplicates()\n", "print(f\"Removed Duplicated US df shape {removed_us_dups_df.shape}\")\n", "\n", "print(f\"Total rows: {sum(c.shape[0] for c in country_dfs.values())}\")\n", "\n", "test_final_df = pd.concat(list(country_dfs.values()))\n", "test_f_count = test_final_df.shape[0]\n", "print(f\"Total Test Final DF rows: {test_f_count}\")\n", "\n", "print(\"Duplicates in test_final_df =>\")\n", "print(test_final_df.duplicated())\n", "\n", "duplicated_df = test_final_df[test_final_df.duplicated()]\n", "print(f\"Duplicated df shape {duplicated_df.shape}\")\n", "removed_dups_df = test_final_df.drop_duplicates()\n", "print(f\"Removed Duplicated df shape {removed_dups_df.shape}\")\n", "\n", "print(\"Unique from countries: \")\n", "print(pd.unique(df[dbtr_ctry_col]))\n", "print(\"Unique to countries: \")\n", "print(pd.unique(df[cdtr_ctry_col]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset Labeling \n", "In the next several steps we label messages for each country in the raw dataset generated by the ISO20022 message generator, rapide. The countries used are: CA, GB, IE, IN, MX, TH, US.\n", "\n", "The rules for labeling are described in the code. Each labeling cell splits a country's messages into labeled subsets by fractions; a sketch of that split pattern follows.\n" ] }, 
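{ "cell_type": "markdown", "metadata": {}, "source": [ "All labeling cells below use the same split-by-fractions pattern: `np.array_split(df, (fractions[:-1].cumsum() * len(df)).astype(int))` cuts a shuffled frame at the cumulative fraction boundaries. A minimal sketch with a throwaway frame (illustration only; `_demo_df` is not part of the dataset):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Demonstrate the fraction-based split on a tiny example frame\n", "_demo_df = pd.DataFrame({'x': range(10)})\n", "_fractions = np.array([0.25, 0.25, 0.50])\n", "_parts = np.array_split(_demo_df, (_fractions[:-1].cumsum() * len(_demo_df)).astype(int))\n", "print([len(p) for p in _parts]) # expect [2, 3, 5]\n" ] }, 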
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Labeling for India" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Label raw data to create labeled raw data set\n", "\n", "# Labeling for India as Creditor Country\n", "to_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IN']\n", "to_india_rows = to_india_df.shape[0]\n", "print(f\"No. of To India rows: {to_india_rows}\")\n", "to_india_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Labeling for India as Debtor Country\n", "from_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IN']\n", "from_india_rows = from_india_df.shape[0]\n", "print(f\"No. of From India rows: {from_india_rows}\")\n", "from_india_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import string\n", "\n", "# Make regulatory reporting missing i.e. NaN\n", "def make_regulatory_reporting_nan(df):\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Nm=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Tp=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Dt=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Ctry=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Amt_Ccy=np.NaN)\n", " df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Amt_text=np.NaN)\n", " return df\n", "\n", "# Random strings\n", "def generate_random_str():\n", " randstr0 = random.sample(string.ascii_lowercase,8)+random.sample(string.digits,4)+random.sample(string.ascii_lowercase,4)\n", " return ''.join(randstr0)\n", "\n", "def generate_reg_code_str():\n", " randstr0 = random.sample(string.ascii_uppercase,1)+random.sample(string.digits,5)\n", " return ''.join(randstr0)\n", " \n", "# Labeling for payments To India i.e. India Creditor Country\n", "# Success - 25% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd with specified \n", "# purpose codes i.e. ['00.P0006', '00.P0008', '13.P1301', '13.P1302']\n", "# CdtTrfTxInf.InstrForNxtAgt.InstrInf='None' i.e. NaN\n", 
NaN\n", "# Failure - 20% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd with specified values\n", "# CdtTrfTxInf.InstrForNxtAgt.InstrInf=\n", "# Presence of /REG/\n", "# /REG/Any value or 00.P0006, 00.P0008, 13.P1301, 13.P1302 \n", "# Success - 25% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000\n", "# CdtTrfTxInf.InstrForNxtAgt.InstrInf=\n", "# /REG/15.X0001 FDI in Retail\n", "# /REG/15.X0002 FDI in Agriculture\n", "# /REG/15.X0003 FDI in Transportation\n", "# Failure - 25% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000\n", "# CdtTrfTxInf.InstrForNxtAgt.InstrInf=\n", "# Missing /REG/\n", "# /REG/\n", "# /REG/15.X000[2-100] FDI in Retail\n", "# /REG/15.X0002 FDI in Retail\n", "# /REG/15.X000[1|3|4-100] FDI in Agriculture\n", "# /REG/15.X000[[1-2]|4-100]] FDI in Transportation\n", "# /REG/15.X0003, 15.X0004, 16.XXXXX, 17.XXXXX \n", "# Failure - 5% - Missing Regulatory Reporting element\n", "# \n", "to_india_fractions = np.array([0.25, 0.20, 0.25, 0.25, 0.05])\n", "# Labeling for regulatory element\n", "to_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IN']\n", "# Shuffle\n", "to_india_df = to_india_df.sample(frac=1)\n", "# Split into 5 dataframes \n", "to_india_success_df1, to_india_failure_df1, to_india_success_df2, to_india_failure_df2, to_india_no_reg_df = np.array_split(to_india_df, \n", " (to_india_fractions[:-1].cumsum() * len(to_india_df)).astype(int))\n", "\n", "print(f\"TO India_df after dataframe split, before processing row counts:\")\n", "print(f\"to_india_df dataframe has {to_india_df.shape[0]} rows\")\n", "print(f\"to_india_success_df1 dataframe has {to_india_success_df1.shape[0]} rows\")\n", "print(f\"to_india_failure_df1 dataframe has {to_india_failure_df1.shape[0]} rows\")\n", "print(f\"to_india_success_df2 dataframe has {to_india_success_df2.shape[0]} rows\")\n", "print(f\"to_india_failure_df2 dataframe has {to_india_failure_df2.shape[0]} rows\")\n", "print(f\"to_india_no_reg_df dataframe has {to_india_no_reg_df.shape[0]} rows\")\n", "\n", "#to_india_no_reg_df = to_india_df.sample(frack=0.05, random_state=299)\n", "\n", "print(f\"TO India after split, post processing row counts:\")\n", "# The ISO20022 message generator creates correct regulatory reporting structure\n", "# For CRED, with codes 00.P0006, 00.P0008, 13.P1301, 13.P1302\n", "# Part 1: Success with RgltryRptg.DbtCdtRptgInd=CRED\n", "to_india_success_df1 = to_india_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "to_india_success_df1.insert(0, 'y_target', 'Success') \n", "print(f\"to_india_success_df1 dataframe has {to_india_success_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_india_success_df1.shape}\")\n", "to_india_success_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 2: Failure 1 with RgltryRptg.DbtCdtRptgInd=CRED, InstrForNxtAgt.InstrInf with /REG/ values\n", "next_agt_instructions_fail_list = ['/REG/', '/REG/00.P0006', '/REG/00.P0008', '/REG/13.P1301', '/REG/13.P1302']\n", "\n", "def gen_india_failure_1(df):\n", " rows = df.shape[0]\n", " print(f\"gen_india_failure_1() dataframe has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_fail_list)\n", " append = random.choice([True, False])\n", " if 
next_agt_instructions=='/REG/':\n", " next_agt_instructions = next_agt_instructions + generate_random_str() if append else next_agt_instructions\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", "\n", "gen_india_failure_1(to_india_failure_df1)\n", "to_india_failure_df1.insert(0, 'y_target', 'Failure')\n", "print(f\"to_india_failure_df1 dataframe has {to_india_failure_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_india_failure_df1.shape}\")\n", "to_india_failure_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 3: Success with RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000\n", "# Requires InstrForNxtAgt.InstrInf with /REG/ values\n", "next_agt_instructions_succ_list = ['/REG/15.X0001 FDI in Retail', '/REG/15.X0002 FDI in Agriculture',\n", " '/REG/15.X0003 FDI in Transportation']\n", "def gen_india_success_2(df):\n", " rows = df.shape[0]\n", " print(f\"gen_india_success_2() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_succ_list)\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd']='00.00000'\n", "\n", "gen_india_success_2(to_india_success_df2)\n", "to_india_success_df2.insert(0, 'y_target', 'Success')\n", "print(f\"to_india_success_df2 dataframe has {to_india_success_df2.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_india_success_df2.shape}\")\n", "to_india_success_df2[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import exrex\n", "\n", "# Part 4: Failure with RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000\n", "# Missing InstrForNxtAgt.InstrInf with accepted /REG/ values\n", "next_agt_instructions_fail2_list = ['/REG/', '/REG/15.X000', '/REG/16.X']\n", "regs1 = list(exrex.generate('(\\/REG\\/15\\.X000[2-3]) FDI in Retail'))\n", "regs2 = list(exrex.generate('\\/REG\\/15\\.X000(?=1|3) FDI in Agriculture'))\n", "regs3 = list(exrex.generate('(\\/REG\\/15\\.X000[1-2]) FDI in Transportation'))\n", "regs4 = list(exrex.generate('(\\/REG\\/15\\.X000[4-9]) FDI in (Retail|Agriculture|Transportation)'))\n", "#regs5 = list(exrex.generate('(\\/REG\\/15\\.X000[1-9][0-9]) FDI in (Retail|Agriculture|Transportation)'))\n", "regs5 = list(exrex.generate('(\\/REG\\/15\\.X000\\d{2}) FDI in (Retail|Agriculture|Transportation)'))\n", "regs_fail_list = regs1 + regs2 + regs3 + regs4 + regs5\n", "next_agt_instructions_fail2_list = next_agt_instructions_fail2_list + regs_fail_list\n", "\n", "def gen_india_failure_2(df):\n", " rows = df.shape[0]\n", " print(f\"gen_india_failure_2() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_fail2_list)\n", " makenull = random.choice([True, False])\n", " if next_agt_instructions=='/REG/':\n", " next_agt_instructions = '' if makenull else next_agt_instructions\n", " if next_agt_instructions=='/REG/15.X000':\n", " next_agt_instructions = next_agt_instructions + str(random.randint(4, 100))\n", " if 
next_agt_instructions=='/REG/16.X':\n", " next_agt_instructions = '/REG/' + str(random.randint(16, 100)) + '.' + generate_reg_code_str()\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd']='00.00000'\n", "\n", "gen_india_failure_2(to_india_failure_df2)\n", "to_india_failure_df2.insert(0, 'y_target', 'Failure')\n", "print(f\"to_india_failure_df2 dataframe has {to_india_failure_df2.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_india_failure_df2.shape}\")\n", "to_india_failure_df2[['y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", " ]].head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 5: Regulatory reporting is missing\n", "to_india_no_reg_df = make_regulatory_reporting_nan(to_india_no_reg_df)\n", "print(f\"to_india_no_reg_df dataframe has {to_india_no_reg_df.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_india_no_reg_df.shape}\")\n", "to_india_no_reg_df[[\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", " ]\n", "]" ] }, 
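{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity check (a sketch): the five To India subsets produced by `np.array_split` above should partition `to_india_df` exactly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The five To India subsets should add up to to_india_df row for row\n", "india_parts = [to_india_success_df1, to_india_failure_df1, to_india_success_df2, to_india_failure_df2, to_india_no_reg_df]\n", "assert sum(len(p) for p in india_parts) == len(to_india_df)\n", "print('To India subset sizes:', [len(p) for p in india_parts])\n" ] }, 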
NaN\n", "# Failure - 50% - RgltryRptg.DbtCdtRptgInd=DEBT, RgltryRptg is missing\n", "\n", "from_india_fractions = np.array([0.50, 0.50])\n", "from_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IN']\n", "# Shuffle\n", "from_india_df = from_india_df.sample(frac=1)\n", "# Split into 2 dataframes with and without regulatory reporting elements\n", "from_india_with_reg_df, from_india_no_reg_df = np.array_split(\n", " from_india_df, \n", " (from_india_fractions[:-1].cumsum() * len(from_india_df)).astype(int))\n", "\n", "print(f\"FROM India df after split, before processing row counts:\")\n", "print(f\"from_india_df dataframe has {from_india_df.shape[0]} rows\")\n", "print(f\"from_india_with_reg_df dataframe has {from_india_with_reg_df.shape[0]} rows\")\n", "print(f\"from_india_no_reg_df dataframe has {from_india_no_reg_df.shape[0]} rows\")\n", "\n", "print(f\"From India df after split, post processing row counts:\")\n", "\n", "# From India with regulatory reporting to Success\n", "#from_india_with_reg_df = from_india_df.sample(frack=0.80, random_state=301)\n", "from_india_with_reg_df = from_india_with_reg_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "from_india_with_reg_df.insert(0, 'y_target', 'Success') \n", "print(f\"from_india_with_reg_df dataframe has {from_india_with_reg_df.shape[0]} rows\")\n", "\n", "# Remove regulatory element for India\n", "# from_india_no_reg_df = from_india_df.sample(frack=0.20, random_state=299)\n", "from_india_no_reg_df = make_regulatory_reporting_nan(from_india_no_reg_df)\n", "print(f\"from_india_no_reg_df dataframe has {from_india_no_reg_df.shape[0]} rows\")\n", "\n", "# final data frame with no regulatory reporting\n", "india_no_reg_frames = [to_india_no_reg_df, from_india_no_reg_df]\n", "india_no_reg_df = pd.concat(india_no_reg_frames)\n", "india_no_reg_df = india_no_reg_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "# add target to Failure \n", "india_no_reg_df.insert(0, 'y_target', 'Failure')\n", "print(f\"india_no_reg_df dataframe has {india_no_reg_df.shape[0]} rows\")\n", "\n", "print(f\"Shape: {india_no_reg_df.shape}\")\n", "india_no_reg_df[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", " ]\n", "].head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#from_india_with_reg_df.head()\n", "from_india_with_reg_df[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'\n", " ]\n", "].head(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assemble all the dataframes to produce final dataset for India\n", "final_india_frames = [to_india_success_df1, \n", " to_india_failure_df1,\n", " to_india_success_df2,\n", " to_india_failure_df2,\n", " india_no_reg_df,\n", " from_india_with_reg_df]\n", "\n", "final_india_df = pd.concat(final_india_frames)\n", "\n", "print(f\"Final labeled India dataframe shape: {final_india_df.shape}\")\n", "final_india_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "success_df = final_india_df[final_india_df['y_target']=='Success']\n", "success_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "failure_df 
= final_india_df[final_india_df['y_target']=='Failure']\n", "print(failure_df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Labeling for US" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Labeling for payments To US i.e. US Creditor Country\n", "# Success - 30% - CdtTrfTxInf_InstrForNxtAgt_InstrInf='None'\n", "# Success - 30% - CdtTrfTxInf_InstrForNxtAgt_InstrInf=\n", "# /SVC/It is to be delivered in one day. Two day penalty 2bp;three day penalty 3bp;greater than three days forfeit agent fee.\n", "# /SVC/It is to be delivered in two days. Three day penalty 2bp;greater than three days forfeit agent fee.\n", "# /SVC/It is to be delivered in three days. Greater than three days penalty add 2bp per day.\n", "# \n", "# Failure - 40% - CdtTrfTxInf.InstrForNxtAgt.InstrInf=\n", "# Presence of /SVC/ with nothing\n", "# /SVC/Any other string\n", "# /SVC/It is to be delivered in four days. Greater than four days penalty add 2bp per day \n", "# \n", "to_us_fractions = np.array([0.30, 0.30, 0.40])\n", "# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element\n", "to_us_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='US']\n", "# Shuffle\n", "to_us_df = to_us_df.sample(frac=1)\n", "# Split into 3 dataframes \n", "to_us_success_df1, to_us_success_df2, to_us_failure_df1 = np.array_split(to_us_df, \n", " (to_us_fractions[:-1].cumsum() * len(to_us_df)).astype(int))\n", "\n", "print(\"TO US after dataframe split, before processing row counts:\")\n", "print(f\"to_us_df dataframe has {to_us_df.shape[0]} rows\")\n", "print(f\"to_us_success_df1 dataframe has {to_us_success_df1.shape[0]} rows\")\n", "print(f\"to_us_success_df2 dataframe has {to_us_success_df2.shape[0]} rows\")\n", "print(f\"to_us_failure_df1 dataframe has {to_us_failure_df1.shape[0]} rows\")\n", "\n", "print(\"TO US after split, post processing row counts:\")\n", "\n", "# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN\n", "to_us_success_df1 = to_us_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "to_us_success_df1.insert(0, 'y_target', 'Success') \n", "print(f\"to_us_success_df1 dataframe has {to_us_success_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_us_success_df1.shape}\")\n", "to_us_success_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, 
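{ "cell_type": "markdown", "metadata": {}, "source": [ "The failure cells below enumerate invalid instruction strings with `exrex`, which expands a regular expression into the strings it matches. A minimal sketch with a hypothetical pattern:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import exrex\n", "\n", "# Expand a small alternation into every string it matches\n", "print(list(exrex.generate('/SVC/Deliver in (one|two|three) days')))\n" ] }, 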
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 2: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ valid strings\n", "# \n", "next_agt_instructions_us_succ_list = [\n", " '/SVC/It is to be delivered in one day. Two day penalty 2bp;three day penalty 3bp;greater than three days forfeit agent fee.',\n", " '/SVC/It is to be delivered in two days. Three day penalty 2bp;greater than three days forfeit agent fee.',\n", " '/SVC/It is to be delivered in three days. Greater than three days penalty add 2bp per day.'\n", "]\n", "\n", "def gen_us_success_2(df):\n", " rows = df.shape[0]\n", " print(f\"gen_us_success_2() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_us_succ_list)\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", "\n", "gen_us_success_2(to_us_success_df2)\n", "to_us_success_df2.insert(0, 'y_target', 'Success') \n", "print(f\"to_us_success_df2 dataframe has {to_us_success_df2.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_us_success_df2.shape}\")\n", "to_us_success_df2[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 3: Failure with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ with invalid strings\n", "# \n", "next_agt_instructions_us_failure_list = [\n", " '/SVC/'\n", "]\n", "\n", "gen_str1 = list(exrex.generate('\\/SVC\\/It is to be delivered in one day\\. Two day penalty (?=[3-9])bp;three day penalty (?=[1-2]|[4-9])bp;greater than three days forfeit agent fee\\.'))\n", "gen_str2 = list(exrex.generate('\\/SVC\\/It is to be delivered in two days\\. Three penalty (?=[1]|[3-9])bp;greater than three days forfeit agent fee\\.'))\n", "gen_str3 = list(exrex.generate('\\/SVC\\/It is to be delivered in three days\\. Greater than three days penalty add (?=1|[3-9])bp per day\\.'))\n", "gen_str4 = list(exrex.generate('\\/SVC\\/It is to be delivered in (two|three) day\\. Three day penalty 2bp;three day penalty 3bp;greater than three days forfeit agent fee\\.'))\n", "gen_str5 = list(exrex.generate('\\/SVC\\/It is to be delivered in (one|three) days\\. Three penalty 2bp;greater than three days forfeit agent fee\\.'))\n", "gen_str6 = list(exrex.generate('\\/SVC\\/It is to be delivered in (one|two) days\\. Greater than three days penalty add 2bp per day\\.'))\n", "gen_str7 = list(exrex.generate('\\/SVC\\/It is to be delivered in (four|five|six) days\\.'))\n", "gen_str = gen_str1 + gen_str2 + gen_str3 + gen_str4 + gen_str5 + gen_str6 + gen_str7\n", "print(len(gen_str))\n", "next_agt_instructions_us_failure_list = next_agt_instructions_us_failure_list + gen_str\n", "print(len(next_agt_instructions_us_failure_list))\n", "\n", "def gen_us_failure_1(df):\n", " rows = df.shape[0]\n", " print(f\"gen_us_failure_1() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_us_failure_list)\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", "\n", "gen_us_failure_1(to_us_failure_df1)\n", "to_us_failure_df1.insert(0, 'y_target', 'Failure') \n", "print(f\"to_us_failure_df1 dataframe has {to_us_failure_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_us_failure_df1.shape}\")\n", "to_us_failure_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Labeling for payments From US i.e. US Debtor Country\n", 
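"# No synthetic failure rule applies on the From US side in this setup,\n", "# so every row is labeled Success and the next-agent instructions are cleared.\n", 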
"from_us_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='US']\n", "# Shuffle\n", "from_us_df = from_us_df.sample(frac=1)\n", "# No Splits\n", "\n", "print(\"FROM US dataframe, before processing row counts:\")\n", "print(f\"from_us_df dataframe has {from_us_df.shape[0]} rows\")\n", "\n", "print(\"FROM US post processing row counts:\")\n", "\n", "# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN\n", "from_us_df = from_us_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "from_us_df.insert(0, 'y_target', 'Success') \n", "print(f\"from_us_df dataframe has {from_us_df.shape[0]} rows\")\n", "\n", "print(f\"Shape: {from_us_df.shape}\")\n", "from_us_df[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assemble all the dataframes to produce final dataset for US\n", "final_us_frames = [to_us_success_df1, \n", " to_us_success_df2,\n", " to_us_failure_df1,\n", " from_us_df]\n", "\n", "final_us_df = pd.concat(final_us_frames)\n", "\n", "print(f\"Final labeled US dataframe shape: {final_us_df.shape}\")\n", "final_us_df.head()" ] }, 
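{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick look at the class balance of the labeled US dataset (a sketch using pandas `value_counts`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribution of Success/Failure labels in the final US dataframe\n", "print(final_us_df['y_target'].value_counts())\n" ] }, 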
\n", "# \n", "to_gb_fractions = np.array([0.20, 0.30, 0.50])\n", "# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element\n", "to_gb_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='GB']\n", "# Shuffle\n", "to_gb_df = to_gb_df.sample(frac=1)\n", "# Split into 3 dataframes \n", "to_gb_success_df1, to_gb_success_df2, to_gb_failure_df1 = np.array_split(to_gb_df, \n", " (to_gb_fractions[:-1].cumsum() * len(to_gb_df)).astype(int))\n", "\n", "print(f\"TO GB after dataframe split, before processing row counts:\")\n", "print(f\"to_gb_df dataframe has {to_gb_df.shape[0]} rows\")\n", "print(f\"to_gb_success_df1 dataframe has {to_gb_success_df1.shape[0]} rows\")\n", "print(f\"to_gb_success_df2 dataframe has {to_gb_success_df2.shape[0]} rows\")\n", "print(f\"to_gb_failure_df1 dataframe has {to_gb_failure_df1.shape[0]} rows\")\n", "\n", "print(f\"TO GB after split, post processing row counts:\")\n", "\n", "# Part 1: Success with dtTrfTxInf_InstrForNxtAgt_InstrInf=NaN\n", "to_gb_success_df1 = to_gb_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "to_gb_success_df1.insert(0, 'y_target', 'Success') \n", "print(f\"to_gb_success_df1 dataframe has {to_gb_success_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_gb_success_df1.shape}\")\n", "to_gb_success_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 2: Success with dtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ or /ACC/ valid strings\n", "# \n", "next_agt_instructions_gb_succ_list = [\n", " '/ACC/No return possible – Creditor Account closing today', \n", " '/SVC/It is delivered same business day. Non-delivery penalty 2bp per day.',\n", " '/SVC/It is to be delivered in one day. Two day penalty 2bp;greater than two days penalty add 1bp per day.'\n", "]\n", "\n", "def gen_gb_success_2(df):\n", " rows = df.shape[0]\n", " print(f\"gen_gb_success_2() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_gb_succ_list)\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", "\n", "gen_gb_success_2(to_gb_success_df2)\n", "to_gb_success_df2.insert(0, 'y_target', 'Success') \n", "print(f\"to_gb_success_df2 dataframe has {to_gb_success_df2.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_gb_success_df2.shape}\")\n", "to_gb_success_df2[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 3: Failure with dtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ or /ACC/ with invalid strings\n", "# Presence of /ACC/ with nothing\n", "# Presence of /SVC/ with nothing\n", "# /ACC/ Anyother string\n", "# /SVC/ Anyother string\n", "# /SVC/It is to be delivered in two days. Greater than two days penalty add 2bp per day \n", "# /SVC/It is to be delivered in three days. 
Greater than three days penalty add 2bp per day \n", "next_agt_instructions_gb_failure_list = [\n", " '/ACC/',\n", " '/SVC/'\n", "]\n", "\n", "gen_str_a1 = list(exrex.generate('\\/ACC\\/return possible – Creditor Account closing (today|tomorrow|this week|next week)'))\n", "gen_str_a2 = list(exrex.generate('\\/ACC\\/Yes return possible – Creditor Account closing (today|tomorrow|this week|next week)'))\n", "\n", "gen_str1 = list(exrex.generate('\\/SVC\\/It is delivered same business day\\. Non-delivery penalty (?=[1]|[3-9])bp per day\\.'))\n", "gen_str2 = list(exrex.generate('\\/SVC\\/It is to be delivered in one business day\\. Two day penalty (?=[1]|[3-9])bp;greater than two days penalty add (?=[2-9])bp per day\\.'))\n", "gen_str3 = list(exrex.generate('\\/SVC\\/It is to be delivered in (one|two|three|four) business day\\. Non-delivery penalty 2bp per day\\.'))\n", "gen_str4 = list(exrex.generate('\\/SVC\\/It is to be delivered in (same|two|three|four|five) business day\\. Two day penalty 2bp;greater than two days penalty add 1bp per day\\.'))\n", "\n", "gen_str = gen_str1 + gen_str2 + gen_str3 + gen_str4 + gen_str_a1 + gen_str_a2\n", "\n", "next_agt_instructions_gb_failure_list = next_agt_instructions_gb_failure_list + gen_str\n", "\n", "def gen_gb_failure_1(df):\n", " rows = df.shape[0]\n", " print(f\"gen_gb_failure_1() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_gb_failure_list)\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", "\n", "gen_gb_failure_1(to_gb_failure_df1)\n", "to_gb_failure_df1.insert(0, 'y_target', 'Failure') \n", "print(f\"to_gb_failure_df1 dataframe has {to_gb_failure_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_gb_failure_df1.shape}\")\n", "to_gb_failure_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Labeling for payments From GB i.e. 
GB Debtor Country\n", "from_gb_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='GB']\n", "# Shuffle\n", "from_gb_df = from_gb_df.sample(frac=1)\n", "# No Splits\n", "\n", "print(\"FROM GB dataframe, before processing row counts:\")\n", "print(f\"from_gb_df dataframe has {from_gb_df.shape[0]} rows\")\n", "\n", "print(\"FROM GB post processing row counts:\")\n", "\n", "# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN\n", "from_gb_df = from_gb_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "from_gb_df.insert(0, 'y_target', 'Success') \n", "print(f\"from_gb_df dataframe has {from_gb_df.shape[0]} rows\")\n", "\n", "print(f\"Shape: {from_gb_df.shape}\")\n", "from_gb_df[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assemble all the dataframes to produce final dataset for GB\n", "final_gb_frames = [to_gb_success_df1, \n", " to_gb_success_df2,\n", " to_gb_failure_df1,\n", " from_gb_df]\n", "\n", "final_gb_df = pd.concat(final_gb_frames)\n", "\n", "print(f\"Final labeled GB dataframe shape: {final_gb_df.shape}\")\n", "final_gb_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Labeling for Ireland" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Labeling for payments To Ireland i.e. IE Creditor Country\n", "# Success - 25% - CdtTrfTxInf_InstrForNxtAgt_InstrInf='None'\n", "# Success - 25% - CdtTrfTxInf_InstrForNxtAgt_InstrInf=\n", "# /SVC/It is to be delivered in one day. Two day penalty 2bp, three days penalty 5bp.\n", "# /TRSY/Treasury Services Platinum Customer\n", "# /TRSY/Treasury Services Gold Customer\n", "# /TRSY/Treasury Services Silver Customer\n", "# \n", "# Failure - 50% - CdtTrfTxInf.InstrForNxtAgt.InstrInf=\n", "# Presence of /SVC/ with nothing\n", "# /SVC/ Any other string \n", "# /TRSY/ Any other string \n", "# \n", "to_ie_fractions = np.array([0.25, 0.25, 0.50])\n", "# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element\n", "to_ie_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IE']\n", "# Shuffle\n", "to_ie_df = to_ie_df.sample(frac=1)\n", "# Split into 3 dataframes \n", "to_ie_success_df1, to_ie_success_df2, to_ie_failure_df1 = np.array_split(to_ie_df, \n", " (to_ie_fractions[:-1].cumsum() * len(to_ie_df)).astype(int))\n", "\n", "print(\"TO IE after dataframe split, before processing row counts:\")\n", "print(f\"to_ie_df dataframe has {to_ie_df.shape[0]} rows\")\n", "print(f\"to_ie_success_df1 dataframe has {to_ie_success_df1.shape[0]} rows\")\n", "print(f\"to_ie_success_df2 dataframe has {to_ie_success_df2.shape[0]} rows\")\n", "print(f\"to_ie_failure_df1 dataframe has {to_ie_failure_df1.shape[0]} rows\")\n", "\n", "print(\"TO IE after split, post processing row counts:\")\n", "\n", "# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN\n", "to_ie_success_df1 = to_ie_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "to_ie_success_df1.insert(0, 'y_target', 'Success') \n", "print(f\"to_ie_success_df1 dataframe has {to_ie_success_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_ie_success_df1.shape}\")\n", "to_ie_success_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "# Part 2: Success with dtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ valid strings\n", "# \n", "next_agt_instructions_ie_succ_list = [\n", " '/SVC/It is to be delivered in one day. Two day penalty 2bp;three days penalty 5bp.',\n", " '/TRSY/Treasury Services Platinum Customer',\n", " '/TRSY/Treasury Services Gold Customer',\n", " '/TRSY/Treasury Services Silver Customer'\n", "]\n", "\n", "def gen_ie_success_2(df):\n", " rows = df.shape[0]\n", " print(f\"gen_ie_success_2() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_ie_succ_list)\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", "\n", "gen_ie_success_2(to_ie_success_df2)\n", "to_ie_success_df2.insert(0, 'y_target', 'Success') \n", "print(f\"to_ie_success_df2 dataframe has {to_ie_success_df2.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_ie_success_df2.shape}\")\n", "to_ie_success_df2[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Part 3: Failure with dtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ with invalid strings\n", "# Presence of /SVC/ with nothing\n", "# /SVC/ Anyother string \n", "# Presence of /TRSY/ with nothing\n", "# /TRSY/ Anyother string \n", "next_agt_instructions_ie_failure_list = [\n", " '/SVC/',\n", " '/TRSY/'\n", "]\n", "\n", "gen_str1 = list(exrex.generate('\\/SVC\\/It is to be delivered in one day\\. Two day penalty (?=1|[3-9])bp;three days penalty (?=[1-4]|[6-9])bp\\.'))\n", "gen_str2 = list(exrex.generate('\\/TRSY\\/Treasury Services (Bronze|Copper|Iron|Glass|Earth|Water|Fire|Unimportant|Low|Important|Corporate|Small Business|Large Business|Business) Customer'))\n", "next_agt_instructions_ie_failure_list = next_agt_instructions_ie_failure_list + gen_str1 + gen_str2\n", "\n", "def gen_ie_failure_1(df):\n", " rows = df.shape[0]\n", " print(f\"gen_ie_failure_1() has {rows} rows\")\n", " for i in df.index:\n", " next_agt_instructions = random.choice(next_agt_instructions_ie_failure_list)\n", " #addstring = random.choice([True, False])\n", " #next_agt_instructions = next_agt_instructions + generate_random_str() if addstring else next_agt_instructions\n", " df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions\n", "\n", "gen_ie_failure_1(to_ie_failure_df1)\n", "to_ie_failure_df1.insert(0, 'y_target', 'Failure') \n", "print(f\"to_ie_success_df2 dataframe has {to_ie_failure_df1.shape[0]} rows\")\n", "\n", "print(f\"Shape: {to_ie_failure_df1.shape}\")\n", "to_ie_failure_df1[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Labeling for payment payments From Ireland i.e. 
IE Debtor Country\n", "from_ie_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IE']\n", "# Shuffle\n", "from_ie_df = from_ie_df.sample(frac=1)\n", "# No Splits\n", "\n", "print(\"FROM IE dataframe, before processing row counts:\")\n", "print(f\"from_ie_df dataframe has {from_ie_df.shape[0]} rows\")\n", "\n", "print(\"FROM IE post processing row counts:\")\n", "\n", "# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN\n", "from_ie_df = from_ie_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "from_ie_df.insert(0, 'y_target', 'Success') \n", "print(f\"from_ie_df dataframe has {from_ie_df.shape[0]} rows\")\n", "\n", "print(f\"Shape: {from_ie_df.shape}\")\n", "from_ie_df[[\n", " 'y_target',\n", " 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'\n", "]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assemble all the dataframes to produce final dataset for Ireland\n", "final_ie_frames = [to_ie_success_df1, \n", " to_ie_success_df2,\n", " to_ie_failure_df1,\n", " from_ie_df]\n", "\n", "final_ie_df = pd.concat(final_ie_frames)\n", "\n", "print(f\"Final labeled Ireland dataframe shape: {final_ie_df.shape}\")\n", "final_ie_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Labeling for Canada, Mexico, Thailand\n", "To Canada, Mexico, Thailand from all except India:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To Canada, Mexico, Thailand from all, except India\n", "\n", "final_to_ca_mx_th_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry'].isin(['CA', 'MX', 'TH'])) & \n", " (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']!='IN')]\n", "\n", "print(f\"To Canada, Mexico, Thailand, from all except India dataframe shape: {final_to_ca_mx_th_df.shape}\")\n", "\n", "final_to_ca_mx_th_df = final_to_ca_mx_th_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "final_to_ca_mx_th_df.insert(0, 'y_target', 'Success') \n", "\n", "final_to_ca_mx_th_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To Canada, Mexico, Thailand from India:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To Canada, Mexico, Thailand, from India\n", "\n", "to_ca_mx_th_fr_in_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry'].isin(['CA', 'MX', 'TH'])) & \n", " (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IN')]\n", "\n", "print(f\"To Canada, Mexico, Thailand from India dataframe shape: {to_ca_mx_th_fr_in_df.shape}\")\n", "\n", "to_fractions = np.array([0.50, 0.50])\n", "# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element\n", "# Split into 2 dataframes \n", "to_ca_mx_th_fr_in_succ_df, to_ca_mx_th_fr_in_failure_df = np.array_split(to_ca_mx_th_fr_in_df, \n", " (to_fractions[:-1].cumsum() * len(to_ca_mx_th_fr_in_df)).astype(int))\n", "\n", "to_ca_mx_th_fr_in_succ_df = to_ca_mx_th_fr_in_succ_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", "to_ca_mx_th_fr_in_succ_df.insert(0, 'y_target', 'Success') \n", "\n", "to_ca_mx_th_fr_in_failure_df = make_regulatory_reporting_nan(to_ca_mx_th_fr_in_failure_df)\n", "to_ca_mx_th_fr_in_failure_df = to_ca_mx_th_fr_in_failure_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.NaN)\n", 
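"# The half with regulatory reporting stripped out is labeled Failure\n", "to_ca_mx_th_fr_in_failure_df.insert(0, 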
'y_target', 'Failure') \n", "\n", "frames = [to_ca_mx_th_fr_in_succ_df, to_ca_mx_th_fr_in_failure_df]\n", "final_to_ca_mx_th_fr_in_df = pd.concat(frames)\n", "\n", "print(f\"final_to_ca_mx_th_fr_in_df shape: {final_to_ca_mx_th_fr_in_df.shape}\")\n", "\n", "final_to_ca_mx_th_fr_in_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final Assembly of Labeled Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "final_frames = [final_us_df, final_india_df, final_gb_df, final_ie_df, final_to_ca_mx_th_df, final_to_ca_mx_th_fr_in_df]\n", "\n", "final_df = pd.concat(final_frames)\n", "\n", "final_df = final_df.sample(frac=1)\n", "\n", "labeled_raw_data_output_path = 'synthetic-data/labeled_data.csv'\n", "print(f'Saving labeled synthetic raw data with headers to {labeled_raw_data_output_path}')\n", "final_df.to_csv(labeled_raw_data_output_path, index=False)\n", "\n", "print(f\"Final labeled dataframe shape: {final_df.shape}\")\n", "final_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload to S3 Bucket" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "labeled_data_location = sm_session.upload_data(\n", " path=labeled_raw_data_output_path,\n", " bucket=s3_bucket_name,\n", " key_prefix=labeled_data_prefix,\n", ")\n", "print(f\"Labeled Data Location: {labeled_data_location}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "instance_type": "ml.g4dn.xlarge", "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 4 }