# ISO 20022 PACS.008 Synthetic Data Generation Notebook

This notebook demonstrates how to generate synthetic raw data and synthetic labeled raw data for ISO 20022 pacs.008 serial payment messages. The labeled raw data is used to build a training dataset for an ML model that predicts whether a pacs.008 cross-border payment XML message, compliant with the CBPR+ specification, will be accepted or rejected by the financial institution (FI) receiving it. The raw datasets are uploaded to an Amazon S3 bucket that you specify in the notebook. The process of generating raw synthetic data is as follows:

1. Generate pacs.008 XML messages using `rapide`, an open source ISO20022 message generator available at https://github.com/aws-samples/iso20022-message-generator.
1. Upload the zip file to the `data` directory in the same location as this notebook. If the `data` directory doesn't exist, create it.
1. Modify the location of zip file(s) in the notebook.


**Dataset in this notebook:** For this sample notebook, `rapide` was used to generate pacs.008 messages from BIC and LEI datasets produced by the accompanying Python notebook [iso20022_lei_bic_dataset.ipynb](./iso20022_lei_bic_datasets.ipynb). That notebook generates:
* Fake BIC codes for the following countries:
 * CA - Canada
 * GB - Great Britain
 * IE - Ireland
 * IN - India
 * MX - Mexico
 * TH - Thailand
 * US - United States
* A subset of LEI entities as described in the notebook

This sample notebook demonstrates generation of raw and labeled raw datasets. It uses a gzipped tar archive containing pacs.008 XML messages generated by `rapide`:
* `iso20022-data/iso20022-raw-messages.tar.gz`: Contains thousands of pacs.008 XML messages 


## Environment Setup

### Install packages

Before running the code, make sure these libraries are installed and that the pandas version is compatible with this notebook.

In [None]:
!pip install pandas

In [None]:
!pip install xmltodict

String generators based on regex:

In [None]:
!pip install rstr
!pip install exrex

### Authentication and Authorization

In [None]:
import os
import boto3
import sagemaker
from sagemaker import get_execution_role

sm_client = boto3.Session().client('sagemaker')
sm_session = sagemaker.Session()
region = boto3.session.Session().region_name

role = get_execution_role()
print ("Notebook is running with assumed role {}".format (role))
print("Working with AWS services in the {} region".format(region))

## Provide S3 bucket name
Create an S3 bucket where the synthetic raw and labeled data will be stored, and provide the bucket name here.

In [None]:
# Working directory for the notebook
WORKDIR = os.getcwd()
BASENAME = os.path.dirname(WORKDIR)

# Create a directory for storing generated synthetic data
synthetic_data_local_path = 'synthetic-data'
if not os.path.exists(synthetic_data_local_path):
    os.makedirs(synthetic_data_local_path)

# Store all prototype assets in this bucket
# Update this variable with the S3 bucket name in the region where you are running this notebook in your account
# Note that S3 bucket names are globally unique
s3_bucket_name = 'iso20022-prototype-t3'
s3_bucket_uri = 's3://' + s3_bucket_name

# Prefix for all files in this prototype
prefix = 'iso20022'

pacs008_prefix = prefix + '/pacs008'
raw_pacs008_messages_prefix = pacs008_prefix + '/raw-messages'
raw_data_prefix = pacs008_prefix + '/raw-data'
labeled_data_prefix = pacs008_prefix + '/labeled-data'

raw_data_location = s3_bucket_uri + '/' + raw_data_prefix
labeled_data_location = s3_bucket_uri + '/' + labeled_data_prefix

print("Downloading raw pacs008 messages from {}".format (s3_bucket_uri + '/' + raw_pacs008_messages_prefix))
print(f"Raw synthetic data will be uploaded to {raw_data_location}")
print(f"Labeled raw synthetic data will be uploaded to {labeled_data_location}")

## Read Raw XML Messages

In [None]:
import xmltodict
from collections import OrderedDict
import re
import zipfile
import tarfile
import glob
import random
import string

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.exceptions import DataConversionWarning
from sklearn.compose import make_column_transformer

# Flatten .xml files and turn them into a Dictionary
def flatten_dict(d):
    def items():
        for key, value in d.items():
            if isinstance(value, dict):
                for subkey, subvalue in flatten_dict(value).items():
                    yield key + "." + subkey, subvalue
            else:
                yield key, value

    return OrderedDict(items())

# Parsing dictionary and transforming it into a data frame.
# Each record is a data frame
def parse_dict_df(file):
    with open(file) as fi:
        xml = xmltodict.parse(fi.read())
        flat = flatten_dict(xml)
        data = pd.DataFrame(flat, columns=flat.keys(), index=[0])
    return data
 
# tar/gzipped archive
raw_data_targz_path = 'iso20022-data/iso20022-raw-messages.tar.gz'
extract_to_dir = './iso20022-raw-messages'
print(f"Extracting files from {raw_data_targz_path} to {extract_to_dir}")
with tarfile.open(raw_data_targz_path, 'r:gz') as gzip_ref:
    gzip_ref.extractall(extract_to_dir)
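
To make the flattening step concrete, the sketch below applies the same recursive idea to a small hand-written dictionary shaped like `xmltodict` output. The sample keys and values are hypothetical and not taken from a real message; `flatten` is a stand-alone stand-in for `flatten_dict` above.

```python
from collections import OrderedDict

def flatten(d, sep="."):
    """Recursively join nested keys with `sep`, as flatten_dict does."""
    out = OrderedDict()
    for key, value in d.items():
        if isinstance(value, dict):
            for subkey, subvalue in flatten(value, sep).items():
                out[key + sep + subkey] = subvalue
        else:
            out[key] = value
    return out

# Hypothetical fragment: xmltodict stores XML attributes under '@name'
# and element text under '#text'.
sample = {
    "Document": {
        "GrpHdr": {"NbOfTxs": "1"},
        "IntrBkSttlmAmt": {"@Ccy": "USD", "#text": "100.00"},
    }
}

flat = flatten(sample)
for name, value in flat.items():
    print(name, "=", value)
```

Only nested dictionaries are expanded. Repeated XML elements, which `xmltodict` returns as lists, would stay as single list values under this approach.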


## Create Raw Dataset From PACS 008 XML Messages

In [None]:
local_path ='./iso20022-raw-messages/messages/'
print('Reading raw messages input data from local path {}'.format(local_path))
files = glob.glob(os.path.join(local_path, "*.xml"))
print(f"No. of pacs008 XML message files: {len(files)}")

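print("Reading XML messages and creating data frame...")
# Parse every file first, then concatenate once (DataFrame.append was removed in pandas 2.0)
frames = [parse_dict_df(f) for f in files]
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()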

df = df.sample(frac=1)
print(f'Data frame shape after read raw pacs008 XML messages: {df.shape}')

In [None]:
df.head(5)

### Transform Column Names

In [None]:
# Trim column names
for col in list(df.columns):
    new_col = col.replace('RequestPayload.h:','').replace('RequestPayload.Doc:','').replace('.h:','.').replace('.Doc:','.')
    new_col = new_col.replace('.', '_').replace(':', '_').replace('#','').replace('@','')
    df = pd.DataFrame(df.rename(columns={col:new_col}))
print(f"Columns: {df.columns}")

# Change data types.
# All features have 'object' data type. Features should be changed to their respective data type.
date_fts=['AppHdr_CreDt', 'Document_FIToFICstmrCdtTrf_GrpHdr_CreDtTm', 
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_IntrBkSttlmDt',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Dt'
 ]

numeric_fts=['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_IntrBkSttlmAmt_text', 
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstdAmt_text', 
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Amt_text',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_ChrgsInf_Amt_text',
 'Document_FIToFICstmrCdtTrf_GrpHdr_NbOfTxs']

for col in date_fts:
    df = df.astype({col: 'datetime64[ns]'})

for col in numeric_fts:
    df = df.astype({col: 'float64'})

raw_data_output_path = 'synthetic-data/raw_data.csv'
print(f'Saving synthetic raw data with headers to {raw_data_output_path}')
df.to_csv(raw_data_output_path, index=False)
print("Complete synthetic raw dataset preparation.")

print(f"Shape: {df.shape}")
df.head()
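
The renaming rules are easier to follow on individual names. The sketch below wraps the same chain of replacements in a helper (`clean_column_name` is an illustrative name, not part of the notebook) and applies it to a few hypothetical flattened column names.

```python
def clean_column_name(col):
    """Apply the same replacements as the renaming loop to one column name."""
    # Drop the envelope prefixes, then collapse the namespace markers.
    col = col.replace("RequestPayload.h:", "").replace("RequestPayload.Doc:", "")
    col = col.replace(".h:", ".").replace(".Doc:", ".")
    # Turn separators into underscores and strip xmltodict's '#' and '@' markers.
    return col.replace(".", "_").replace(":", "_").replace("#", "").replace("@", "")

# Hypothetical flattened names, for illustration only.
examples = [
    "RequestPayload.h:AppHdr.h:CreDt",
    "RequestPayload.Doc:Document.Doc:FIToFICstmrCdtTrf.Doc:GrpHdr.Doc:NbOfTxs",
    "RequestPayload.Doc:Document.Doc:FIToFICstmrCdtTrf.Doc:CdtTrfTxInf.Doc:IntrBkSttlmAmt.#text",
]
cleaned = [clean_column_name(c) for c in examples]
for before, after in zip(examples, cleaned):
    print(before, "->", after)
```

Since `DataFrame.rename` accepts a function for `columns`, a helper like this could also be applied in one call, for example `df.rename(columns=clean_column_name)`.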

### Upload Raw Dataset to S3

In [None]:
raw_data_upload_location = sm_session.upload_data(
 path="synthetic-data/raw_data.csv",
 bucket=s3_bucket_name,
 key_prefix=raw_data_prefix
)

print(f"Uploaded raw data to: {raw_data_upload_location}")

**Experiments with data sizes**

In [None]:

def get_country_code_from_bic(bic):
    return bic[4:6]

# Get Country Code from BIC
print(get_country_code_from_bic('SBININBB'))

print(f"Starting rows: {df.shape[0]}")

#us_count2 = df.loc[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='US') &
# (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='US')].shape[0]
#print(f"Total US2 rows: {us_count2}")

us_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='US') |
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='US')]
us_count = us_df.shape[0]
print(f"Total US rows: {us_count}")

us_duplicated_df = us_df[us_df.duplicated()]
print(f"Duplicated US df shape {us_duplicated_df.shape}")
removed_us_dups_df = us_df.drop_duplicates()
print(f"Removed Duplicated US df shape {removed_us_dups_df.shape}")

ca_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='CA') |
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='CA')]
ca_count = ca_df.shape[0]
print(f"Total Canada rows: {ca_count}")

in_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IN') |
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IN')]
in_count = in_df.shape[0]
print(f"Total India rows: {in_count}")

gb_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='GB') |
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='GB')]
gb_count = gb_df.shape[0]
print(f"Total GB rows: {gb_count}")

ie_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IE') |
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IE')]
ie_count = ie_df.shape[0]
print(f"Total Ireland rows: {ie_count}")

mx_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='MX') |
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='MX')]
mx_count = mx_df.shape[0]
print(f"Total Mexico rows: {mx_count}")

th_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='TH') |
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='TH')]
th_count = th_df.shape[0]
print(f"Total Thailand rows: {th_count}")

print(f"Total rows: {in_count+us_count+ca_count+gb_count+ie_count+mx_count+th_count}")

test_frames = [us_df, ca_df, in_df, gb_df, ie_df, mx_df, th_df]
test_final_df = pd.concat(test_frames)
test_f_count = test_final_df.shape[0]
print(f"Total Test Final DF rows: {test_f_count}")

print("Duplicates in test_final_df =>")
print(test_final_df.duplicated())

duplicated_df = test_final_df[test_final_df.duplicated()]
print(f"Duplicated df shape {duplicated_df.shape}")
removed_dups_df = test_final_df.drop_duplicates()
print(f"Removed Duplicated df shape {removed_dups_df.shape}")

print("Unique from countries: ")
print(pd.unique(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']))
print("Unique to countries: ")
print(pd.unique(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']))
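
Each per-country filter keeps a row when either the debtor or the creditor country matches, so a message between two of the listed countries lands in two per-country frames. That would explain why the summed count can exceed `df.shape[0]` and why `test_final_df` contains duplicates. A minimal pure-Python sketch, using hypothetical (debtor, creditor) pairs in place of DataFrame rows:

```python
# Hypothetical (debtor_country, creditor_country) pairs standing in for rows of df.
rows = [("US", "IN"), ("CA", "US"), ("GB", "IE"), ("MX", "TH"), ("IN", "GB")]
countries = ["CA", "GB", "IE", "IN", "MX", "TH", "US"]

# Same OR condition as the per-country filters: debtor OR creditor matches.
per_country = {c: [r for r in rows if c in r] for c in countries}

total_selected = sum(len(selected) for selected in per_country.values())
distinct = {r for selected in per_country.values() for r in selected}

print(f"Rows in dataset:        {len(rows)}")
print(f"Sum of per-country rows: {total_selected}")
print(f"Distinct rows selected:  {len(distinct)}")
```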

## Dataset Labeling 
In the next several steps, we label messages for each country in the raw dataset generated by the ISO 20022 message generator, `rapide`. The countries used are: CA, GB, IE, IN, MX, TH, US.

The rules for labeling are described in the code.


### Labeling for India

In [None]:
# Label raw data to create labeled raw data set

# Labeling for India as Creditor Country
to_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IN']
to_india_rows = to_india_df.shape[0]
print(f"No. of To India rows: {to_india_rows}")
to_india_df

In [None]:
# Labeling for India as Debtor Country
from_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IN']
from_india_rows = from_india_df.shape[0]
print(f"No. of From India rows: {from_india_rows}")
from_india_df

In [None]:
import string

# Make regulatory reporting missing i.e. NaN
def make_regulatory_reporting_nan(df):
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_DbtCdtRptgInd=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Nm=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Authrty_Ctry=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Tp=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Dt=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Ctry=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Amt_Ccy=np.nan)
    df = df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Amt_text=np.nan)
    return df

# Random strings
def generate_random_str():
    randstr0 = random.sample(string.ascii_lowercase,8)+random.sample(string.digits,4)+random.sample(string.ascii_lowercase,4)
    return ''.join(randstr0)

def generate_reg_code_str():
    randstr0 = random.sample(string.ascii_uppercase,1)+random.sample(string.digits,5)
    return ''.join(randstr0)
 
# Labeling for payments To India i.e. India Creditor Country
# Success - 25% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd with specified 
# purpose codes i.e ['00.P0006', '00.P0008', '13.P1301', '13.P1302']
# CdtTrfTxInf.InstrForNxtAgt.InstrInf='None' i.e. NaN
# Failure - 20% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd with specified values
# CdtTrfTxInf.InstrForNxtAgt.InstrInf=
# Presence of /REG/
# /REG/Any value or 00.P0006, 00.P0008, 13.P1301, 13.P1302 
# Success - 25% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000
# CdtTrfTxInf.InstrForNxtAgt.InstrInf=
# /REG/15.X0001 FDI in Retail
# /REG/15.X0002 FDI in Agriculture
# /REG/15.X0003 FDI in Transportation
# Failure - 25% - RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000
# CdtTrfTxInf.InstrForNxtAgt.InstrInf=
# Missing /REG/
# /REG/
# /REG/15.X000[2-100] FDI in Retail
# /REG/15.X0002 FDI in Retail
# /REG/15.X000[1|3|4-100] FDI in Agriculture
# /REG/15.X000[[1-2]|4-100]] FDI in Transportation
# /REG/15.X0003, 15.X0004, 16.XXXXX, 17.XXXXX 
# Failure - 5% - Missing Regulatory Reporting element
# 
to_india_fractions = np.array([0.25, 0.20, 0.25, 0.25, 0.05])
# Labeling for regulatory element
to_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IN']
# Shuffle
to_india_df = to_india_df.sample(frac=1)
# Split into 5 dataframes 
to_india_success_df1, to_india_failure_df1, to_india_success_df2, to_india_failure_df2, to_india_no_reg_df = np.array_split(to_india_df, 
 (to_india_fractions[:-1].cumsum() * len(to_india_df)).astype(int))

print(f"TO India_df after dataframe split, before processing row counts:")
print(f"to_india_df dataframe has {to_india_df.shape[0]} rows")
print(f"to_india_success_df1 dataframe has {to_india_success_df1.shape[0]} rows")
print(f"to_india_failure_df1 dataframe has {to_india_failure_df1.shape[0]} rows")
print(f"to_india_success_df2 dataframe has {to_india_success_df2.shape[0]} rows")
print(f"to_india_failure_df2 dataframe has {to_india_failure_df2.shape[0]} rows")
print(f"to_india_no_reg_df dataframe has {to_india_no_reg_df.shape[0]} rows")

#to_india_no_reg_df = to_india_df.sample(frack=0.05, random_state=299)

print(f"TO India after split, post processing row counts:")
# The ISO20022 message generator creates correct regulatory reporting structure
# For CRED, with codes 00.P0006, 00.P0008, 13.P1301, 13.P1302
# Part 1: Success with RgltryRptg.DbtCdtRptgInd=CRED
to_india_success_df1 = to_india_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
to_india_success_df1.insert(0, 'y_target', 'Success') 
print(f"to_india_success_df1 dataframe has {to_india_success_df1.shape[0]} rows")

print(f"Shape: {to_india_success_df1.shape}")
to_india_success_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
]]
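
The split relies on turning cumulative fractions into row indices. The sketch below reproduces that arithmetic in plain Python on a list instead of a DataFrame; `split_points` and `split_by_fractions` are illustrative names, and `int()` is assumed to play the role of `.astype(int)` (both truncate positive values).

```python
from itertools import accumulate

def split_points(fractions, n_rows):
    """Cut indices from cumulative fractions; the last fraction is implied."""
    return [int(c * n_rows) for c in accumulate(fractions[:-1])]

def split_by_fractions(items, fractions):
    """Slice `items` at the cut indices, like np.array_split with an index list."""
    cuts = [0] + split_points(fractions, len(items)) + [len(items)]
    return [items[a:b] for a, b in zip(cuts, cuts[1:])]

rows = list(range(8))  # stand-in for 8 shuffled rows
parts = split_by_fractions(rows, [0.25, 0.25, 0.5])
sizes = [len(p) for p in parts]
print(sizes)  # [2, 2, 4]
```

With fractions that are not exact in binary floating point, truncation can move a cut by one row. The parts still cover every row exactly once, but their sizes may differ slightly from the nominal percentages.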

In [None]:
# Part 2: Failure 1 with RgltryRptg.DbtCdtRptgInd=CRED, InstrForNxtAgt.InstrInf with /REG/ values
next_agt_instructions_fail_list = ['/REG/', '/REG/00.P0006', '/REG/00.P0008', '/REG/13.P1301', '/REG/13.P1302']

def gen_india_failure_1(df):
    rows = df.shape[0]
    print(f"gen_india_failure_1() dataframe has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_fail_list)
        append = random.choice([True, False])
        if next_agt_instructions=='/REG/':
            next_agt_instructions = next_agt_instructions + generate_random_str() if append else next_agt_instructions
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions

gen_india_failure_1(to_india_failure_df1)
to_india_failure_df1.insert(0, 'y_target', 'Failure')
print(f"to_india_failure_df1 dataframe has {to_india_failure_df1.shape[0]} rows")

print(f"Shape: {to_india_failure_df1.shape}")
to_india_failure_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
]]
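
The failure values are drawn from the global `random` module, so each run labels rows differently. If repeatable output is wanted, one option (an assumption, not something the notebook does) is to pass a dedicated seeded generator. The sketch below mirrors the selection logic of `gen_india_failure_1` without the DataFrame; the helper name and the seed value are hypothetical.

```python
import random
import string

FAIL_VALUES = ['/REG/', '/REG/00.P0006', '/REG/00.P0008', '/REG/13.P1301', '/REG/13.P1302']

def pick_failure_instruction(rng):
    """Pick one invalid /REG/ instruction using the supplied random.Random."""
    value = rng.choice(FAIL_VALUES)
    # For the bare '/REG/' tag, append a random suffix about half of the time.
    if value == '/REG/' and rng.choice([True, False]):
        suffix = (rng.sample(string.ascii_lowercase, 8)
                  + rng.sample(string.digits, 4)
                  + rng.sample(string.ascii_lowercase, 4))
        value += ''.join(suffix)
    return value

# Two generators built from the same (hypothetical) seed yield the same sequence.
rng_a = random.Random(42)
rng_b = random.Random(42)
run_a = [pick_failure_instruction(rng_a) for _ in range(5)]
run_b = [pick_failure_instruction(rng_b) for _ in range(5)]
print(run_a == run_b)
```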

In [None]:
# Part 3: Success with RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000
# Requires InstrForNxtAgt.InstrInf with /REG/ values
next_agt_instructions_succ_list = ['/REG/15.X0001 FDI in Retail', '/REG/15.X0002 FDI in Agriculture',
 '/REG/15.X0003 FDI in Transportation']
def gen_india_success_2(df):
    rows = df.shape[0]
    print(f"gen_india_success_2() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_succ_list)
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd']='00.00000'

gen_india_success_2(to_india_success_df2)
to_india_success_df2.insert(0, 'y_target', 'Success')
print(f"to_india_success_df2 dataframe has {to_india_success_df2.shape[0]} rows")

print(f"Shape: {to_india_success_df2.shape}")
to_india_success_df2[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
]]

In [None]:
import exrex

# Part 4: Failure with RgltryRptg.DbtCdtRptgInd=CRED, RgltryRptg.Dtls.Cd=00.00000
# Missing InstrForNxtAgt.InstrInf with accepted /REG/ values
next_agt_instructions_fail2_list = ['/REG/', '/REG/15.X000', '/REG/16.X']
regs1 = list(exrex.generate(r'(\/REG\/15\.X000[2-3]) FDI in Retail'))
regs2 = list(exrex.generate(r'\/REG\/15\.X000[13] FDI in Agriculture'))
regs3 = list(exrex.generate(r'(\/REG\/15\.X000[1-2]) FDI in Transportation'))
regs4 = list(exrex.generate(r'(\/REG\/15\.X000[4-9]) FDI in (Retail|Agriculture|Transportation)'))
#regs5 = list(exrex.generate('(\/REG\/15\.X000[1-9][0-9]) FDI in (Retail|Agriculture|Transportation)'))
regs5 = list(exrex.generate(r'(\/REG\/15\.X000\d{2}) FDI in (Retail|Agriculture|Transportation)'))
regs_fail_list = regs1 + regs2 + regs3 + regs4 + regs5
next_agt_instructions_fail2_list = next_agt_instructions_fail2_list + regs_fail_list

def gen_india_failure_2(df):
    rows = df.shape[0]
    print(f"gen_india_failure_2() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_fail2_list)
        makenull = random.choice([True, False])
        if next_agt_instructions=='/REG/':
            next_agt_instructions = '' if makenull else next_agt_instructions
        if next_agt_instructions=='/REG/15.X000':
            next_agt_instructions = next_agt_instructions + str(random.randint(4, 100))
        if next_agt_instructions=='/REG/16.X':
            next_agt_instructions = '/REG/' + str(random.randint(16, 100)) + '.' + generate_reg_code_str()
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd']='00.00000'
 
 
gen_india_failure_2(to_india_failure_df2)
to_india_failure_df2.insert(0, 'y_target', 'Failure')
print(f"to_india_failure_df2 dataframe has {to_india_failure_df2.shape[0]} rows")

print(f"Shape: {to_india_failure_df2.shape}")
to_india_failure_df2[['y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
 ]].head(10)
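
`exrex.generate` enumerates the strings a pattern can produce. As a rough cross-check that does not depend on exrex, the finite set described by the `regs4` pattern can be built by hand with `itertools.product`. The variable names below are illustrative, and the ordering is not claimed to match what exrex yields.

```python
from itertools import product

# Mirrors the character class [4-9] in the regs4 pattern.
codes = [f"/REG/15.X000{digit}" for digit in range(4, 10)]
# Mirrors the alternation (Retail|Agriculture|Transportation).
sectors = ["Retail", "Agriculture", "Transportation"]

expected = [f"{code} FDI in {sector}" for code, sector in product(codes, sectors)]

print(len(expected))   # 6 codes x 3 sectors
print(expected[0])
print(expected[-1])
```

None of these strings appear in the accepted list used for Part 3, which is what makes them usable as failure cases.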

In [None]:
# Part 5: Regulatory reporting is missing
to_india_no_reg_df = make_regulatory_reporting_nan(to_india_no_reg_df)
print(f"to_india_no_reg_df dataframe has {to_india_no_reg_df.shape[0]} rows")

print(f"Shape: {to_india_no_reg_df.shape}")
to_india_no_reg_df[[
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
 ]
]
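
The five To-India cases can be summarized as a single decision function. The sketch below is one reading of the rules in the comments above, not code from the notebook: `expected_to_india_label` is a hypothetical helper, `None` stands for a missing (NaN) field, and the final fallback for any other code is an assumption.

```python
PURPOSE_CODES = {"00.P0006", "00.P0008", "13.P1301", "13.P1302"}
ACCEPTED_REG_INSTRUCTIONS = {
    "/REG/15.X0001 FDI in Retail",
    "/REG/15.X0002 FDI in Agriculture",
    "/REG/15.X0003 FDI in Transportation",
}

def expected_to_india_label(dtls_cd, instr_inf):
    """Return 'Success' or 'Failure' for a To-India message."""
    if dtls_cd is None:
        # Part 5: regulatory reporting element is missing.
        return "Failure"
    if dtls_cd in PURPOSE_CODES:
        # Parts 1 and 2: a purpose code is valid only without an instruction.
        return "Success" if instr_inf is None else "Failure"
    if dtls_cd == "00.00000":
        # Parts 3 and 4: the instruction must be one of the accepted /REG/ values.
        return "Success" if instr_inf in ACCEPTED_REG_INSTRUCTIONS else "Failure"
    # Assumption: any other code is treated as a failure.
    return "Failure"

print(expected_to_india_label("13.P1301", None))
print(expected_to_india_label("00.00000", "/REG/15.X0002 FDI in Retail"))
```

A function like this could serve as a sanity check on the labeled frames, by comparing its result with `y_target` row by row.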

In [None]:
# Labeling for payments From India i.e. India Debtor Country
# Only one rule: presence or absence of regulatory reporting.
# Success - 50% - RgltryRptg.DbtCdtRptgInd=DEBT, RgltryRptg.Dtls.Cd with specified values
# ISO20022 Message Generator includes these specific value, no action needed
# CdtTrfTxInf.InstrForNxtAgt.InstrInf='None' i.e. NaN
# Failure - 50% - RgltryRptg.DbtCdtRptgInd=DEBT, RgltryRptg is missing

from_india_fractions = np.array([0.50, 0.50])
from_india_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IN']
# Shuffle
from_india_df = from_india_df.sample(frac=1)
# Split into 2 dataframes with and without regulatory reporting elements
from_india_with_reg_df, from_india_no_reg_df = np.array_split(
 from_india_df, 
 (from_india_fractions[:-1].cumsum() * len(from_india_df)).astype(int))

print(f"FROM India df after split, before processing row counts:")
print(f"from_india_df dataframe has {from_india_df.shape[0]} rows")
print(f"from_india_with_reg_df dataframe has {from_india_with_reg_df.shape[0]} rows")
print(f"from_india_no_reg_df dataframe has {from_india_no_reg_df.shape[0]} rows")

print(f"From India df after split, post processing row counts:")

# From India with regulatory reporting to Success
#from_india_with_reg_df = from_india_df.sample(frack=0.80, random_state=301)
from_india_with_reg_df = from_india_with_reg_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
from_india_with_reg_df.insert(0, 'y_target', 'Success') 
print(f"from_india_with_reg_df dataframe has {from_india_with_reg_df.shape[0]} rows")

# Remove regulatory element for India
# from_india_no_reg_df = from_india_df.sample(frack=0.20, random_state=299)
from_india_no_reg_df = make_regulatory_reporting_nan(from_india_no_reg_df)
print(f"from_india_no_reg_df dataframe has {from_india_no_reg_df.shape[0]} rows")

# final data frame with no regulatory reporting
india_no_reg_frames = [to_india_no_reg_df, from_india_no_reg_df]
india_no_reg_df = pd.concat(india_no_reg_frames)
india_no_reg_df = india_no_reg_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
# add target to Failure 
india_no_reg_df.insert(0, 'y_target', 'Failure')
print(f"india_no_reg_df dataframe has {india_no_reg_df.shape[0]} rows")

print(f"Shape: {india_no_reg_df.shape}")
india_no_reg_df[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
 ]
].head(5)

In [None]:
#from_india_with_reg_df.head()
from_india_with_reg_df[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_RgltryRptg_Dtls_Cd'
 ]
].head(5)

In [None]:
# Assemble all the dataframes to produce final dataset for India
final_india_frames = [to_india_success_df1, 
 to_india_failure_df1,
 to_india_success_df2,
 to_india_failure_df2,
 india_no_reg_df,
 from_india_with_reg_df]

final_india_df = pd.concat(final_india_frames)

print(f"Final labeled India dataframe shape: {final_india_df.shape}")
final_india_df.head()

In [None]:
success_df = final_india_df[final_india_df['y_target']=='Success']
success_df.shape

In [None]:
failure_df = final_india_df[final_india_df['y_target']=='Failure']
print(failure_df.shape)

### Labeling for US

In [None]:
# Labeling for payments To US i.e. US Creditor Country
# Success - 30% - CdtTrfTxInf_InstrForNxtAgt_InstrInf='None'
# Success - 30% - CdtTrfTxInf_InstrForNxtAgt_InstrInf=
# /SVC/It is to be delivered in one day. Two day penalty 2bp;three day penalty 3bp;greater than three days forfeit agent fee.
# /SVC/It is to be delivered in two days. Three day penalty 2bp;greater than three days forfeit agent fee.
# /SVC/It is to be delivered in three days. Greater than three days penalty add 2bp per day.
# 
# Failure - 40% - CdtTrfTxInf.InstrForNxtAgt.InstrInf=
# Presence of /SVC/ with nothing
# /SVC/Anyother string
# /SVC/It is to be delivered in four days. Greater than four days penalty add 2bp per day 
# 
to_us_fractions = np.array([0.30, 0.30, 0.40])
# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element
to_us_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='US']
# Shuffle
to_us_df = to_us_df.sample(frac=1)
# Split into 3 dataframes 
to_us_success_df1, to_us_success_df2, to_us_failure_df1 = np.array_split(to_us_df, 
 (to_us_fractions[:-1].cumsum() * len(to_us_df)).astype(int))

print(f"TO US after dataframe split, before processing row counts:")
print(f"to_us_df dataframe has {to_us_df.shape[0]} rows")
print(f"to_us_success_df1 dataframe has {to_us_success_df1.shape[0]} rows")
print(f"to_us_success_df2 dataframe has {to_us_success_df2.shape[0]} rows")
print(f"to_us_failure_df1 dataframe has {to_us_failure_df1.shape[0]} rows")

print(f"TO US after split, post processing row counts:")

# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN
to_us_success_df1 = to_us_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
to_us_success_df1.insert(0, 'y_target', 'Success') 
print(f"to_us_success_df1 dataframe has {to_us_success_df1.shape[0]} rows")

print(f"Shape: {to_us_success_df1.shape}")
to_us_success_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Part 2: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ valid strings
# 
next_agt_instructions_us_succ_list = [
 '/SVC/It is to be delivered in one day. Two day penalty 2bp;three day penalty 3bp;greater than three days forfeit agent fee.',
 '/SVC/It is to be delivered in two days. Three day penalty 2bp;greater than three days forfeit agent fee.',
 '/SVC/It is to be delivered in three days. Greater than three days penalty add 2bp per day.'
]

def gen_us_success_2(df):
    rows = df.shape[0]
    print(f"gen_us_success_2() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_us_succ_list)
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions

gen_us_success_2(to_us_success_df2)
to_us_success_df2.insert(0, 'y_target', 'Success') 
print(f"to_us_success_df2 dataframe has {to_us_success_df2.shape[0]} rows")

print(f"Shape: {to_us_success_df2.shape}")
to_us_success_df2[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Part 3: Failure with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ with invalid strings
# 
next_agt_instructions_us_failure_list = [
 '/SVC/'
]

gen_str1 = list(exrex.generate(r'\/SVC\/It is to be delivered in one day\. Two day penalty [3-9]bp;three day penalty ([1-2]|[4-9])bp;greater than three days forfeit agent fee\.'))
gen_str2 = list(exrex.generate(r'\/SVC\/It is to be delivered in two days\. Three penalty ([1]|[3-9])bp;greater than three days forfeit agent fee\.'))
gen_str3 = list(exrex.generate(r'\/SVC\/It is to be delivered in three days\. Greater than three days penalty add (1|[3-9])bp per day\.'))
gen_str4 = list(exrex.generate(r'\/SVC\/It is to be delivered in (two|three) day\. Three day penalty 2bp;three day penalty 3bp;greater than three days forfeit agent fee\.'))
gen_str5 = list(exrex.generate(r'\/SVC\/It is to be delivered in (one|three) days\. Three penalty 2bp;greater than three days forfeit agent fee\.'))
gen_str6 = list(exrex.generate(r'\/SVC\/It is to be delivered in (one|two) days\. Greater than three days penalty add 2bp per day\.'))
gen_str7 = list(exrex.generate(r'\/SVC\/It is to be delivered in (four|five|six) days\.'))
gen_str = gen_str1 + gen_str2 + gen_str3 + gen_str4 + gen_str5 + gen_str6 + gen_str7
print(len(gen_str))
next_agt_instructions_us_failure_list = next_agt_instructions_us_failure_list + gen_str
print(len(next_agt_instructions_us_failure_list))

def gen_us_failure_1(df):
    rows = df.shape[0]
    print(f"gen_us_failure_1() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_us_failure_list)
        #addstring = random.choice([True, False])
        #if next_agt_instructions=='/SVC/':
        #    next_agt_instructions = next_agt_instructions + generate_random_str() if addstring else next_agt_instructions
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf']=next_agt_instructions

gen_us_failure_1(to_us_failure_df1)
to_us_failure_df1.insert(0, 'y_target', 'Failure') 
print(f"to_us_failure_df1 dataframe has {to_us_failure_df1.shape[0]} rows")

print(f"Shape: {to_us_failure_df1.shape}")
to_us_failure_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Labeling for payments From US i.e. US Debtor Country
from_us_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='US']
# Shuffle
from_us_df = from_us_df.sample(frac=1)
# No Splits

print(f"FROM US dataframe, before processing row counts:")
print(f"from_us_df dataframe has {from_us_df.shape[0]} rows")

print(f"FROM US post processing row counts:")

# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN
from_us_df = from_us_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
from_us_df.insert(0, 'y_target', 'Success') 
print(f"from_us_df dataframe has {from_us_df.shape[0]} rows")

print(f"Shape: {from_us_df.shape}")
from_us_df[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Assemble all the dataframes to produce final dataset for US
final_us_frames = [to_us_success_df1, 
 to_us_success_df2,
 to_us_failure_df1,
 from_us_df]

final_us_df = pd.concat(final_us_frames)

print(f"Final labeled US dataframe shape: {final_us_df.shape}")
final_us_df.head()

### Labeling for Great Britain

In [None]:
# Labeling for payments To GB i.e. GB Creditor Country
# Success - 20% - CdtTrfTxInf_InstrForNxtAgt_InstrInf='None'
# Success - 30% - CdtTrfTxInf_InstrForNxtAgt_InstrInf=
# /ACC/No return possible – Creditor Account closing today 
# /SVC/It is delivered same business day. Non-delivery penalty 2bp per day.
# /SVC/It is to be delivered in one day. Two day penalty 2bp;greater than two days penalty add 1bp per day.
# 
# Failure - 50% - CdtTrfTxInf.InstrForNxtAgt.InstrInf=
# Presence of /ACC/ with nothing
# Presence of /SVC/ with nothing
# /SVC/ Anyother string
# /SVC/It is to be delivered in two days. Greater than two days penalty add 2bp per day.
# /SVC/It is to be delivered in three days. Greater than three days penalty add 2bp per day. 
# 
to_gb_fractions = np.array([0.20, 0.30, 0.50])
# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element
to_gb_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='GB']
# Shuffle
to_gb_df = to_gb_df.sample(frac=1)
# Split into 3 dataframes 
to_gb_success_df1, to_gb_success_df2, to_gb_failure_df1 = np.array_split(to_gb_df, 
 (to_gb_fractions[:-1].cumsum() * len(to_gb_df)).astype(int))

print(f"TO GB after dataframe split, before processing row counts:")
print(f"to_gb_df dataframe has {to_gb_df.shape[0]} rows")
print(f"to_gb_success_df1 dataframe has {to_gb_success_df1.shape[0]} rows")
print(f"to_gb_success_df2 dataframe has {to_gb_success_df2.shape[0]} rows")
print(f"to_gb_failure_df1 dataframe has {to_gb_failure_df1.shape[0]} rows")

print(f"TO GB after split, post processing row counts:")

# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN
to_gb_success_df1 = to_gb_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
to_gb_success_df1.insert(0, 'y_target', 'Success') 
print(f"to_gb_success_df1 dataframe has {to_gb_success_df1.shape[0]} rows")

print(f"Shape: {to_gb_success_df1.shape}")
to_gb_success_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Part 2: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ or /ACC/ valid strings
# 
next_agt_instructions_gb_succ_list = [
 '/ACC/No return possible – Creditor Account closing today', 
 '/SVC/It is delivered same business day. Non-delivery penalty 2bp per day.',
 '/SVC/It is to be delivered in one day. Two day penalty 2bp;greater than two days penalty add 1bp per day.'
]

def gen_gb_success_2(df):
    rows = df.shape[0]
    print(f"gen_gb_success_2() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_gb_succ_list)
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'] = next_agt_instructions

gen_gb_success_2(to_gb_success_df2)
to_gb_success_df2.insert(0, 'y_target', 'Success') 
print(f"to_gb_success_df2 dataframe has {to_gb_success_df2.shape[0]} rows")

print(f"Shape: {to_gb_success_df2.shape}")
to_gb_success_df2[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]
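
Unless a seed is set earlier in the notebook, both `sample(frac=1)` and `random.choice` draw from unseeded generators, so each run assigns instructions differently. If a repeatable labeled dataset is wanted, one option is to draw from a dedicated, seeded `random.Random` instance. The sketch below illustrates the idea on plain row ids; `assign_instructions` and the seed value are assumptions for illustration, not part of the notebook.

```python
import random

# Valid GB instructions, copied from next_agt_instructions_gb_succ_list above
gb_success_instructions = [
    '/ACC/No return possible – Creditor Account closing today',
    '/SVC/It is delivered same business day. Non-delivery penalty 2bp per day.',
    '/SVC/It is to be delivered in one day. Two day penalty 2bp;greater than two days penalty add 1bp per day.',
]

def assign_instructions(row_ids, choices, seed):
    """Map each row id to a randomly chosen instruction, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    return {row_id: rng.choice(choices) for row_id in row_ids}

first = assign_instructions(range(5), gb_success_instructions, seed=42)
second = assign_instructions(range(5), gb_success_instructions, seed=42)
print(first == second)  # True: same seed, same assignment
```

For the shuffle step, `DataFrame.sample` accepts a `random_state` argument that serves the same purpose.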

In [None]:
# Part 3: Failure with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ or /ACC/ with invalid strings
# Presence of /ACC/ with nothing
# Presence of /SVC/ with nothing
# /ACC/ Any other string
# /SVC/ Any other string
# /SVC/It is to be delivered in two days. Greater than two days penalty add 2bp per day
# /SVC/It is to be delivered in three days. Greater than three days penalty add 2bp per day
next_agt_instructions_gb_failure_list = [
 '/ACC/',
 '/SVC/'
]

# Raw strings keep the regex escapes (\/ and \.) from being read as Python string escapes
gen_str_a1 = list(exrex.generate(r'\/ACC\/return possible – Creditor Account closing (today|tomorrow|this week|next week)'))
gen_str_a2 = list(exrex.generate(r'\/ACC\/Yes return possible – Creditor Account closing (today|tomorrow|this week|next week)'))

gen_str1 = list(exrex.generate(r'\/SVC\/It is delivered same business day\. Non-delivery penalty (?=[1]|[3-9])bp per day\.'))
gen_str2 = list(exrex.generate(r'\/SVC\/It is to be delivered in one business day\. Two day penalty (?=[1]|[3-9])bp;greater than two days penalty add (?=[2-9])bp per day\.'))
gen_str3 = list(exrex.generate(r'\/SVC\/It is to be delivered in (one|two|three|four) business day\. Non-delivery penalty 2bp per day\.'))
gen_str4 = list(exrex.generate(r'\/SVC\/It is to be delivered in (same|two|three|four|five) business day\. Two day penalty 2bp;greater than two days penalty add 1bp per day\.'))

gen_str = gen_str1 + gen_str2 + gen_str3 + gen_str4 + gen_str_a1 + gen_str_a2

next_agt_instructions_gb_failure_list = next_agt_instructions_gb_failure_list + gen_str

def gen_gb_failure_1(df):
    rows = df.shape[0]
    print(f"gen_gb_failure_1() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_gb_failure_list)
        # addstring = random.choice([True, False])
        # if next_agt_instructions=='/ACC/':
        #     next_agt_instructions = next_agt_instructions + generate_random_str() if addstring else next_agt_instructions
        # if next_agt_instructions=='/SVC/':
        #     next_agt_instructions = next_agt_instructions + generate_random_str() if addstring else next_agt_instructions
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'] = next_agt_instructions

gen_gb_failure_1(to_gb_failure_df1)
to_gb_failure_df1.insert(0, 'y_target', 'Failure') 
print(f"to_gb_failure_df1 dataframe has {to_gb_failure_df1.shape[0]} rows")

print(f"Shape: {to_gb_failure_df1.shape}")
to_gb_failure_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]
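
A failure string is only useful as a negative example if it can never coincide with a string on the success list. The sketch below rebuilds a few of the alternation patterns above with `itertools.product` instead of `exrex` (a stand-in chosen so the example needs only the standard library) and checks that the generated variants are disjoint from the valid GB instructions. The helper name `expand` is hypothetical.

```python
from itertools import product

# Valid GB instructions, copied from next_agt_instructions_gb_succ_list
gb_success = {
    '/ACC/No return possible – Creditor Account closing today',
    '/SVC/It is delivered same business day. Non-delivery penalty 2bp per day.',
    '/SVC/It is to be delivered in one day. Two day penalty 2bp;greater than two days penalty add 1bp per day.',
}

def expand(template, **alternatives):
    """Fill every combination of the named alternatives into the template."""
    names = list(alternatives)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(alternatives[n] for n in names))
    ]

when = ['today', 'tomorrow', 'this week', 'next week']
days = ['one', 'two', 'three', 'four']

gb_failure = (
    ['/ACC/', '/SVC/']
    + expand('/ACC/return possible – Creditor Account closing {when}', when=when)
    + expand('/ACC/Yes return possible – Creditor Account closing {when}', when=when)
    + expand('/SVC/It is to be delivered in {days} business day. Non-delivery penalty 2bp per day.', days=days)
)

# Any string present in both collections would receive contradictory labels
overlap = gb_success.intersection(gb_failure)
print(len(gb_failure), overlap)  # 14 set()
```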

In [None]:
# Labeling for payments From GB i.e. GB Debtor Country
from_gb_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='GB']
# Shuffle
from_gb_df = from_gb_df.sample(frac=1)
# No Splits

print(f"FROM GB dataframe, before processing row counts:")
print(f"from_gb_df dataframe has {from_gb_df.shape[0]} rows")

print(f"FROM GB post processing row counts:")

# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN
from_gb_df = from_gb_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
from_gb_df.insert(0, 'y_target', 'Success')
print(f"from_gb_df dataframe has {from_gb_df.shape[0]} rows")

print(f"Shape: {from_gb_df.shape}")
from_gb_df[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Assemble all the dataframes to produce final dataset for GB
final_gb_frames = [to_gb_success_df1, 
 to_gb_success_df2,
 to_gb_failure_df1,
 from_gb_df]

final_gb_df = pd.concat(final_gb_frames)

print(f"Final labeled GB dataframe shape: {final_gb_df.shape}")
final_gb_df.head()

### Labeling for Ireland

In [None]:
# Labeling for payments To Ireland i.e. IE Creditor Country
# Success - 25% - CdtTrfTxInf_InstrForNxtAgt_InstrInf='None'
# Success - 25% - CdtTrfTxInf_InstrForNxtAgt_InstrInf=
# /SVC/It is to be delivered in one day. Two day penalty 2bp;three days penalty 5bp.
# /TRSY/Treasury Services Platinum Customer
# /TRSY/Treasury Services Gold Customer
# /TRSY/Treasury Services Silver Customer
# 
# Failure - 50% - CdtTrfTxInf.InstrForNxtAgt.InstrInf=
# Presence of /SVC/ with nothing
# /SVC/ Any other string
# Presence of /TRSY/ with nothing
# /TRSY/ Any other string
# 
to_ie_fractions = np.array([0.25, 0.25, 0.50])
# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element
to_ie_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry']=='IE']
# Shuffle
to_ie_df = to_ie_df.sample(frac=1)
# Split into 3 dataframes 
to_ie_success_df1, to_ie_success_df2, to_ie_failure_df1 = np.array_split(to_ie_df, 
 (to_ie_fractions[:-1].cumsum() * len(to_ie_df)).astype(int))

print(f"TO IE after dataframe split, before processing row counts:")
print(f"to_ie_df dataframe has {to_ie_df.shape[0]} rows")
print(f"to_ie_success_df1 dataframe has {to_ie_success_df1.shape[0]} rows")
print(f"to_ie_success_df2 dataframe has {to_ie_success_df2.shape[0]} rows")
print(f"to_ie_failure_df1 dataframe has {to_ie_failure_df1.shape[0]} rows")

print(f"TO IE after split, post processing row counts:")

# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN
to_ie_success_df1 = to_ie_success_df1.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
to_ie_success_df1.insert(0, 'y_target', 'Success') 
print(f"to_ie_success_df1 dataframe has {to_ie_success_df1.shape[0]} rows")

print(f"Shape: {to_ie_success_df1.shape}")
to_ie_success_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Part 2: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ or /TRSY/ valid strings
# 
next_agt_instructions_ie_succ_list = [
 '/SVC/It is to be delivered in one day. Two day penalty 2bp;three days penalty 5bp.',
 '/TRSY/Treasury Services Platinum Customer',
 '/TRSY/Treasury Services Gold Customer',
 '/TRSY/Treasury Services Silver Customer'
]

def gen_ie_success_2(df):
    rows = df.shape[0]
    print(f"gen_ie_success_2() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_ie_succ_list)
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'] = next_agt_instructions

gen_ie_success_2(to_ie_success_df2)
to_ie_success_df2.insert(0, 'y_target', 'Success') 
print(f"to_ie_success_df2 dataframe has {to_ie_success_df2.shape[0]} rows")

print(f"Shape: {to_ie_success_df2.shape}")
to_ie_success_df2[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Part 3: Failure with CdtTrfTxInf_InstrForNxtAgt_InstrInf=/SVC/ or /TRSY/ with invalid strings
# Presence of /SVC/ with nothing
# /SVC/ Any other string
# Presence of /TRSY/ with nothing
# /TRSY/ Any other string
next_agt_instructions_ie_failure_list = [
 '/SVC/',
 '/TRSY/'
]

# Raw strings keep the regex escapes (\/ and \.) from being read as Python string escapes
gen_str1 = list(exrex.generate(r'\/SVC\/It is to be delivered in one day\. Two day penalty (?=1|[3-9])bp;three days penalty (?=[1-4]|[6-9])bp\.'))
gen_str2 = list(exrex.generate(r'\/TRSY\/Treasury Services (Bronze|Copper|Iron|Glass|Earth|Water|Fire|Unimportant|Low|Important|Corporate|Small Business|Large Business|Business) Customer'))
next_agt_instructions_ie_failure_list = next_agt_instructions_ie_failure_list + gen_str1 + gen_str2

def gen_ie_failure_1(df):
    rows = df.shape[0]
    print(f"gen_ie_failure_1() has {rows} rows")
    for i in df.index:
        next_agt_instructions = random.choice(next_agt_instructions_ie_failure_list)
        # addstring = random.choice([True, False])
        # next_agt_instructions = next_agt_instructions + generate_random_str() if addstring else next_agt_instructions
        df.at[i, 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'] = next_agt_instructions

gen_ie_failure_1(to_ie_failure_df1)
to_ie_failure_df1.insert(0, 'y_target', 'Failure') 
print(f"to_ie_failure_df1 dataframe has {to_ie_failure_df1.shape[0]} rows")

print(f"Shape: {to_ie_failure_df1.shape}")
to_ie_failure_df1[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]
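
Taken together, the three Ireland cells encode a simple rule for payments to IE: a missing instruction or one of four exact strings is labeled `Success`, and other `/SVC/` or `/TRSY/` content is labeled `Failure`. The sketch below restates that rule as a function, which can be handy for spot-checking generated rows. `expected_ie_label` is a hypothetical helper, and treating every unlisted string as a failure is an assumption that goes slightly beyond the cases the notebook generates.

```python
import math

# Valid IE instructions, copied from next_agt_instructions_ie_succ_list above
IE_VALID_INSTRUCTIONS = {
    '/SVC/It is to be delivered in one day. Two day penalty 2bp;three days penalty 5bp.',
    '/TRSY/Treasury Services Platinum Customer',
    '/TRSY/Treasury Services Gold Customer',
    '/TRSY/Treasury Services Silver Customer',
}

def expected_ie_label(instruction):
    """Return the label the synthetic rules give a payment to IE with this InstrInf value."""
    # A missing instruction (None or NaN) is always accepted
    if instruction is None or (isinstance(instruction, float) and math.isnan(instruction)):
        return 'Success'
    # Only exact matches against the valid list are accepted
    return 'Success' if instruction in IE_VALID_INSTRUCTIONS else 'Failure'

print(expected_ie_label(float('nan')))                               # Success
print(expected_ie_label('/TRSY/Treasury Services Gold Customer'))    # Success
print(expected_ie_label('/TRSY/'))                                   # Failure
print(expected_ie_label('/TRSY/Treasury Services Bronze Customer'))  # Failure
```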

In [None]:
# Labeling for payments From Ireland i.e. IE Debtor Country
from_ie_df = df[df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IE']
# Shuffle
from_ie_df = from_ie_df.sample(frac=1)
# No Splits

print(f"FROM IE dataframe, before processing row counts:")
print(f"from_ie_df dataframe has {from_ie_df.shape[0]} rows")

print(f"FROM IE post processing row counts:")

# Part 1: Success with CdtTrfTxInf_InstrForNxtAgt_InstrInf=NaN
from_ie_df = from_ie_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
from_ie_df.insert(0, 'y_target', 'Success')
print(f"from_ie_df dataframe has {from_ie_df.shape[0]} rows")

print(f"Shape: {from_ie_df.shape}")
from_ie_df[[
 'y_target',
 'Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf'
]]

In [None]:
# Assemble all the dataframes to produce final dataset for Ireland
final_ie_frames = [to_ie_success_df1, 
 to_ie_success_df2,
 to_ie_failure_df1,
 from_ie_df]

final_ie_df = pd.concat(final_ie_frames)

print(f"Final labeled Ireland dataframe shape: {final_ie_df.shape}")
final_ie_df.head()

### Labeling for Canada, Mexico, Thailand
To Canada, Mexico, Thailand from all except India:

In [None]:
# To Canada, Mexico, Thailand from all, except India

final_to_ca_mx_th_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry'].isin(['CA', 'MX', 'TH'])) & 
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']!='IN')]

print(f"To Canada, Mexico, Thailand, from all except India dataframe shape: {final_to_ca_mx_th_df.shape}")

final_to_ca_mx_th_df = final_to_ca_mx_th_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
final_to_ca_mx_th_df.insert(0, 'y_target', 'Success') 

final_to_ca_mx_th_df.head()

To Canada, Mexico, Thailand from India:

In [None]:
# To Canada, Mexico, Thailand from India

to_ca_mx_th_fr_in_df = df[(df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Cdtr_PstlAdr_Ctry'].isin(['CA', 'MX', 'TH'])) & 
 (df['Document_FIToFICstmrCdtTrf_CdtTrfTxInf_Dbtr_PstlAdr_Ctry']=='IN')]

print(f"To Canada, Mexico, Thailand from India dataframe shape: {to_ca_mx_th_fr_in_df.shape}")

to_fractions = np.array([0.50, 0.50])
# Labeling for CdtTrfTxInf_InstrForNxtAgt_InstrInf element
# Split into 2 dataframes 
to_ca_mx_th_fr_in_succ_df, to_ca_mx_th_fr_in_failure_df = np.array_split(to_ca_mx_th_fr_in_df, 
 (to_fractions[:-1].cumsum() * len(to_ca_mx_th_fr_in_df)).astype(int))

# Success half: no instruction for next agent
to_ca_mx_th_fr_in_succ_df = to_ca_mx_th_fr_in_succ_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
to_ca_mx_th_fr_in_succ_df.insert(0, 'y_target', 'Success')

# Failure half: regulatory reporting blanked out, no instruction for next agent
to_ca_mx_th_fr_in_failure_df = make_regulatory_reporting_nan(to_ca_mx_th_fr_in_failure_df)
to_ca_mx_th_fr_in_failure_df = to_ca_mx_th_fr_in_failure_df.assign(Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf=np.nan)
to_ca_mx_th_fr_in_failure_df.insert(0, 'y_target', 'Failure')

frames = [to_ca_mx_th_fr_in_succ_df, to_ca_mx_th_fr_in_failure_df]
final_to_ca_mx_th_fr_in_df = pd.concat(frames)

print(f"final_to_ca_mx_th_fr_in_df shape: {final_to_ca_mx_th_fr_in_df.shape}")

final_to_ca_mx_th_fr_in_df.head()

### Final Assembly of Labeled Dataset

In [None]:
final_frames = [final_us_df, final_india_df, final_gb_df, final_ie_df, final_to_ca_mx_th_df, final_to_ca_mx_th_fr_in_df]

final_df = pd.concat(final_frames)

final_df = final_df.sample(frac=1)

labeled_raw_data_output_path = 'synthetic-data/labeled_data.csv'
print(f'Saving labeled synthetic raw data with headers to {labeled_raw_data_output_path}')
final_df.to_csv(labeled_raw_data_output_path, index=False)

print(f"Final labeled dataframe shape: {final_df.shape}")
final_df.head()
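
Before training on the saved file, it can be useful to confirm that both labels are present and roughly how they are balanced. The sketch below counts `y_target` values in a CSV using only the standard library. The three rows are made-up placeholders, not output from this notebook, and `label_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file, label_column='y_target'):
    """Count how often each label occurs in the given CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)

# Placeholder rows standing in for labeled_data.csv (illustrative only)
sample_csv = io.StringIO(
    "y_target,Document_FIToFICstmrCdtTrf_CdtTrfTxInf_InstrForNxtAgt_InstrInf\n"
    "Success,\n"
    "Failure,/SVC/\n"
    "Success,/TRSY/Treasury Services Gold Customer\n"
)

counts = label_counts(sample_csv)
print(counts['Success'], counts['Failure'])  # 2 1
```

With the real file, pass `open(labeled_raw_data_output_path, newline='')` in place of the in-memory sample.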

### Upload to S3 Bucket

In [None]:
labeled_data_location = sm_session.upload_data(
 path=labeled_raw_data_output_path,
 bucket=s3_bucket_name,
 key_prefix=labeled_data_prefix,
)
print(f"Labeled Data Location: {labeled_data_location}")
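
For a single file, `upload_data` places the object under the given key prefix and returns its S3 URI. The sketch below shows the shape of URI to expect, so the printed location can be checked at a glance. The bucket and prefix values are placeholders, `expected_s3_uri` is a hypothetical helper, and the URI layout is an assumption based on the SageMaker Python SDK's documented behavior rather than something this notebook states.

```python
import posixpath

def expected_s3_uri(bucket, key_prefix, local_path):
    """Build s3://<bucket>/<key_prefix>/<file name> for a single uploaded file."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{posixpath.join(key_prefix, file_name)}"

# Placeholder bucket and prefix, for illustration only
uri = expected_s3_uri('my-example-bucket', 'iso20022/labeled', 'synthetic-data/labeled_data.csv')
print(uri)  # s3://my-example-bucket/iso20022/labeled/labeled_data.csv
```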