# SageMaker Training Job Image URI Restriction

This example notebook can be used for testing the solution provided.

To use this notebook, create the CloudFormation stack and set the `TrainingContainers` parameter to the following value:

**TrainingContainers**: pytorch-training:1.10-cpu-py38

With this value, the solution restricts training jobs that use the SageMaker container for PyTorch v1.10.
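To illustrate what such a restriction implies, the sketch below checks whether the last path segment of a training image URI equals one of the restricted `repository:tag` values. The function name `is_restricted`, the placeholder account ID, and the exact-match rule are assumptions made for this example; the deployed solution may match images differently.

```python
def is_restricted(image_uri, restricted_containers):
    """Return True if the image URI points at a restricted repository:tag."""
    # Keep only the part after the registry host,
    # e.g. "pytorch-training:1.10-cpu-py38".
    repository_and_tag = image_uri.rsplit("/", 1)[-1]
    return repository_and_tag in restricted_containers


# Value passed to the TrainingContainers parameter in this notebook.
restricted = ["pytorch-training:1.10-cpu-py38"]

# Hypothetical image URIs (account ID and region are placeholders).
registry = "111122223333.dkr.ecr.us-east-1.amazonaws.com"
blocked_uri = registry + "/pytorch-training:1.10-cpu-py38"
allowed_uri = registry + "/pytorch-training:1.12-cpu-py38"

print(is_restricted(blocked_uri, restricted))  # True
print(is_restricted(allowed_uri, restricted))  # False
```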

## Dataset

We are using a subset of ~20000 records of synthetic transactions, each of which is labeled as fraudulent or not fraudulent.
We'd like to train a model based on the features of these transactions so that we can predict risky or fraudulent transactions in the future.

This is a binary classification problem:

* 1 - Fraud
* 0 - No Fraud

In [None]:
! rm -rf ./data && mkdir -p ./data

! aws s3 cp s3://sagemaker-sample-files/datasets/tabular/synthetic_credit_card_transactions/user0_credit_card_transactions.csv ./data/data.csv

***

## Prerequisites

Install the latest version of the SageMaker Python SDK.

In [None]:
! pip install 'sagemaker' --upgrade

***

## Part 1/3 - Setup

Here we'll import some libraries and define some variables.

In [None]:
import boto3
import json
import logging
import sagemaker
from sagemaker.pytorch import PyTorch

In [None]:
logging.basicConfig(level=logging.INFO)
LOGGER = logging.getLogger(__name__)

In [None]:
s3_client = boto3.client("s3")

In [None]:
sagemaker_session = sagemaker.Session()
region = boto3.session.Session().region_name
role = sagemaker.get_execution_role()

bucket_name = sagemaker_session.default_bucket()

### Upload Dataset in the Default Amazon S3 Bucket

To make the data available to the training job, upload the downloaded dataset to the default S3 bucket.

In [None]:
# Remove any previously uploaded copy; the key must name the object, not only its prefix.
s3_client.delete_object(Bucket=bucket_name, Key="sg-container-restriction/data/input/data.csv")

input_data = sagemaker_session.upload_data(
    "./data/data.csv", key_prefix="sg-container-restriction/data/input"
)

input_data
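For a single file, `upload_data` should return the S3 URI of the uploaded object, composed of the bucket, the key prefix, and the file name. The helper below is a hypothetical, offline sketch of that composition (it does not call AWS), and the bucket name is a placeholder.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the S3 URI expected for a single uploaded file."""
    file_name = posixpath.basename(local_path)
    return "s3://{}/{}".format(bucket, posixpath.join(key_prefix, file_name))


uri = expected_s3_uri(
    "example-bucket",  # placeholder for the default bucket name
    "sg-container-restriction/data/input",
    "./data/data.csv",
)
print(uri)  # s3://example-bucket/sg-container-restriction/data/input/data.csv
```

The training job later receives the whole prefix as its `train` channel, which is why the training script reads the file as `data.csv` inside the channel directory.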

***

### Create Training Script

Create the file `train.py`, which serves as the entry point of the SageMaker Training Job.

In [None]:
%%writefile train.py

from argparse import ArgumentParser
import csv
import glob
import logging
import os
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import DataLoader, TensorDataset
import traceback

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

device = "cuda" if torch.cuda.is_available() else "cpu"
        
def split_data(df):
    try:
        # Work on an explicit copy so the column assignments below do not act on a slice.
        df = df[df['Is Fraud?'].notna()].copy()

        df.insert(0, 'ID', range(1, len(df) + 1))

        df["Errors?"] = df["Errors?"].fillna('')
        df['Errors?'] = df['Errors?'].map(lambda x: x.strip())
        df["Errors?"] = df["Errors?"].map({
            "Insufficient Balance": 0,
            "Technical Glitch": 1,
            "Bad PIN": 2,
            "Bad Expiration": 3,
            "Bad Card Number": 4,
            "Bad CVV": 5,
            "Bad PIN,Insufficient Balance": 6,
            "Bad PIN,Technical Glitch": 7,
            "": 8
        })

        df["Use Chip"] = df["Use Chip"].fillna('')
        df['Use Chip'] = df['Use Chip'].map(lambda x: x.strip())
        df["Use Chip"] = df["Use Chip"].map({
            "Swipe Transaction": 0,
            "Chip Transaction": 1,
            "Online Transaction": 2
        })

        df['Is Fraud?'] = df['Is Fraud?'].map(lambda x: x.replace("'", ""))
        df['Is Fraud?'] = df['Is Fraud?'].map(lambda x: x.strip())
        df['Is Fraud?'] = df['Is Fraud?'].replace('', np.nan)
        df['Is Fraud?'] = df['Is Fraud?'].replace(' ', np.nan)

        df["Is Fraud?"] = df["Is Fraud?"].map({"No": 0, "Yes": 1})

        df = df.rename(
            columns={'Card': 'card', 'MCC': 'mcc', "Errors?": "errors", "Use Chip": "use_chip", "Is Fraud?": "labels"})

        df = df[["card", "mcc", "errors", "use_chip", "labels"]]
        
        train, test = train_test_split(df, test_size=0.2, shuffle=True)
        
        return train, test
    except Exception as e:
        stacktrace = traceback.format_exc()
        logger.error("{}".format(stacktrace))

        raise e

def prepare_data(train, test):
    try:
        X_train, y_train = train.iloc[:, train.columns != 'labels'], train.iloc[:, train.columns == 'labels']
        X_test, y_test = test.iloc[:, test.columns != 'labels'], test.iloc[:, test.columns == 'labels']

        y_test = y_test.astype("int64")

        scaler = preprocessing.MinMaxScaler()

        X_train = scaler.fit_transform(X_train.values)
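        # Reuse the statistics fitted on the training set; refitting on the test set would scale it inconsistently.
        X_test = scaler.transform(X_test.values)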

        X_train_tensor = torch.from_numpy(X_train)
        y_train_tensor = torch.from_numpy(y_train.values.ravel()).float()
        y_train_tensor = y_train_tensor.unsqueeze(1)

        X_test_tensor = torch.from_numpy(X_test)
        y_test_tensor = torch.from_numpy(y_test.values.ravel()).float()
        y_test_tensor = y_test_tensor.unsqueeze(1)

        train_ds = TensorDataset(X_train_tensor, y_train_tensor)
        test_ds = TensorDataset(X_test_tensor, y_test_tensor)

        train_dl = DataLoader(train_ds, batch_size=args.batch_size)
        test_dl = DataLoader(test_ds, batch_size=32)

        return train_dl, test_dl
    except Exception as e:
        stacktrace = traceback.format_exc()
        logger.error("{}".format(stacktrace))

        raise e

class BinaryClassifierModel(torch.nn.Module):
    def __init__(self, shape):
        super(BinaryClassifierModel, self).__init__()

        self.d1 = torch.nn.Linear(shape, 32)
        self.d2 = torch.nn.Linear(32, 64)
        self.drop = torch.nn.Dropout(0.2)
        self.output = torch.nn.Linear(64, 1)

    def forward(self, x):
        x = torch.relu(self.d1(x))
        x = torch.relu(self.d2(x))
        x = self.drop(x)
        x = torch.sigmoid(self.output(x))

        return x

if __name__ == '__main__':

    parser = ArgumentParser()

    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning_rate', type=float, default=1.45e-4)
    parser.add_argument('--batch_size', type=int, default=100)
    parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))

    args = parser.parse_args()
    
    data = pd.read_csv(
        args.train + "/data.csv",
        sep=',',
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        escapechar='\\',
        encoding='utf-8',
        error_bad_lines=False
    )
    
    train, test = split_data(data)

    train_dl, test_dl = prepare_data(train, test)

    model = BinaryClassifierModel(train.shape[1] - 1)

    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=args.learning_rate)

    loss_obj = torch.nn.BCELoss()

    model.train()
    train_loss = []

    for epoch in range(args.epochs):
        logger.info("Epoch {}".format(epoch + 1))

        # Within each epoch, iterate over the mini-batches of the training data.
        for xb, yb in train_dl:
            xb = xb.to(device)
            yb = yb.to(device)

            y_pred = model(xb.float()) # Forward Propagation

            loss = loss_obj(y_pred, yb)  # Loss Computation

            optimizer.zero_grad()  # Clearing all previous gradients, setting to zero
            loss.backward()  # Back Propagation
            optimizer.step()  # Updating the parameters

        logger.info("Training Loss: {}".format(loss.item()))
        train_loss.append(loss.item())

    torch.save(model.cpu(), os.path.join(args.model_dir, "model.pth"))
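The training script scales its features with `MinMaxScaler`. The plain-Python sketch below shows the same idea on made-up sample values: the per-column minimum and maximum are learned from the training rows only and then reused for the test rows. The helper names `fit_min_max` and `apply_min_max` are hypothetical and are not part of `train.py`.

```python
def fit_min_max(rows):
    """Return a (min, max) pair for every column of the given rows."""
    columns = list(zip(*rows))
    return [(min(column), max(column)) for column in columns]


def apply_min_max(rows, stats):
    """Scale every value with (value - min) / (max - min), column by column."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - low) / (high - low) if high > low else 0.0
            for value, (low, high) in zip(row, stats)
        ])
    return scaled


# Made-up rows with two features each, for illustration only.
train_rows = [[0, 5411], [2, 5912], [4, 4829]]
test_rows = [[1, 5300], [3, 6000]]

stats = fit_min_max(train_rows)           # learned from training data only
train_scaled = apply_min_max(train_rows, stats)
test_scaled = apply_min_max(test_rows, stats)

print(train_scaled)
print(test_scaled)
```

Because the statistics come from the training rows, a test value outside the training range (here `6000`) is scaled to a value slightly above 1 instead of being squeezed back into `[0, 1]`.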

### Global Parameters

In this section, we define the parameters for the SageMaker Estimator. We set the framework version to PyTorch v1.12, which is not restricted, and check that the SageMaker Training Job runs.

In [None]:
training_input_files_path = "sg-container-restriction/data/input"
training_output_files_path = "sg-container-restriction/models"
training_framework_version = "1.12"
training_python_version = "py38"
training_hyperparameters = {"epochs": 6, "learning_rate": 1.34e-4, "batch_size": 100}

training_instance_count = 1
training_instance_type = "ml.m5.large"
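The entries of `training_hyperparameters` reach `train.py` as command-line arguments, which the script reads with `argparse`. The sketch below reproduces that hand-off locally; the flattening of the dictionary into `--name value` pairs is an approximation of what the training container does, not its exact implementation.

```python
from argparse import ArgumentParser

# Same hyperparameter arguments that train.py declares.
parser = ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--learning_rate", type=float, default=1.45e-4)
parser.add_argument("--batch_size", type=int, default=100)

hyperparameters = {"epochs": 6, "learning_rate": 1.34e-4, "batch_size": 100}

# Flatten the dictionary into "--name value" pairs (assumed format).
argv = []
for name, value in hyperparameters.items():
    argv.extend(["--{}".format(name), str(value)])

args = parser.parse_args(argv)
print(args.epochs, args.learning_rate, args.batch_size)
```

The dictionary keys have to match the argument names declared in the script; an unknown key would make `parse_args` exit with an error.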

### SageMaker Training Job

In [None]:
estimator = PyTorch(
    entry_point="./train.py",
    framework_version=training_framework_version,
    py_version=training_python_version,
    output_path="s3://{}/{}".format(bucket_name,
                                    training_output_files_path),
    hyperparameters=training_hyperparameters,
    role=role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    disable_profiler=True
)

In [None]:
estimator.fit(
    inputs={
        "train": "s3://{}/{}".format(
            bucket_name,
            training_input_files_path
        )
    }
)

***

## Change Framework Version

If we change the PyTorch version to 1.10, the solution is expected to stop the SageMaker Training Job automatically, because the job uses a container version on the provided list of restricted containers.

In [None]:
training_framework_version = "1.10"

In [None]:
estimator = PyTorch(
    entry_point="./train.py",
    framework_version=training_framework_version,
    py_version=training_python_version,
    output_path="s3://{}/{}".format(bucket_name,
                                    training_output_files_path),
    hyperparameters=training_hyperparameters,
    role=role,
    instance_count=training_instance_count,
    instance_type=training_instance_type,
    disable_profiler=True
)

In [None]:
estimator.fit(
    inputs={
        "train": "s3://{}/{}".format(
            bucket_name,
            training_input_files_path
        )
    }
)
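After the restricted job has been submitted, its final state can be inspected through the `DescribeTrainingJob` API. The sketch below only formats such a response; the helper `summarize_training_job` and the sample dictionary are hypothetical, and the status that the solution actually leaves the job in should be verified in your own account.

```python
def summarize_training_job(description):
    """Build a one-line summary from a DescribeTrainingJob response dict."""
    name = description["TrainingJobName"]
    status = description["TrainingJobStatus"]
    secondary = description.get("SecondaryStatus", "n/a")
    return "{}: {} ({})".format(name, status, secondary)


# In the notebook, the response would come from the SageMaker API, e.g.:
#   sm_client = boto3.client("sagemaker")
#   description = sm_client.describe_training_job(
#       TrainingJobName=estimator.latest_training_job.name
#   )
# Here a hand-written sample stands in for that response.
sample = {
    "TrainingJobName": "example-job",
    "TrainingJobStatus": "Stopped",
    "SecondaryStatus": "Stopped",
}

print(summarize_training_job(sample))  # example-job: Stopped (Stopped)
```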