## Prepare the dataset


Let's prepare the dataset for training. We will split it into 80% training data and 20% test data.

In [None]:
import pandas as pd
import time
import sagemaker
import boto3
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker import get_execution_role

role = get_execution_role()
sess = sagemaker.Session()
bucket = sess.default_bucket()
region = sess.boto_region_name

In [None]:
fraud_fg_name = "auto-fraud"
fraud_feature_group = FeatureGroup(name=fraud_fg_name, sagemaker_session=sess)

fraud_query = fraud_feature_group.athena_query()
fraud_table = fraud_query.table_name

In [None]:
# Athena query that selects every column from the feature group's offline store table.
query_string = f'SELECT * FROM "{fraud_table}"'

# Run the Athena query and load the result into a pandas DataFrame.
fraud_query.run(query_string=query_string, output_location=f"s3://{bucket}/query_results/")
fraud_query.wait()
dataset = fraud_query.as_dataframe()

In [None]:
dataset.head(5)

In [None]:
train = dataset.sample(frac=0.80, random_state=0)
test = dataset.drop(train.index)
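The split above relies on `sample` picking 80% of the rows and `drop` removing exactly those rows by index label, which only gives a clean 80/20 partition when the index is unique. A minimal sketch of that behavior on a small synthetic frame (the column names and values here are made up for illustration and are not from the fraud dataset):

```python
import pandas as pd

# Synthetic stand-in for the real dataset: 10 rows with a default RangeIndex.
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Same pattern as in the notebook: sample 80% of rows, keep the rest as test.
train_part = df.sample(frac=0.80, random_state=0)
test_part = df.drop(train_part.index)

# The two parts should not share any index label and should cover every row.
shared = train_part.index.intersection(test_part.index)
print(len(train_part), len(test_part), len(shared))
```

If the DataFrame had duplicate index labels, `drop` would remove every row sharing a sampled label, so calling `reset_index(drop=True)` before splitting is a reasonable precaution.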

In [None]:
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
dataset.to_csv("dataset.csv", index=True)
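When SageMaker's built-in XGBoost algorithm reads CSV input, it expects the label in the first column and no header row, whereas the `to_csv` calls above write a header and keep the original column order. Depending on how the feature group is laid out, some reshaping may therefore be needed before uploading. The following is a minimal sketch under stated assumptions: the label column name `fraud` is hypothetical, and `to_xgboost_csv` is an illustrative helper, not part of any library. The three metadata columns are the ones Feature Store adds to the offline store; any other non-numeric columns (for example an event time or record identifier) would also need to be dropped or encoded.

```python
import pandas as pd

LABEL_COLUMN = "fraud"  # hypothetical label name; replace with the real one
# Metadata columns that Feature Store appends to offline store tables.
METADATA_COLUMNS = ["write_time", "api_invocation_time", "is_deleted"]


def to_xgboost_csv(df, label_column, drop_columns=()):
    """Return CSV text with the label first, no header, and no index."""
    # Drop the label and any unwanted columns from the feature set.
    features = df.drop(columns=[label_column, *drop_columns], errors="ignore")
    # Put the label in front of the remaining feature columns.
    ordered = pd.concat([df[label_column], features], axis=1)
    return ordered.to_csv(index=False, header=False)


# Tiny synthetic example; the values are made up.
sample = pd.DataFrame(
    {"amount": [10.0, 250.0], "fraud": [0, 1], "write_time": ["t1", "t2"]}
)
csv_text = to_xgboost_csv(sample, LABEL_COLUMN, METADATA_COLUMNS)
print(csv_text)
```

Passing a file path to `to_csv` instead of returning text would write the file directly, in the same way the notebook writes `train.csv` and `test.csv`.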

Write the train and test data to S3.


In [None]:
# Initialize the boto3 S3 client in the notebook's region.
boto3.setup_default_session(region_name=region)
s3_client = boto3.client("s3", region_name=region)

s3_client.upload_file(
    Filename="train.csv",
    Bucket=bucket,
    Key="data-preparation-using-amazon-sagemaker-and-glue-databrew/Results/DataSet/train/train.csv",
)
s3_client.upload_file(
    Filename="test.csv",
    Bucket=bucket,
    Key="data-preparation-using-amazon-sagemaker-and-glue-databrew/Results/DataSet/test/test.csv",
)

In [None]:
train.head(5)

In [None]:
test.head(5)

## Train a model using XGBoost

Once the training and test datasets have been persisted in S3, you can start training a model by choosing which SageMaker Estimator to use. For this guide, you will use the open source XGBoost framework to train your model. This estimator is accessed via the SageMaker SDK, but mirrors the open source version of the XGBoost Python package. Any functionality provided by the XGBoost Python package can be implemented in your training script.

In [None]:
from sagemaker.debugger import Rule, rule_configs

# Reuse sess, bucket, role, and region from the setup cell above.
s3_output_location = "s3://{}/{}/{}".format(
    bucket, "data-preparation-using-amazon-sagemaker-and-glue-databrew/Results", "xgboost_model"
)

# Look up the SageMaker-managed XGBoost container image for this region.
container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
print(container)

xgb_model = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    volume_size=5,  # EBS volume size in GB (named train_volume_size in SDK v1)
    output_path=s3_output_location,
    sagemaker_session=sess,
    # Generate an XGBoost training report with SageMaker Debugger.
    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())],
)

### Set the hyperparameters
These are the parameters that will be passed to the training job in order to train the model. Although they are all defined as “hyperparameters” here, they can encompass XGBoost’s Learning Task Parameters, Tree Booster Parameters, or any other parameters you’d like to configure for XGBoost.

In [None]:
xgb_model.set_hyperparameters(objective="binary:logistic", num_round=100)
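To make the grouping described above concrete, the hyperparameters can be organized by category before they are handed to the estimator. This is only a sketch: apart from `objective` and `num_round`, the parameter values below are illustrative assumptions rather than tuned settings from this guide, and `merge_hyperparameters` is a hypothetical helper.

```python
# Learning task parameters: what is optimized and how it is evaluated.
learning_task_params = {"objective": "binary:logistic", "eval_metric": "auc"}

# Tree booster parameters: illustrative, untuned values.
tree_booster_params = {"max_depth": 5, "eta": 0.2, "subsample": 0.8, "min_child_weight": 1}

# Number of boosting rounds, as set in the notebook.
general_params = {"num_round": 100}


def merge_hyperparameters(*groups):
    """Merge parameter groups, refusing silently overwritten duplicates."""
    merged = {}
    for group in groups:
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate hyperparameters: {sorted(overlap)}")
        merged.update(group)
    return merged


hyperparameters = merge_hyperparameters(
    learning_task_params, tree_booster_params, general_params
)
print(hyperparameters)
```

In the notebook, the merged dictionary could then be applied with `xgb_model.set_hyperparameters(**hyperparameters)`.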

### Create and fit the estimator
Use the TrainingInput class to configure a data input flow for training. The following code configures TrainingInput objects for the training data and for the test data you uploaded to Amazon S3; the test data is used here as the validation channel.

In [None]:
from sagemaker.inputs import TrainingInput

prefix = "data-preparation-using-amazon-sagemaker-and-glue-databrew/Results"

train_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "DataSet/train/train.csv"), content_type="csv"
)
validation_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "DataSet/test/test.csv"), content_type="csv"
)
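The S3 URIs given to TrainingInput have to match the keys used when the files were uploaded, otherwise the training job will not find its data. One way to reduce the chance of a mismatch is to build both from the same parts. The sketch below uses two hypothetical helpers, `s3_key` and `s3_uri`, and a placeholder bucket name; neither helper is part of boto3 or the SageMaker SDK.

```python
import posixpath

RESULTS_PREFIX = "data-preparation-using-amazon-sagemaker-and-glue-databrew/Results"


def s3_key(*parts):
    """Join path parts into an S3 object key without leading or doubled slashes."""
    return posixpath.join(*(part.strip("/") for part in parts if part))


def s3_uri(bucket_name, *parts):
    """Build an s3:// URI for the object identified by the given key parts."""
    return f"s3://{bucket_name}/{s3_key(*parts)}"


# "example-bucket" is a placeholder; the notebook uses sess.default_bucket().
train_key = s3_key(RESULTS_PREFIX, "DataSet/train/train.csv")
train_uri = s3_uri("example-bucket", RESULTS_PREFIX, "DataSet/train/train.csv")
print(train_key)
print(train_uri)
```

Here `train_key` is the value that would go to `upload_file(Key=...)`, and `train_uri` is the value that would go to `TrainingInput`.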

In [None]:
xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)