{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the dataset\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets train the dataset. We will split the dataset into 80% train data and the rest 20% test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import time\n",
    "import sagemaker\n",
    "import boto3\n",
    "from sagemaker.feature_store.feature_group import FeatureGroup\n",
    "from sagemaker import get_execution_role\n",
    "\n",
    "role = get_execution_role()\n",
    "sess = sagemaker.Session()\n",
    "bucket = sess.default_bucket()  \n",
    "region = sagemaker.Session().boto_region_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fraud_fg_name = f\"auto-fraud\"\n",
    "fraud_feature_group = FeatureGroup(name=fraud_fg_name, sagemaker_session=sess)\n",
    "\n",
    "fraud_query = fraud_feature_group.athena_query()\n",
    "fraud_table = fraud_query.table_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Athena query\n",
    "query_string = 'SELECT * FROM \"'+fraud_table+'\"'\n",
    "\n",
    "# run Athena query. The output is loaded to a Pandas dataframe.\n",
    "dataset = pd.DataFrame()\n",
    "fraud_query.run(query_string=query_string, output_location='s3://'+bucket+'/query_results/')\n",
    "fraud_query.wait()\n",
    "dataset = fraud_query.as_dataframe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = dataset.sample(frac=0.80, random_state=0)\n",
    "test = dataset.drop(train.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.to_csv(\"train.csv\", index=False)\n",
    "test.to_csv(\"test.csv\", index=False)\n",
    "dataset.to_csv(\"dataset.csv\", index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write train, test data to S3\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize boto3 client\n",
    "\n",
    "boto3.setup_default_session(region_name=region)\n",
    "s3_client = boto3.client(\"s3\", region_name=region)\n",
    "\n",
    "s3_client.upload_file(\n",
    "    Filename=\"train.csv\", Bucket=bucket, Key=\"data-preparation-using-amazon-sagemaker-and-glue-databrew/Results/DataSet/train/train.csv\"\n",
    ")\n",
    "s3_client.upload_file(Filename=\"test.csv\", Bucket=bucket, Key=\"data-preparation-using-amazon-sagemaker-and-glue-databrew/Results/DataSet/test/test.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train a model using XGBoost"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the training and test datasets have been persisted in S3, you can start training a model by defining which SageMaker Estimator you’d like to use. For this guide, you will use the XGBoost Open Source Framework to train your model. This estimator is accessed via the SageMaker SDK, but mirrors the open source version of the XGBoost Python package. Any functioanlity provided by the XGBoost Python package can be implemented in your training script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.debugger import Rule, rule_configs\n",
    "from sagemaker.session import TrainingInput\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "bucket = sess.default_bucket()  \n",
    "s3_output_location='s3://{}/{}/{}'.format(bucket, \"data-preparation-using-amazon-sagemaker-and-glue-databrew/Results\", 'xgboost_model')\n",
    "\n",
    "container=sagemaker.image_uris.retrieve(\"xgboost\", region, \"1.2-1\")\n",
    "print(container)\n",
    "\n",
    "xgb_model=sagemaker.estimator.Estimator(\n",
    "    image_uri=container,\n",
    "    role=role,\n",
    "    instance_count=1,\n",
    "    instance_type='ml.m4.xlarge',\n",
    "    train_volume_size=5,\n",
    "    output_path=s3_output_location,\n",
    "    sagemaker_session=sess,\n",
    "    rules=[Rule.sagemaker(rule_configs.create_xgboost_report())]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set the hyperparameters\n",
    "These are the parameters which will be sent to our training script in order to train the model. Although they are all defined as “hyperparameters” here, they can encompass XGBoost’s Learning Task Parameters, Tree Booster Parameters, or any other parameters you’d like to configure for XGBoost."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_model.set_hyperparameters(objective = \"binary:logistic\",num_round = 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create and fit the estimator\n",
    "Use the TrainingInput class to configure a data input flow for training. The following example code shows how to configure TrainingInput objects to use the training and validation datasets you uploaded to Amazon S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.session import TrainingInput\n",
    "\n",
    "train_input = TrainingInput(\n",
    "    \"s3://{}/{}/{}\".format(bucket, \"data-preparation-using-amazon-sagemaker-and-glue-databrew/Results\", \"DataSet/train/train.csv\"), content_type=\"csv\"\n",
    ")\n",
    "validation_input = TrainingInput(\n",
    "    \"s3://{}/{}/{}\".format(bucket, \"data-preparation-using-amazon-sagemaker-and-glue-databrew/Results\", \"DataSet/test/test.csv\"), content_type=\"csv\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xgb_model.fit({\"train\": train_input, \"validation\": validation_input}, wait=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-south-1:394103062818:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}