{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Scikit-learn Iris Classifier - Local Example\n",
    "\n",
    "_**Train and export a scikit-learn classifier for the [Iris data set](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data) dataset: Performing all storage and computation locally on the notebook.**_\n",
    "\n",
    "This notebook works well with the `Python 3 (Data science)` kernel on SageMaker Studio, or `conda_python 3` on classic SageMaker Notebook Instances.\n",
    "\n",
    "---\n",
    "\n",
    "The [Iris dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/) is hosted in the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and maintain 622 data sets.\n",
    "\n",
    ">❓*Can you figure out how to re-create this notebook's workflow using SageMaker more effectively?*\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. **[Prepare the Data](#Prepare-the-Data)**\n",
    "1. **[Data processing and training](#Data-processing-and-training)**\n",
    "1. **[Build and fit the Model](#Build-and-fit-the-Model)**\n",
    "1. **[Save the Trained Model](#Save-the-Trained-Model)**\n",
    "1. **[Explore Results](#Explore-Results)**\n",
    "\n",
    "See the accompanying **Instructions** notebook for more guidance!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import argparse\n",
    "import numpy as np\n",
    "import os\n",
    "import pandas as pd\n",
    "from sklearn.externals import joblib\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn import metrics\n",
    "import joblib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare the Data\n",
    "\n",
    "Now let's download the Iris data to your local directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = data = pd.read_csv('iris.data', \n",
    "                   names=['sepal length', 'sepal width', \n",
    "                          'petal length', 'petal width', \n",
    "                          'label'])\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#split the data into train and test\n",
    "train,test= np.split(data.sample(frac=1, random_state=22), [int(0.7 * len(data))])\n",
    "train.head()\n",
    "\n",
    "#write your csv files to the local\n",
    "train.to_csv(\"train.csv\")\n",
    "test.to_csv(\"test.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data processing and training\n",
    "- we need to convert the string labels to numeric labels to fit in our SkLearn model\n",
    "- we need to sperate out the features from target variable and define train, test and their labels\n",
    "- we also would like to standardise the features before fitting them into the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dictionary to encode labels to codes\n",
    "label_encode = {\n",
    "    'Iris-virginica': 0,\n",
    "    'Iris-versicolor': 1,\n",
    "    'Iris-setosa': 2\n",
    "}\n",
    "\n",
    "# Dictionary to convert codes to labels\n",
    "label_decode = {\n",
    "    0: 'Iris-virginica',\n",
    "    1: 'Iris-versicolor',\n",
    "    2: 'Iris-setosa'\n",
    "}\n",
    "\n",
    "# sperate out the features from target variable and define train, test and their labels\n",
    "train = pd.read_csv('train.csv',index_col=0, engine=\"python\")\n",
    "y_train= train['label'].map(label_encode)\n",
    "X_train =  train.drop([\"label\"], axis=1)\n",
    "    \n",
    "test = pd.read_csv('test.csv',index_col=0, engine=\"python\")\n",
    "y_test= test['label'].map(label_encode)\n",
    "X_test =  test.drop([\"label\"], axis=1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build and fit the Model\n",
    "\n",
    "The model chosen from the Scikit- learn classifiers, is the widely used a random forest model and takes the features and labels as input and returns the predicted lable or the probabilities (if chosen) as output.\n",
    "Scikit-learn makes fitting and evaluating the model straightforward enough\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#train the logistic regression model\n",
    "n_estimators= 100\n",
    "min_samples_leaf= 3\n",
    "model = RandomForestClassifier(\n",
    "        n_estimators=n_estimators,\n",
    "        min_samples_leaf=min_samples_leaf)\n",
    "model.fit(X_train, y_train)\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the Trained Model\n",
    "\n",
    "We use Joblib to save the model and then load it for prediction.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#use Joblib to save the model \n",
    "# see scikit learn documentation here:https://scikit-learn.org/stable/model_persistence.html\n",
    "joblib.dump(model, \"model.joblib\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's Explore Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the model using joblib\n",
    "loaded_model = joblib.load(\"model.joblib\")\n",
    "\n",
    "#get the data to predict\n",
    "result = loaded_model.predict(X_test)\n",
    "results=' | '.join([label_decode[t] for t in result])\n",
    "results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All done!\n"
   ]
  }
 ],
 "metadata": {
  "instance_type": "ml.t3.medium",
  "kernelspec": {
   "display_name": "Python 3 (Data Science)",
   "language": "python",
   "name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:ap-southeast-2:452832661640:image/datascience-1.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}