# Scikit-learn Iris Classifier - Local Example

_**Train and export a scikit-learn classifier for the [Iris data set](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data), performing all storage and computation locally on the notebook.**_

This notebook works well with the `Python 3 (Data Science)` kernel on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances.

---

The [Iris dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/) is hosted in the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which maintains 622 data sets.

>❓*Can you figure out how to re-create this notebook's workflow using SageMaker more effectively?*

## Contents

1. **[Prepare the Data](#Prepare-the-Data)**
1. **[Data processing and training](#Data-processing-and-training)**
1. **[Build and fit the Model](#Build-and-fit-the-Model)**
1. **[Save the Trained Model](#Save-the-Trained-Model)**
1. **[Explore Results](#Explore-Results)**

See the accompanying **Instructions** notebook for more guidance!

In [None]:
import joblib
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Prepare the Data

Now let's download the Iris data to your local directory.

In [None]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

In [None]:
data = pd.read_csv('iris.data',
                   names=['sepal length', 'sepal width',
                          'petal length', 'petal width',
                          'label'])
data.head()

In [None]:
# Shuffle the rows, then split 70/30 into train and test sets
shuffled = data.sample(frac=1, random_state=22)
n_train = int(0.7 * len(data))
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]
train.head()

# Write the train and test sets to local CSV files
train.to_csv("train.csv")
test.to_csv("test.csv")
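
A plain shuffle-and-cut split does not guarantee that each species appears in the same proportion in both parts. As a hedged alternative, the sketch below uses `train_test_split` with `stratify`. It assumes scikit-learn's bundled copy of Iris (`load_iris`) rather than the downloaded `iris.data`, purely so that it runs offline; the names `frame`, `train_part` and `test_part` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled Iris data so the sketch runs offline.
frame = load_iris(as_frame=True).frame  # four feature columns plus "target"

# 70/30 split that keeps the three classes balanced in both parts.
train_part, test_part = train_test_split(
    frame, test_size=0.3, random_state=22, stratify=frame["target"]
)

print(len(train_part), len(test_part))
print(test_part["target"].value_counts().sort_index().tolist())
```

The same call should work on the `data` frame above by passing `stratify=data["label"]`.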

## Data processing and training
- We need to convert the string labels to numeric labels to fit in our scikit-learn model.
- We need to separate the features from the target variable and define the train and test sets and their labels.
- We would also like to standardise the features before fitting the model.
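
The standardisation step in the last bullet could be sketched as follows. This is a minimal illustration, not part of the notebook's own pipeline: it assumes scikit-learn's `StandardScaler` and the bundled `load_iris` data, and the variable names are hypothetical. The scaler is fitted on the training features only, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: bundled Iris data stands in for train.csv / test.csv.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=22)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test data

# Each training feature now has (approximately) zero mean and unit variance.
print(np.round(X_tr_scaled.mean(axis=0), 6))
print(np.round(X_tr_scaled.std(axis=0), 6))
```

Tree-based models such as random forests are largely insensitive to feature scaling, so this step matters more if you later swap in a scale-sensitive model.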

In [None]:
# Dictionary to encode labels to codes
label_encode = {
    'Iris-virginica': 0,
    'Iris-versicolor': 1,
    'Iris-setosa': 2
}

# Dictionary to convert codes to labels
label_decode = {
    0: 'Iris-virginica',
    1: 'Iris-versicolor',
    2: 'Iris-setosa'
}

# Separate the features from the target variable and define the train and test sets and their labels
train = pd.read_csv('train.csv',index_col=0, engine="python")
y_train= train['label'].map(label_encode)
X_train =  train.drop(["label"], axis=1)
    
test = pd.read_csv('test.csv',index_col=0, engine="python")
y_test= test['label'].map(label_encode)
X_test =  test.drop(["label"], axis=1)


## Build and fit the Model

The model, chosen from the scikit-learn classifiers, is the widely used random forest. It takes the features and labels as input for training and returns the predicted label, or the class probabilities if requested, as output.
Scikit-learn makes fitting and evaluating the model straightforward.
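
To illustrate the difference between predicted labels and probabilities, here is a small sketch. It assumes the bundled `load_iris` data and fits on the full data set only to keep the example short; `random_state=0` and the names `clf` and `sample` are illustrative choices, not part of this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: bundled Iris data, fitted on all rows for brevity.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=3, random_state=0)
clf.fit(X, y)

sample = X[:5]
labels = clf.predict(sample)        # one class label per row
probas = clf.predict_proba(sample)  # one probability per class per row

print(labels.shape, probas.shape)
# The predicted label is the class with the highest probability.
print(np.array_equal(clf.classes_[probas.argmax(axis=1)], labels))
```

The columns of `predict_proba` follow the order of `clf.classes_`, which is worth checking before mapping them back to species names.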


In [None]:
# Train the random forest model
n_estimators= 100
min_samples_leaf= 3
model = RandomForestClassifier(
        n_estimators=n_estimators,
        min_samples_leaf=min_samples_leaf)
model.fit(X_train, y_train)
model

## Save the Trained Model

We use Joblib to save the model and then load it for prediction.


In [None]:
# Use joblib to save the model
# See the scikit-learn documentation: https://scikit-learn.org/stable/model_persistence.html
joblib.dump(model, "model.joblib")

## Explore Results

In [None]:
# load the model using joblib
loaded_model = joblib.load("model.joblib")

# Predict labels for the test set
result = loaded_model.predict(X_test)
results=' | '.join([label_decode[t] for t in result])
results
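
Reading a long string of predicted labels says little about how good the model is. A hedged sketch of a quantitative check with `sklearn.metrics` follows. It is self-contained and assumes the bundled `load_iris` data, a stratified split, and `random_state=0`, none of which come from this notebook.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data stands in for train.csv / test.csv.
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=22, stratify=iris.target
)

clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=3, random_state=0)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Fraction of test rows classified correctly.
accuracy = metrics.accuracy_score(y_te, y_pred)
# Rows are true classes, columns are predicted classes.
confusion = metrics.confusion_matrix(y_te, y_pred)

print(f"accuracy: {accuracy:.3f}")
print(confusion)
print(metrics.classification_report(y_te, y_pred, target_names=iris.target_names))
```

With the variables defined in this notebook, the equivalent calls would be `metrics.accuracy_score(y_test, result)` and `metrics.confusion_matrix(y_test, result)`.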

All done!
