# Scikit-learn Iris Classifier - Local Example

> *This notebook works well with the `Python 3 (Data Science 3.0)` kernel on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances*

In this example, we'll train a random forest model to classify iris plants by species based on measurements of their flowers, using the well-known [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris).

This "local" notebook performs all data storage and computation directly on the notebook instance itself.

> ❓ *Can you figure out how to re-create the workflow more effectively using SageMaker training and inference?*

See the accompanying [**Instructions**](Instructions.ipynb) notebook for more guidance!

---

## Contents

1. **[Prepare the data](#Prepare-the-data)**
1. **[Data processing and training](#Data-processing-and-training)**
1. **[Build and fit the model](#Build-and-fit-the-model)**
1. **[Save the trained model](#Save-the-trained-model)**
1. **[Load and use the model](#Load-and-use-the-model)**
1. **[Next steps](#Next-steps)**

As usual, before getting started we'll import the necessary libraries, which should already be installed:

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import os  # File system utilities (used below to create the data folders)

# External Dependencies:
import joblib  # Utilities for saving and re-loading models
import numpy as np  # Matrix/vector math tools
import pandas as pd  # DataFrame (tabular data) utilities
from sklearn import metrics  # Classification model scoring utilities
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Prepare the data

The [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris) is available on the popular [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which also hosts hundreds of others.

In this example though, we'll download it from an Amazon-maintained copy on Amazon S3 for speed and convenience:

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/iris/iris.data ./data/iris.csv

We can load and visualize the raw data with Pandas, but need to specify the field names manually because the source data doesn't have a header row:

In [None]:
data = pd.read_csv(
    "data/iris.csv",
    names=["sepal length", "sepal width", "petal length", "petal width", "label"],
)
data.head()

The flowers in this dataset fall into 3 classes, each equally represented, as shown below:

In [None]:
label_counts = data["label"].value_counts()
class_names = sorted(label_counts.index)
print(f"class_names: {class_names}")
label_counts

Since class imbalance is not an issue, we can randomly shuffle the dataset to split it into separate partitions for model training and evaluation:

In [None]:
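# Shuffle the rows, then slice by position: the first 70% for training, the rest for testing.
# (Same result as np.split, without relying on NumPy's handling of DataFrames.)
shuffled = data.sample(frac=1, random_state=22)
n_train = int(0.7 * len(shuffled))
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]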

os.makedirs("data/train", exist_ok=True)
train.to_csv("data/train/train.csv", index=False)
os.makedirs("data/test", exist_ok=True)
test.to_csv("data/test/test.csv", index=False)
train.head()
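
When classes are *not* evenly balanced, a stratified split is a common alternative to a plain shuffle. The sketch below uses the `train_test_split` function already imported at the top of this notebook; the `toy` DataFrame is a hypothetical stand-in for `data`, and the 70/30 ratio and seed simply mirror the cell above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for `data`: 3 classes with 10 rows each.
toy = pd.DataFrame(
    {
        "petal length": [float(i) for i in range(30)],
        "label": ["a"] * 10 + ["b"] * 10 + ["c"] * 10,
    }
)

# stratify= asks scikit-learn to keep each class's share the same in both partitions.
toy_train, toy_test = train_test_split(
    toy, train_size=0.7, random_state=22, stratify=toy["label"]
)

print(toy_train["label"].value_counts().to_dict())
print(toy_test["label"].value_counts().to_dict())
```

For the balanced Iris data a plain shuffle is usually good enough, so this is optional here.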

## Data processing and training

Our model will need zero-based numeric class IDs rather than human-readable names, and separate inputs for the dependent/target variable (the class) versus the independent/input features.

The cell below loads and prepares the raw training and test files:

In [None]:
# Dictionary to map class names to numeric IDs:
label2idx = {name: ix for ix, name in enumerate(class_names)}
# Dictionary to map numeric IDs to class names:
idx2label = {ix: name for ix, name in enumerate(class_names)}

train = pd.read_csv("data/train/train.csv")
y_train = train["label"].map(label2idx)
X_train = train.drop(columns=["label"])

test = pd.read_csv("data/test/test.csv")
y_test = test["label"].map(label2idx)
X_test = test.drop(columns=["label"])
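
One thing to watch with `.map(label2idx)`: pandas returns `NaN` for any label missing from the dictionary instead of raising an error. Below is a minimal sketch of a sanity check. The class names are assumed to match those in the UCI file, and the misspelled third label is hypothetical input added for illustration:

```python
import pandas as pd

# Assumed class names, as they appear in the UCI iris.data file.
class_names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
label2idx = {name: ix for ix, name in enumerate(class_names)}

# Hypothetical labels: the last one is deliberately misspelled (lower-case "i").
labels = pd.Series(["Iris-setosa", "Iris-virginica", "iris-setosa"])

# Labels that are not keys of the dictionary become NaN.
mapped = labels.map(label2idx)
unmapped = labels[mapped.isna()]

if len(unmapped) > 0:
    print(f"Labels with no class ID: {sorted(unmapped.unique())}")
```

Running a check like this on `train["label"]` and `test["label"]` before fitting can catch label typos early.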

## Build and fit the model

To classify the examples we'll use the [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) implementation from Scikit-Learn.

We'll first configure the model hyperparameters, and then fit it to the training data by calling `.fit()`:

In [None]:
# Set up model parameters:
n_estimators = 100
min_samples_leaf = 3
model = RandomForestClassifier(n_estimators=n_estimators, min_samples_leaf=min_samples_leaf)

# Train (fit) the model to the data:
model.fit(X_train, y_train)
model
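
Random forests are randomized, so repeated runs of the cell above can give slightly different models. As a hedged sketch (the `random_state` argument is an addition, not part of the original cell, and scikit-learn's bundled copy of Iris stands in for `X_train`/`y_train`), fixing the seed makes the fit repeatable, and `feature_importances_` gives a rough view of which measurements the forest relies on:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: use scikit-learn's bundled Iris copy so this sketch runs on its own.
X, y = load_iris(return_X_y=True, as_frame=True)

# Two forests with identical hyperparameters and the same seed.
params = dict(n_estimators=100, min_samples_leaf=3, random_state=0)
model_a = RandomForestClassifier(**params).fit(X, y)
model_b = RandomForestClassifier(**params).fit(X, y)

# With a fixed seed, both fits should produce the same predictions.
print((model_a.predict(X) == model_b.predict(X)).all())

# Impurity-based importances, one value per input column.
importances = dict(zip(X.columns, model_a.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```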

## Save the trained model

As [recommended by Scikit-Learn](https://scikit-learn.org/stable/model_persistence.html), we'll use [Joblib](https://joblib.readthedocs.io/en/latest/) to save the trained model and reload it later for prediction:

In [None]:
joblib.dump(model, "data/model.joblib")
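
As a quick check that nothing is lost in the round trip, the sketch below saves and reloads a small model and compares predictions before and after. The random `X_demo`/`y_demo` data, the temporary folder, and the tiny forest are all assumptions made so the example runs on its own:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical demo data: 60 rows, 4 features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary folder and load the model straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original does.
same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Note that joblib files are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that saved them.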

## Load and use the model

Finally, we can re-load the model from the joblib file and try it out on the test data:

In [None]:
# Load the model using joblib:
loaded_model = joblib.load("data/model.joblib")

# Run prediction on the test set:
y_pred = loaded_model.predict(X_test)

results = test.copy()
results["predicted label"] = pd.Series(y_pred).map(lambda idx: idx2label[idx])
results["predicted correct"] = results["label"] == results["predicted label"]
n_correct = results["predicted correct"].sum()
print(
    "Correctly predicted {} out of {} examples ({:.2%})".format(
        n_correct, len(results), n_correct / len(results)
    )
)

results
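
The `metrics` module imported at the top of this notebook offers more detail than a single accuracy figure. The sketch below uses small hand-written `y_true_demo`/`y_pred_demo` arrays (hypothetical values, not real model output); in the notebook you could pass `y_test` and `y_pred` instead:

```python
from sklearn import metrics

# Hypothetical class IDs for six examples; one class-1 example is misclassified as class 2.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]
demo_names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Overall fraction of correct predictions.
accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)

print(f"Accuracy: {accuracy:.2%}")
print(cm)
# Per-class precision, recall and F1 score.
print(metrics.classification_report(y_true_demo, y_pred_demo, target_names=demo_names))
```

The confusion matrix is especially useful for seeing *which* species get mistaken for each other.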

## Next steps

Here we demonstrated a basic tabular data classification use case, for which you should have seen relatively fast model training and high accuracy.

Next, head on over to the [Instructions notebook](Instructions.ipynb) for guidance on migrating this workflow to a SageMaker training job and model inference!