{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0ec7f76d",
   "metadata": {},
   "source": [
    "## Training Notebook\n",
    "\n",
    "This notebook illustrates training of a simple model to classify digits using the MNIST dataset. This code is used to train the model included with the templates. This is meant to be a starter model to show you how to set up Serverless applications to do inferences. For deeper understanding of how to train a good model for MNIST, we recommend literature from the [MNIST website](http://yann.lecun.com/exdb/mnist/). The dataset is made available under a [Creative Commons Attribution-Share Alike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "42e8a263",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll use scikit-learn to load the dataset\n",
    "\n",
    "! pip install -q scikit-learn==0.23.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e8d2a791",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the mnist dataset\n",
    "\n",
    "from sklearn.datasets import fetch_openml\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X, y = fetch_openml('mnist_784', return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "067fda9a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Next, let's add code for deskewing images (we will use this to improve accuracy)\n",
    "# This code comes from https://fsix.github.io/mnist/Deskewing.html\n",
    "\n",
    "from scipy.ndimage import interpolation\n",
    "\n",
    "def moments(image):\n",
    "    c0, c1 = np.mgrid[:image.shape[0], :image.shape[1]]\n",
    "    img_sum = np.sum(image)\n",
    "    \n",
    "    m0 = np.sum(c0 * image) / img_sum\n",
    "    m1 = np.sum(c1 * image) / img_sum\n",
    "    m00 = np.sum((c0-m0)**2 * image) / img_sum\n",
    "    m11 = np.sum((c1-m1)**2 * image) / img_sum\n",
    "    m01 = np.sum((c0-m0) * (c1-m1) * image) / img_sum\n",
    "    \n",
    "    mu_vector = np.array([m0,m1])\n",
    "    covariance_matrix = np.array([[m00, m01],[m01, m11]])\n",
    "    \n",
    "    return mu_vector, covariance_matrix\n",
    "\n",
    "\n",
    "def deskew(image):\n",
    "    c, v = moments(image)\n",
    "    alpha = v[0,1] / v[0,0]\n",
    "    affine = np.array([[1,0], [alpha,1]])\n",
    "    ocenter = np.array(image.shape) / 2.0\n",
    "    offset = c - np.dot(affine, ocenter)\n",
    "\n",
    "    return interpolation.affine_transform(image, affine, offset=offset)\n",
    "\n",
    "\n",
    "def deskew_images(images):\n",
    "    output_images = []\n",
    "    \n",
    "    for image in images:\n",
    "        output_images.append(deskew(image.reshape(28, 28)).flatten())\n",
    "    \n",
    "    return np.array(output_images)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "831eba31",
   "metadata": {},
   "source": [
    "## XGBoost Model Training\n",
    "\n",
    "For this example, we will train a simple XGBoost classifier using scikit-learn to classify the MNIST digits. We will then freeze the model using `save_model()`. This is same as the starter model file included with the SAM templates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "3348a4cb",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip -q install xgboost==1.1.1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "c726ae17",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Using versions: sklearn 0.23.2 xgboost 1.1.1\n",
      "Accuracy without deskewing: 0.9782\n"
     ]
    }
   ],
   "source": [
    "import sklearn\n",
    "import xgboost as xgb\n",
    "import numpy as np\n",
    "\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "print('Using versions: sklearn', sklearn.__version__, 'xgboost', xgb.__version__)\n",
    "\n",
    "# Remove the \"tree_method\" and \"gpu_id\" params if you don't have CUDA-compatible GPU set up\n",
    "xgb_model = xgb.XGBClassifier(n_jobs=-1, tree_method='gpu_hist', gpu_id=0)\n",
    "xgb_model.fit(X_train, y_train, eval_metric='error')\n",
    "\n",
    "accuracy = accuracy_score(y_test, xgb_model.predict(np.array(X_test)))\n",
    "\n",
    "print('Accuracy without deskewing:', accuracy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "3827e72a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy with deskewing: 0.9804\n"
     ]
    }
   ],
   "source": [
    "# Next, we will try with deskewing to see if accuracy improves\n",
    "\n",
    "# Remove the \"tree_method\" and \"gpu_id\" params if you don't have CUDA-compatible GPU set up\n",
    "xgb_model = xgb.XGBClassifier(n_jobs=-1, tree_method='gpu_hist', gpu_id=0)\n",
    "xgb_model.fit(deskew_images(X_train), y_train, eval_metric='error')\n",
    "\n",
    "accuracy = accuracy_score(y_test, xgb_model.predict(deskew_images(X_test)))\n",
    "\n",
    "print('Accuracy with deskewing:', accuracy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "38d22e00",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save model to the disk\n",
    "xgb_model.save_model('xgb_digit_classifier.model')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e3a1f67",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}