{ "cells": [ { "cell_type": "markdown", "id": "0ec7f76d", "metadata": {}, "source": [ "## Training Notebook\n", "\n", "This notebook illustrates training of a simple model to classify digits using the MNIST dataset. This code is used to train the model included with the templates. This is meant to be a starter model to show you how to set up Serverless applications to do inferences. For deeper understanding of how to train a good model for MNIST, we recommend literature from the [MNIST website](http://yann.lecun.com/exdb/mnist/). The dataset is made available under a [Creative Commons Attribution-Share Alike 3.0](https://creativecommons.org/licenses/by-sa/3.0/) license." ] }, { "cell_type": "code", "execution_count": 4, "id": "42e8a263", "metadata": {}, "outputs": [], "source": [ "# We'll use scikit-learn to load the dataset\n", "\n", "! pip install -q scikit-learn==0.23.2" ] }, { "cell_type": "code", "execution_count": 12, "id": "e8d2a791", "metadata": {}, "outputs": [], "source": [ "# Load the mnist dataset\n", "\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split\n", "\n", "X, y = fetch_openml('mnist_784', return_X_y=True)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000)" ] }, { "cell_type": "code", "execution_count": 13, "id": "067fda9a", "metadata": {}, "outputs": [], "source": [ "# Next, let's add code for deskewing images (we will use this to improve accuracy)\n", "# This code comes from https://fsix.github.io/mnist/Deskewing.html\n", "\n", "from scipy.ndimage import interpolation\n", "\n", "def moments(image):\n", " c0, c1 = np.mgrid[:image.shape[0], :image.shape[1]]\n", " img_sum = np.sum(image)\n", " \n", " m0 = np.sum(c0 * image) / img_sum\n", " m1 = np.sum(c1 * image) / img_sum\n", " m00 = np.sum((c0-m0)**2 * image) / img_sum\n", " m11 = np.sum((c1-m1)**2 * image) / img_sum\n", " m01 = np.sum((c0-m0) * (c1-m1) * image) / img_sum\n", " \n", " mu_vector = np.array([m0,m1])\n", " covariance_matrix = np.array([[m00, m01],[m01, m11]])\n", " \n", " return mu_vector, covariance_matrix\n", "\n", "\n", "def deskew(image):\n", " c, v = moments(image)\n", " alpha = v[0,1] / v[0,0]\n", " affine = np.array([[1,0], [alpha,1]])\n", " ocenter = np.array(image.shape) / 2.0\n", " offset = c - np.dot(affine, ocenter)\n", "\n", " return interpolation.affine_transform(image, affine, offset=offset)\n", "\n", "\n", "def deskew_images(images):\n", " output_images = []\n", " \n", " for image in images:\n", " output_images.append(deskew(image.reshape(28, 28)).flatten())\n", " \n", " return np.array(output_images)\n" ] }, { "cell_type": "markdown", "id": "831eba31", "metadata": {}, "source": [ "## XGBoost Model Training\n", "\n", "For this example, we will train a simple XGBoost classifier using scikit-learn to classify the MNIST digits. We will then freeze the model using `save_model()`. This is same as the starter model file included with the SAM templates." ] }, { "cell_type": "code", "execution_count": 14, "id": "3348a4cb", "metadata": {}, "outputs": [], "source": [ "! pip -q install xgboost==1.1.1" ] }, { "cell_type": "code", "execution_count": 17, "id": "c726ae17", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using versions: sklearn 0.23.2 xgboost 1.1.1\n", "Accuracy without deskewing: 0.9782\n" ] } ], "source": [ "import sklearn\n", "import xgboost as xgb\n", "import numpy as np\n", "\n", "from sklearn.metrics import accuracy_score\n", "\n", "print('Using versions: sklearn', sklearn.__version__, 'xgboost', xgb.__version__)\n", "\n", "# Remove the \"tree_method\" and \"gpu_id\" params if you don't have CUDA-compatible GPU set up\n", "xgb_model = xgb.XGBClassifier(n_jobs=-1, tree_method='gpu_hist', gpu_id=0)\n", "xgb_model.fit(X_train, y_train, eval_metric='error')\n", "\n", "accuracy = accuracy_score(y_test, xgb_model.predict(np.array(X_test)))\n", "\n", "print('Accuracy without deskewing:', accuracy)" ] }, { "cell_type": "code", "execution_count": 18, "id": "3827e72a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy with deskewing: 0.9804\n" ] } ], "source": [ "# Next, we will try with deskewing to see if accuracy improves\n", "\n", "# Remove the \"tree_method\" and \"gpu_id\" params if you don't have CUDA-compatible GPU set up\n", "xgb_model = xgb.XGBClassifier(n_jobs=-1, tree_method='gpu_hist', gpu_id=0)\n", "xgb_model.fit(deskew_images(X_train), y_train, eval_metric='error')\n", "\n", "accuracy = accuracy_score(y_test, xgb_model.predict(deskew_images(X_test)))\n", "\n", "print('Accuracy with deskewing:', accuracy)" ] }, { "cell_type": "code", "execution_count": 19, "id": "38d22e00", "metadata": {}, "outputs": [], "source": [ "# Save model to the disk\n", "xgb_model.save_model('xgb_digit_classifier.model')" ] }, { "cell_type": "code", "execution_count": null, "id": "5e3a1f67", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }