{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Forest Inference Library (FIL)\n", "The forest inference library is used to load saved forest models of xgboost, lightgbm or protobuf and perform inference on them. It can be used to perform both classification and regression. In this notebook, we'll begin by fitting a model with XGBoost and saving it. We'll then load the saved model into FIL and use it to infer on new data.\n", "\n", "FIL works in the same way with lightgbm and protobuf model as well.\n", "\n", "The model accepts both numpy arrays and cuDF dataframes. In order to convert your dataset to cudf format please read the cudf documentation on https://rapidsai.github.io/projects/cudf/en/latest/. \n", "\n", "For additional information on the forest inference library please refer to the documentation on https://rapidsai.github.io/projects/cuml/en/latest/index.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!conda install -c rapidsai -c nvidia -c conda-forge \\\n", " -c defaults rapids=0.13 python=3.6" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "conda install -c conda-forge xgboost" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import os\n", "\n", "from cuml.test.utils import array_equal\n", "from cuml.utils.import_utils import has_xgboost\n", "\n", "from sklearn.datasets import make_classification\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.model_selection import train_test_split\n", " \n", "from cuml import ForestInference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check for xgboost\n", "Checks if xgboost is present, if not then it throws an error." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if has_xgboost():\n", " import xgboost as xgb\n", "else:\n", " raise ImportError(\"Please install xgboost using the conda package,\"\n", " \" Use conda install -c conda-forge xgboost \"\n", " \"command to install xgboost\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train helper function\n", "Defines a simple function that trains the XGBoost model and returns the trained model.\n", "\n", "For additional information on the xgboost library please refer to the documentation on : \n", "https://xgboost.readthedocs.io/en/latest/parameter.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train_xgboost_model(X_train, y_train,\n", " num_rounds, model_path):\n", " # set the xgboost model parameters\n", " params = {'silent': 1, 'eval_metric':'error',\n", " 'objective':'binary:logistic',\n", " 'max_depth': 25}\n", " dtrain = xgb.DMatrix(X_train, label=y_train)\n", " # train the xgboost model\n", " bst = xgb.train(params, dtrain, num_rounds)\n", "\n", " # save the trained xgboost model\n", " bst.save_model(model_path)\n", "\n", " return bst" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict helper function\n", "Uses the trained xgboost model to perform prediction and return the labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def predict_xgboost_model(X_validation, y_validation, xgb_model):\n", "\n", " # predict using the xgboost model\n", " dvalidation = xgb.DMatrix(X_validation, label=y_validation)\n", " xgb_preds = xgb_model.predict(dvalidation)\n", "\n", " # convert the predicted values from xgboost into class labels\n", " xgb_preds = np.around(xgb_preds)\n", " return xgb_preds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Define parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_rows = 10000\n", "n_columns = 100\n", "n_categories = 2\n", "random_state = np.random.RandomState(43210)\n", "\n", "# enter path to the directory where the trained model will be saved\n", "model_path = 'xgb.model'\n", "\n", "# num of iterations for which the model is trained\n", "num_rounds = 15" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create the dataset\n", "X, y = make_classification(n_samples=n_rows,\n", " n_features=n_columns,\n", " n_informative=int(n_columns/5),\n", " n_classes=n_categories,\n", " random_state=random_state)\n", "train_size = 0.8\n", "\n", "# convert the dataset to np.float32\n", "X = X.astype(np.float32)\n", "y = y.astype(np.float32)\n", "\n", "# split the dataset into training and validation splits\n", "X_train, X_validation, y_train, y_validation = train_test_split(\n", " X, y, train_size=train_size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train and Predict the model\n", "Invoke the function to train the model and get predictions so that we can validate them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# train the xgboost model\n", "xgboost_model = train_xgboost_model(X_train, y_train,\n", " num_rounds, model_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# test the xgboost model\n", "trained_model_preds = predict_xgboost_model(X_validation,\n", " y_validation,\n", " xgboost_model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Forest Inference Library (FIL)\n", "\n", "The load function of the ForestInference class accepts the following parameters:\n", "\n", " filename : str\n", " Path to saved model file in a treelite-compatible format\n", " (See https://treelite.readthedocs.io/en/latest/treelite-api.html\n", " output_class : bool\n", " If true, return a 1 or 0 depending on whether the raw prediction\n", " exceeds the threshold. If False, just return the raw prediction.\n", " threshold : float\n", " Cutoff value above which a prediction is set to 1.0\n", " Only used if the model is classification and output_class is True\n", " algo : string name of the algo from (from algo_t enum)\n", " 'NAIVE' - simple inference using shared memory\n", " 'TREE_REORG' - similar to naive but trees rearranged to be more\n", " coalescing-friendly\n", " 'BATCH_TREE_REORG' - similar to TREE_REORG but predicting\n", " multiple rows per thread block\n", " model_type : str\n", " Format of saved treelite model to load.\n", " Can be 'xgboost', 'lightgbm', or 'protobuf'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loaded the saved model\n", "Use FIL to load the saved xgboost model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fm = ForestInference.load(filename=model_path,\n", " algo='BATCH_TREE_REORG',\n", " output_class=True,\n", " threshold=0.50,\n", " model_type='xgboost')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predict using FIL" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "# perform prediction on the model loaded from path\n", "fil_preds = fm.predict(X_validation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate results\n", "\n", "Verify the predictions for the original and FIL model match." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"The shape of predictions obtained from xgboost : \",(trained_model_preds).shape)\n", "print(\"The shape of predictions obtained from FIL : \",(fil_preds).shape)\n", "print(\"Are the predictions for xgboost and FIL the same : \" , array_equal(trained_model_preds, fil_preds))" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }