{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# An example notebook in R\n",
    "\n",
    "This is an example notebook in R to show how to use the `sagemaker-run-notebook` tool to run R-based notebooks.\n",
    "\n",
    "See the [ReadMe file](README.md) in this directory for information on how to run this notebook.\n",
    "\n",
    "The demonstration here is derived from an example in Winston Chang's __R Graphics Cookbook__ in the section \"Adding Fitted Regression Lines.\"\n",
    "\n",
    "We use `ggplot2` for plotting the dataset, so you'll need to build a container that installs that. Running the shell script `./build-container.sh` in this directory will invoke the CLI tool `run-notebook create-container` with the right arguments for this. The resulting container will be called `r-notebook-runner`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "library(MASS)\n",
    "library(ggplot2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example uses the `biopsy` dataset from the `MASS` library included with R. This is a 699 sample dataset of biopsy results from breast cancer patients. It can be used to try out various machine learning algorithms. Here we just want to visualize the relationship between an input column and the biopsy result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b <- biopsy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dataset has two classes, \"benign\" and \"malignant\" and we want to get a feel for how predictive the input columns are. We simply create a numeric feature that's 1 for the malignant samples and 0 for the benign samples. This will allow us to plot a sigmoid that gives a feeling for the relationship between that column and the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b$classn[b$class == \"benign\"] <- 0\n",
    "b$classn[b$class == \"malignant\"] <- 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The biopsy dataset has 9 measured attributes. These are named `V1` through `V9`, which is not very helpful, but if you get help with `?biopsy` you'll get information about each column. Each of the attributes is scored as an integer from 1 to 10.\n",
    "\n",
    "We'll mark the following cell as the parameter cell. You can look at the sidebar with the wrench icon when you have the cell selected and you'll see that the cell has the `parameters` tag in it's metadata. That's how papermill knows where to substitute in the parameters passed when you run the notebook.\n",
    "\n",
    "By default, we'll see how predictive column `V1` (\"clump thickness\") is, but you can set the `columns` parameter to any of the values `V1` to `V9` when you run the notebook to use that column instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# This is the parameter cell. Override with \"V1\" to \"V9\" to pick one of the 9 input variables to graph.\n",
    "column = \"V1\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we use `ggplot` to visualize the relationship between the selected column and the biopsy result. There are two parts to this visualization:\n",
    "\n",
    "1. Point clouds at each of the possible positions. We use jitter and transparency to give us a feeling for the density at each of the 20 possible positions for an (attribute, result) pair.\n",
    "2. A smoothed line plotted with a logit and error to help us understand the predictive power of a linear model on this attribute alone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ggplot(b, aes(x=.data[[column]], y=classn)) +\n",
    "  geom_point(position=position_jitter(width=0.3, height=0.06), alpha=0.2, shape=21, size=1.5) + \n",
    "  stat_smooth(method=glm, method.args = list(family = \"binomial\"), formula=y ~ x)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  },
  "sagemaker_run_notebook": {
   "saved_parameters": [
    {
     "name": "column",
     "value": "V2"
    }
   ]
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}