{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualize data with `pandas` and `matplotlib`\n",
    "\n",
    "#### Topics covered in this example\n",
    "* Installing python libraries on the EMR cluster.\n",
    "* Reading data as `pandas` DataFrame and converting to `NumPy` array.\n",
    "* Pie, bar, histogram and scatter plots using `matplotlib`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "## Prerequisites\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>NOTE :</b> In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.</div>\n",
    "\n",
    "* This example uses a public dataset from stanford, hence the EMR cluster attached to this notebook must have internet connectivity.\n",
    "* You may have to restart the kernel after installing the libraries.\n",
    "* This notebook uses the `Python3` kernel.\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "In this example, we are going to visualize data using the `matplotlib` library.  \n",
    "<a href=\"https://github.com/matplotlib/matplotlib\" target=\"_blank\">Matplotlib</a> is one of the most popular Python packages used for data visualization.\n",
    "<a href=\"https://github.com/matplotlib/matplotlib/tree/master/examples\" target=\"_blank\">Gallery</a> covers a variety of ways to use Matplotlib.\n",
    "\n",
    "We use a public dataset (csv) from <a href=\"https://web.stanford.edu/class/cs102/datasets/Countries.csv\" target=\"_blank\">web.stanford.edu</a>  \n",
    "This dataset is a collection of information on coastal status, EU membership, and population (in millions) of different countries.\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Install dependency `matplotlib`.\n",
    "\n",
    "`%pip install` is the same as `!/emr/notebook-env/bin/pip install` and are installed in `/home/emr-notebook/`.\n",
    "\n",
    "After installation, these libraries are available to any user running an EMR notebook attached to the cluster. Python libraries installed this way are available only to processes running on the master node. The libraries are not installed on core or task nodes and are not available to executors running on those nodes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import dependencies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import math\n",
    "import matplotlib"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load the dataset into a pandas DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"https://web.stanford.edu/class/cs102/datasets/Countries.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print some sample records.  \n",
    "`Population` is in millions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.head(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add `% to total` in the dataset. The total of the `% to total` of all the rows is 100."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"% to total\"] = df[\"population\"].apply(lambda x: float(x/df[\"population\"].sum()*100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print sample records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(df.head(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Display top 10 populated countries for the analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "top_ten_df = df.nlargest(10, [\"% to total\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pie chart for the population of top 10 countries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.pie(top_ten_df[\"% to total\"], labels=top_ten_df[\"country\"], shadow=False, startangle=0, frame=False, autopct=\"%1.1f%%\")\n",
    "plt.axis(\"equal\")\n",
    "plt.title(\"Country Population %\", loc=\"left\")\n",
    "plt.legend(labels = top_ten_df[\"country\"], bbox_to_anchor=(1.7,0.5), loc=\"right\")\n",
    "fig = plt.gcf()\n",
    "fig.set_size_inches(6,6) \n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bar chart for the population of top 10 countries\n",
    "\n",
    "`to_numpy()` converts the pandas DataFrame to a `NumPy` array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = top_ten_df[\"country\"].to_numpy()\n",
    "y = top_ten_df[\"population\"].to_numpy()\n",
    "\n",
    "x_pos = [i for i, _ in enumerate(x)]\n",
    "\n",
    "plt.bar(x_pos, y, color=\"green\")\n",
    "plt.xlabel(\"Country\")\n",
    "plt.ylabel(\"Population\")\n",
    "plt.title(\"Country Population\")\n",
    "fig = plt.gcf()\n",
    "fig.set_size_inches(13,6) \n",
    "plt.xticks(x_pos, x)\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Histogram for population of all countries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population_dataset = df[\"population\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find bin size for the histogram\n",
    "n_bin = math.ceil((2*(np.percentile(population_dataset, 75) - np.percentile(population_dataset, 25))/population_dataset.count()**(1/3)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.hist(population_dataset, bins=n_bin, color=\"blue\") \n",
    "plt.xlabel(\"Country Population\")\n",
    "plt.ylabel(\"Number of Countries\")\n",
    "plt.title(\"Country Population Histogram\")\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scatter plot for Boston Houses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tabular data: Boston Housing Data.\n",
    "The Boston House contains information collected by the U.S Census Service concerning housing in the area of Boston Mass.\n",
    "\n",
    "`CRIM` - per capita crime rate by town.  \n",
    "`LSTAT` - % lower status of the population.  \n",
    "\n",
    "Install dependancy `sklearn`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import *\n",
    "tabular_data = load_boston()\n",
    "tabular_data_df = pd.DataFrame(tabular_data.data, columns=tabular_data.feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print sample records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tabular_data_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = tabular_data_df[\"CRIM\"]\n",
    "y = tabular_data_df[\"LSTAT\"]\n",
    "\n",
    "plt.scatter(x, y, color=\"blue\", marker=\"o\")\n",
    "plt.xlabel(\"Crime\")\n",
    "plt.ylabel(\"Lower Status of Population\")\n",
    "plt.title(\"Scatterplot for Crime and lower status of the population\")\n",
    "plt.grid(True)\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}