{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "8b1bc4f8",
   "metadata": {},
   "source": [
    "# Using Jupyter Magics on EMR Studio\n",
    "\n",
    "\n",
    "#### Topics covered in this example:\n",
    "\n",
    "* Built-in Magics\n",
    "* EMR Magics\n",
    "    * Mounting a Workspace Directory\n",
    "    * Executing local python files or package\n",
    "    * Downloading a file\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4602d4e6",
   "metadata": {},
   "source": [
    "***\n",
    "\n",
    "## Prerequisites\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>NOTE :</b> In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.</div>\n",
    "\n",
    "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n",
    "* This notebook uses the `PySpark` kernel.\n",
    "***\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ed4a9c0",
   "metadata": {},
   "source": [
    "## Built-in Magics\n",
    "\n",
    "Jupyter magics act as convenient functions that accomplish something useful and saves the effort of writing Python code instead. There are useful buiilt-in magic functions and some are unique to EMR Studio, we document them here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-magics.html\n",
    "\n",
    "The most important magic commands are:\n",
    "\n",
    "1. `%%configure` - allows you to change the session properties of a Spark session in Spark Kernels:\n",
    "\n",
    "```\n",
    "%%configure -f\n",
    "{ \"conf\": {\n",
    "     spark.submit.deployMode\":\"cluster\"\n",
    "     }\n",
    "}\n",
    "```\n",
    "\n",
    "2. `%%display` - is only available in Spark Kernels and allows you to display the rows of a Spark dataframe in a tabular format in addition to providing the ability to visualize the rows in a chart.\n",
    "\n",
    "```\n",
    "%%display df\n",
    "```\n",
    "\n",
    "Let\"s see the `%%display` magic in action:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96714577",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [{\n",
    "    \"Category\": \"A\",\n",
    "    \"ID\": 1,\n",
    "    \"Value\": 121.44,\n",
    "    \"Truth\": True\n",
    "}, {\n",
    "    \"Category\": \"B\",\n",
    "    \"ID\": 2,\n",
    "    \"Value\": 300.01,\n",
    "    \"Truth\": False\n",
    "}, {\n",
    "    \"Category\": \"C\",\n",
    "    \"ID\": 3,\n",
    "    \"Value\": 10.99,\n",
    "    \"Truth\": None\n",
    "}, {\n",
    "    \"Category\": \"E\",\n",
    "    \"ID\": 4,\n",
    "    \"Value\": 33.87,\n",
    "    \"Truth\": True\n",
    "}]\n",
    "\n",
    "df = spark.createDataFrame(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1e29448c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%display\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ea11a01",
   "metadata": {},
   "source": [
    "## EMR Magics\n",
    "\n",
    "The EMR magics package available here (https://pypi.org/simple emr-notebooks-magics) offers the following magics that can be used on Python3 kernels as well as Spark Kernels on EMR Studio. The two magics we discuss in this notebook are:\n",
    "\n",
    "* mount_workspace_dir - allows you to mount an EMR Studio Workspace directory to an EMR Cluster.\n",
    "* generate_s3_download_url - alows you to generate a temporary signed download URL for an S3 object.\n",
    "\n",
    "Lets install the EMR-notebooks-magics package on your EMR Cluster:\n",
    "\n",
    "```\n",
    "%pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple emr-notebooks-magics\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "569f2868",
   "metadata": {},
   "source": [
    "### Mounting a Workspace Directory\n",
    "\n",
    "Lets mount a Workpsace directory on to the EMR cluster:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa127024",
   "metadata": {},
   "outputs": [],
   "source": [
    "%mount_workspace_dir <Workspace_Directory>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c36a413",
   "metadata": {},
   "source": [
    "Note that your current directory changes to the mounted Workspace directory and you can list the contents in it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cb3ee468",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "pwd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f868793",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "750782e7",
   "metadata": {},
   "source": [
    "### Executing Local Python Files or Packages.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "983b4594",
   "metadata": {},
   "source": [
    "You can now execute local python files and packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d6d05ba",
   "metadata": {},
   "outputs": [],
   "source": [
    "%run -i  \"\"<local_python_file.py>\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84148202",
   "metadata": {},
   "source": [
    "### Downloading a File.\n",
    "\n",
    "Sometimes we need to download a file to our local desktop for e.g. to further analyze some data in Excel. Lets now see the `generate_s3_download_url` magic in action that allows us to do just that. We save the dataframe as a Parquet file in an S3 bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bec9d8dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = _\n",
    "s3_url = \"s3://<bucket>/<prefix>/<filename>.parquet.gzip\"\n",
    "df.to_parquet(s3_url, compression=\"gzip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fcc17350",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sh\n",
    "aws s3 ls s3://<bucket>/<prefix>/<filename>.parquet.gzip"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c671cc85",
   "metadata": {},
   "source": [
    "We can now generate a download URL for this file. Note that the url is a temporary one and the command provides options on how long the url should be available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbda72bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "%generate_s3_download_url s3://<bucket>/<prefix>/<filename>.parquet.gzip"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f853f3a",
   "metadata": {},
   "source": [
    "and then view the link for the output."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PySpark",
   "language": "",
   "name": "pysparkkernel"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "python",
    "version": 3
   },
   "mimetype": "text/x-python",
   "name": "pyspark",
   "pygments_lexer": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}