{ "cells": [ { "cell_type": "markdown", "id": "8b1bc4f8", "metadata": {}, "source": [ "# Using Jupyter Magics on EMR Studio\n", "\n", "\n", "#### Topics covered in this example:\n", "\n", "* Built-in Magics\n", "* EMR Magics\n", " * Mounting a Workspace Directory\n", " * Executing local python files or package\n", " * Downloading a file\n" ] }, { "cell_type": "markdown", "id": "4602d4e6", "metadata": {}, "source": [ "***\n", "\n", "## Prerequisites\n", "
\n", "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n", "\n", "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n", "* This notebook uses the `PySpark` kernel.\n", "***\n", "\n" ] }, { "cell_type": "markdown", "id": "4ed4a9c0", "metadata": {}, "source": [ "## Built-in Magics\n", "\n", "Jupyter magics act as convenient functions that accomplish something useful and saves the effort of writing Python code instead. There are useful buiilt-in magic functions and some are unique to EMR Studio, we document them here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-magics.html\n", "\n", "The most important magic commands are:\n", "\n", "1. `%%configure` - allows you to change the session properties of a Spark session in Spark Kernels:\n", "\n", "```\n", "%%configure -f\n", "{ \"conf\": {\n", " spark.submit.deployMode\":\"cluster\"\n", " }\n", "}\n", "```\n", "\n", "2. `%%display` - is only available in Spark Kernels and allows you to display the rows of a Spark dataframe in a tabular format in addition to providing the ability to visualize the rows in a chart.\n", "\n", "```\n", "%%display df\n", "```\n", "\n", "Let\"s see the `%%display` magic in action:" ] }, { "cell_type": "code", "execution_count": null, "id": "96714577", "metadata": {}, "outputs": [], "source": [ "data = [{\n", " \"Category\": \"A\",\n", " \"ID\": 1,\n", " \"Value\": 121.44,\n", " \"Truth\": True\n", "}, {\n", " \"Category\": \"B\",\n", " \"ID\": 2,\n", " \"Value\": 300.01,\n", " \"Truth\": False\n", "}, {\n", " \"Category\": \"C\",\n", " \"ID\": 3,\n", " \"Value\": 10.99,\n", " \"Truth\": None\n", "}, {\n", " \"Category\": \"E\",\n", " \"ID\": 4,\n", " \"Value\": 33.87,\n", " \"Truth\": True\n", "}]\n", "\n", "df = spark.createDataFrame(data)" ] }, { "cell_type": "code", "execution_count": null, "id": "1e29448c", "metadata": {}, "outputs": [], "source": [ "%%display\n", "df" ] }, { "cell_type": "markdown", "id": "5ea11a01", "metadata": {}, "source": [ "## EMR Magics\n", "\n", "The EMR magics package available here (https://pypi.org/simple emr-notebooks-magics) offers the following magics that can be used on Python3 kernels as well as Spark Kernels on EMR Studio. The two magics we discuss in this notebook are:\n", "\n", "* mount_workspace_dir - allows you to mount an EMR Studio Workspace directory to an EMR Cluster.\n", "* generate_s3_download_url - alows you to generate a temporary signed download URL for an S3 object.\n", "\n", "Lets install the EMR-notebooks-magics package on your EMR Cluster:\n", "\n", "```\n", "%pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple emr-notebooks-magics\n", "```" ] }, { "cell_type": "markdown", "id": "569f2868", "metadata": {}, "source": [ "### Mounting a Workspace Directory\n", "\n", "Lets mount a Workpsace directory on to the EMR cluster:" ] }, { "cell_type": "code", "execution_count": null, "id": "aa127024", "metadata": {}, "outputs": [], "source": [ "%mount_workspace_dir " ] }, { "cell_type": "markdown", "id": "7c36a413", "metadata": {}, "source": [ "Note that your current directory changes to the mounted Workspace directory and you can list the contents in it." 
{ "cell_type": "markdown", "id": "84148202", "metadata": {}, "source": [ "### Downloading a File\n", "\n", "Sometimes we need to download a file to our local desktop, for example to analyze some data further in Excel. The `generate_s3_download_url` magic lets us do just that. First, let's save the DataFrame as a gzip-compressed Parquet file in an S3 bucket:" ] },
{ "cell_type": "code", "execution_count": null, "id": "bec9d8dd", "metadata": {}, "outputs": [], "source": [ "# Convert the Spark DataFrame to pandas so we can write a single Parquet file.\n", "# Replace the <bucket-name>/<prefix>/<file-name> placeholders with your own S3 location;\n", "# writing to s3:// paths from pandas assumes s3fs (or a compatible filesystem) is available.\n", "pdf = df.toPandas()\n", "\n", "s3_url = \"s3://<bucket-name>/<prefix>/<file-name>.parquet.gzip\"\n", "pdf.to_parquet(s3_url, compression=\"gzip\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "fcc17350", "metadata": {}, "outputs": [], "source": [ "%%sh\n", "# Verify that the file landed in S3 (use the same placeholder values as above)\n", "aws s3 ls s3://<bucket-name>/<prefix>/<file-name>.parquet.gzip" ] },
{ "cell_type": "markdown", "id": "c671cc85", "metadata": {}, "source": [ "We can now generate a download URL for this file. The URL is temporary, and the magic provides options that control how long it remains valid." ] },
{ "cell_type": "code", "execution_count": null, "id": "dbda72bf", "metadata": {}, "outputs": [], "source": [ "%generate_s3_download_url s3://<bucket-name>/<prefix>/<file-name>.parquet.gzip" ] },
{ "cell_type": "markdown", "id": "2f853f3a", "metadata": {}, "source": [ "Follow the link in the output to download the file." ] }
], "metadata": { "kernelspec": { "display_name": "PySpark", "language": "python", "name": "pysparkkernel" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }