{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Install notebook-scoped libraries at runtime\n", "\n", "\n", "#### Topics covered in this example\n", "* Installing notebook-scoped libraries at runtime using `install_pypi_package`.\n", "* Converting a Spark DataFrame to a pandas DataFrame.\n", "* Using the `%%display` magic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "## Prerequisites\n", "
\n", "NOTE: To run this notebook successfully as is, ensure that the following prerequisites are met.
\n", "\n", "* The EMR cluster attached to this notebook should have the `Spark` application installed.\n", "* This example uses a public dataset, so the EMR cluster attached to this notebook must have internet connectivity.\n", "* This notebook uses the `PySpark` kernel.\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This example shows how to use the notebook-scoped libraries feature of EMR Notebooks to install Python libraries at runtime on your EMR cluster, and how to use these libraries to enhance your data analysis and visualize your results in plots.\n", "\n", "The example also shows how to use the `%%display` magic command and how to convert a Spark DataFrame to a pandas DataFrame.\n", "\n", "For detailed information, see the blog post *Install Python libraries on a running cluster with EMR Notebooks* and the documentation page *Installing and Using Kernels and Libraries*.\n", "\n", "We use the Amazon customer reviews dataset, which is publicly accessible in Amazon S3. \n", "This dataset is a collection of reviews written in the Amazon.com marketplace, with associated metadata, from 1995 until 2015. It is intended to facilitate the study of the properties (and the evolution) of customer reviews, including how people evaluate and express their experiences with products at scale.\n", "\n", "****" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine the current notebook session configuration." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%info" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before starting the analysis, check the libraries that are already available on the cluster. \n", "Use the `list_packages()` PySpark API, which lists all the Python libraries on the cluster." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.list_packages()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.011228, "end_time": "2020-10-22T17:26:11.518665", "exception": false, "start_time": "2020-10-22T17:26:11.507437", "status": "completed" }, "tags": [] }, "source": [ "Load the Amazon customer reviews data for books into a Spark DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "papermill": { "duration": 29.402308, "end_time": "2020-10-22T17:26:40.932705", "exception": false, "start_time": "2020-10-22T17:26:11.530397", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "df = spark.read.parquet(\"s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet\")" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.008858, "end_time": "2020-10-22T17:26:40.952269", "exception": false, "start_time": "2020-10-22T17:26:40.943411", "status": "completed" }, "tags": [] }, "source": [ "Determine the schema and number of available columns in the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "papermill": { "duration": 0.775197, "end_time": "2020-10-22T17:26:41.738952", "exception": false, "start_time": "2020-10-22T17:26:40.963755", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# Total columns\n", "print(f\"Total Columns: {len(df.dtypes)}\")\n", "df.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.016013, "end_time": "2020-10-22T17:26:41.771853", "exception": false, "start_time": "2020-10-22T17:26:41.755840", "status": "completed" }, "tags": [] }, "source": [ "Determine the number of available rows in the dataset." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "papermill": { "duration": 0.080088, "end_time": "2020-10-22T17:26:41.867894", "exception": false, "start_time": "2020-10-22T17:26:41.787806", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# Total rows\n", "print(f\"Total Rows: {df.count():,}\")" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.016419, "end_time": "2020-10-22T17:26:41.899107", "exception": false, "start_time": "2020-10-22T17:26:41.882688", "status": "completed" }, "tags": [] }, "source": [ "Check the total number of books." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "papermill": { "duration": 0.815754, "end_time": "2020-10-22T17:26:42.728573", "exception": false, "start_time": "2020-10-22T17:26:41.912819", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "# Total number of books\n", "num_of_books = df.select(\"product_id\").distinct().count()\n", "print(f\"Number of Books: {num_of_books:,}\")" ] }, { "cell_type": "markdown", "metadata": { "papermill": { "duration": 0.023551, "end_time": "2020-10-22T17:26:42.773267", "exception": false, "start_time": "2020-10-22T17:26:42.749716", "status": "completed" }, "tags": [] }, "source": [ "Install pandas version 0.25.1 and the latest version of Matplotlib from the public PyPI repository \n", "on the cluster attached to the notebook, using the `install_pypi_package` API.\n", "\n", "The `install_pypi_package` PySpark API installs the libraries along with any associated dependencies. \n", "By default, it installs the latest version of the library that is compatible with the Python version you are using. \n", "To install a specific version instead, pin it in the package specifier, as `pandas==0.25.1` does in the following cell." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "papermill": { "duration": 0.296075, "end_time": "2020-10-22T17:26:43.125913", "exception": false, "start_time": "2020-10-22T17:26:42.829838", "status": "completed" }, "tags": [] }, "outputs": [], "source": [ "sc.install_pypi_package(\"pandas==0.25.1\")  # Install pandas version 0.25.1\n", "sc.install_pypi_package(\"matplotlib\", \"https://pypi.org/simple\")  # Install the latest matplotlib from the given PyPI repository" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that the packages were installed successfully." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.list_packages()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `%%display` magic command displays a Spark DataFrame as an HTML table with horizontal and vertical scroll bars. \n", "Use the `%%display` magic command to show the number of reviews per year." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%display\n", "df.groupBy(\"year\").count().orderBy(\"year\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Analyze the trend in the number of reviews across years. \n", "Use `toPandas()` to convert the Spark DataFrame to a pandas DataFrame so that it can be visualized with Matplotlib." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Number of reviews across years\n", "num_of_reviews_by_year = df.groupBy(\"year\").count().orderBy(\"year\").toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the number of book reviews by year." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.clf()\n", "num_of_reviews_by_year.plot(kind=\"area\", x=\"year\",y=\"count\", rot=70, color=\"#bc5090\", legend=None, figsize=(8,6))\n", "plt.xticks(num_of_reviews_by_year.year)\n", "plt.xlim(1995, 2015)\n", "plt.title(\"Number of reviews across years\")\n", "plt.xlabel(\"Year\")\n", "plt.ylabel(\"Number of Reviews\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The preceding commands render the plot on the attached EMR cluster. To visualize the plot within your notebook, use `%matplot` magic." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplot plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Analyze the distribution of star ratings and visualize it using a pie chart." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Distribution of overall star ratings\n", "product_ratings_dist = df.groupBy(\"star_rating\").count().orderBy(\"count\").toPandas()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.clf()\n", "labels = [f\"Star Rating: {rating}\" for rating in product_ratings_dist[\"star_rating\"]]\n", "reviews = [num_reviews for num_reviews in product_ratings_dist[\"count\"]]\n", "colors = [\"#00876c\", \"#89c079\", \"#fff392\", \"#fc9e5a\", \"#de425b\"]\n", "fig, ax = plt.subplots(figsize=(8,5))\n", "w,a,b = ax.pie(reviews, autopct=\"%1.1f%%\", colors=colors)\n", "plt.title(\"Distribution of star ratings for books\")\n", "ax.legend(w, labels, title=\"Star Ratings\", loc=\"center left\", bbox_to_anchor=(1, 0, 0.5, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the pie chart using `%matplot` magic and visualize it" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplot 
plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pie chart shows that 80% of users gave a rating of 4 or higher. Approximately 10% of users rated their books 2 or lower. In general, customers are happy with their book purchases from Amazon." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "## Cleanup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, use the `uninstall_package` PySpark API to uninstall the `pandas` and `matplotlib` libraries that were installed using the `install_pypi_package` API." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.uninstall_package(\"pandas\")\n", "sc.uninstall_package(\"matplotlib\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.list_packages()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After you close your notebook, the pandas and Matplotlib libraries that you installed on the cluster using the `install_pypi_package` API are garbage collected from the cluster." ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "PySpark", "language": "", "name": "pysparkkernel" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python3" }, "papermill": { "duration": null, "end_time": null, "environment_variables": {}, "exception": null, "input_path": "/home/notebook/work/demo_pyspark.ipynb", "output_path": "/home/notebook/work/executions/ex-IZXBKZLT803GPIP3MMFA31DW8ASYM/demo_pyspark.ipynb", "parameters": { "DATE": "10-20-2020", "TOP_K": 6, "US_STATES": [ "Wisconsin", "Texas", "Nevada" ] }, "start_time": "2020-10-22T17:25:45.424746", "version": "1.2.1" } }, "nbformat": 4, "nbformat_minor": 4 }