{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)\n", "\n", "# 36 - Distributing Calls on Glue Interactive sessions\n", "\n", "AWS SDK for pandas is pre-loaded into [AWS Glue interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-overview.html) with Ray kernel, making it by far the easiest way to experiment with the library at scale." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In AWS Glue Studio, choose `Jupyter Notebook` to create an AWS Glue interactive session:\n", "\n", "![](_static/glue_is_create.png)\n", "\n", "Then select `Ray` as the kernel. The IAM role must trust the AWS Glue service principal.\n", "\n", "![](_static/glue_is_setup.png)\n", "\n", "Once the notebook is up and running you can import the library. Since we are running on AWS Glue with Ray, AWS SDK for pandas will automatically use the existing Ray cluster with no extra configuration needed." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Install the library" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install \"awswrangler[modin]\"" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "editable": true, "trusted": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Welcome to the Glue Interactive Sessions Kernel\n", "For more information on available magic commands, please type %help in any new cell.\n", "\n", "Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html\n", "Installed kernel version: 0.37.0 \n", "Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::977422593089:role/AWSGlueMantaTests\n", "Trying to create a Glue session for the kernel.\n", "Worker Type: Z.2X\n", "Number of Workers: 5\n", "Session ID: 309824f0-bad7-49d0-a2b4-e1b8c7368c5f\n", "Job Type: glueray\n", "Applying the following default arguments:\n", "--glue_kernel_version 0.37.0\n", "--enable-glue-datacatalog true\n", "Waiting for session 309824f0-bad7-49d0-a2b4-e1b8c7368c5f to get into ready status...\n", "Session 309824f0-bad7-49d0-a2b4-e1b8c7368c5f has been created.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2022-11-21 16:24:03,136\tINFO worker.py:1329 -- Connecting to existing Ray cluster at address: 2600:1f10:4674:6822:5b63:3324:984:3152:6379...\n", "2022-11-21 16:24:03,144\tINFO worker.py:1511 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32m127.0.0.1:8265 \u001b[39m\u001b[22m\n" ] } ], "source": [ "import awswrangler as wr" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "trusted": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Read progress: 100%|##########| 9/9 [00:10<00:00, 1.15s/it]\n", "UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1\n" ] } ], "source": [ "df = wr.s3.read_csv(path=\"s3://nyc-tlc/csv_backup/yellow_tripdata_2021-0*.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "trusted": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " VendorID tpep_pickup_datetime ... total_amount congestion_surcharge\n", "0 1.0 2021-01-01 00:30:10 ... 11.80 2.5\n", "1 1.0 2021-01-01 00:51:20 ... 4.30 0.0\n", "2 1.0 2021-01-01 00:43:30 ... 51.95 0.0\n", "3 1.0 2021-01-01 00:15:48 ... 36.35 0.0\n", "4 2.0 2021-01-01 00:31:49 ... 24.36 2.5\n", "\n", "[5 rows x 18 columns]\n" ] } ], "source": [ "df.head()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "To avoid incurring a charge, make sure to delete the Jupyter Notebook when you are done experimenting.\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "awswrangler-v9JnknIF-py3.8", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "pygments_lexer": "python3", "version": "3.8.5" }, "vscode": { "interpreter": { "hash": "83297b058d59ee0acd247586c837429190a8258f15c0eea6234359f5557dde51" } } }, "nbformat": 4, "nbformat_minor": 4 }