{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": false, "editable": false, "trusted": true }, "source": [ "\n", "# Glue Studio Notebook\n", "You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.\n", "\n", "## Available Magics\n", "| Magic | Type | Description |\n", "|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|\n", "| %%configure | Dictionary | A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |\n", "| %profile | String | Specify a profile in your aws configuration to use as the credentials provider. |\n", "| %iam_role | String | Specify an IAM role to execute your session with. |\n", "| %region | String | Specify the AWS region in which to initialize a session. |\n", "| %session_id | String | Returns the session ID for the running session. |\n", "| %connections | List | Specify a comma separated list of connections to use in the session. |\n", "| %additional_python_modules | List | Comma separated list of pip packages, s3 paths or private pip arguments. |\n", "| %extra_py_files | List | Comma separated list of additional Python files from S3. |\n", "| %extra_jars | List | Comma separated list of additional Jars to include in the cluster. |\n", "| %number_of_workers | Integer | The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too. |\n", "| %glue_version | String | The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (eg: %glue_version 2.0). |\n", "| %security_config | String | Define a security configuration to be used with this session. |\n", "| %sql | String | Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code. |\n", "| %streaming | String | Changes the session type to Glue Streaming. |\n", "| %etl | String | Changes the session type to Glue ETL. |\n", "| %status | | Returns the status of the current Glue session including its duration, configuration and executing user / role. |\n", "| %stop_session | | Stops the current session. |\n", "| %list_sessions | | Lists all currently running sessions by name and ID. |\n", "| %worker_type | String | Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X. |\n", "| %spark_conf | String | Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer. |" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Prepare and configure SparkSession with Delta Lake configuration" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "editable": true, "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting session ID prefix to native-delta-sql-\n", "Setting Glue version to: 3.0\n", "Current idle_timeout is 15 minutes.\n", "idle_timeout has been set to 60 minutes.\n", "The following configurations have been updated: {'--conf': 'spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog', '--datalake-formats': 'delta'}\n" ] } ], "source": [ "%session_id_prefix native-delta-sql-\n", "%glue_version 3.0\n", "%idle_timeout 60\n", "%%configure \n", "{\n", " \"--conf\": \"spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog\",\n", " \"--datalake-formats\": \"delta\"\n", "}" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::123456789012:role/AWSGlueServiceRole-StudioNotebook\n", "Trying to create a Glue session for the kernel.\n", "Worker Type: G.1X\n", "Number of Workers: 5\n", "Session ID: native-delta-sql--162db4ea-e072-44c3-ba6e-290bfddca0d4\n", "Job Type: glueetl\n", "Applying the following default arguments:\n", "--glue_kernel_version 0.37.0\n", "--enable-glue-datacatalog true\n", "--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog\n", "--datalake-formats delta\n", "Waiting for session native-delta-sql--162db4ea-e072-44c3-ba6e-290bfddca0d4 to get into ready status...\n", "Session native-delta-sql--162db4ea-e072-44c3-ba6e-290bfddca0d4 has been created.\n", "database_location: s3://glue-deltalake-demo-us-east-1/deltalake_db/\n", "table_location: s3://glue-deltalake-demo-us-east-1/deltalake_db/products/\n" ] } ], "source": [ "bucket_name = \"glue-deltalake-demo-us-east-1\"\n", "\n", "database_name = \"deltalake_db\"\n", "database_location = f\"s3://{bucket_name}/{database_name}/\"\n", "\n", "table_name = \"products\"\n", "table_location = f\"s3://{bucket_name}/{database_name}/{table_name}/\"\n", "\n", "print(f\"database_location: {database_location}\")\n", "print(f\"table_location: {table_location}\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Create a Delta Lake table" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "++\n", "||\n", "++\n", "++\n" ] } ], "source": [ "%%sql\n", "CREATE DATABASE IF NOT EXISTS deltalake_db" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "DataFrame[]\n" ] } ], "source": [ "# create table in metastore\n", "query = f\"\"\"\n", "CREATE TABLE IF NOT EXISTS {database_name}.{table_name} (\n", " product_id STRING,\n", " product_name STRING,\n", " price INT,\n", " category STRING,\n", " updated_at TIMESTAMP\n", ")\n", "USING delta\n", "PARTITIONED BY (category)\n", "LOCATION '{table_location}'\n", "\"\"\"\n", "\n", "spark.sql(query)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "++\n", "||\n", "++\n", "++\n" ] } ], "source": [ "%%sql\n", "USE deltalake_db" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------+---------+-----------+\n", "| database|tableName|isTemporary|\n", "+------------+---------+-----------+\n", "|deltalake_db| products| false|\n", "+------------+---------+-----------+\n" ] } ], "source": [ "%%sql\n", "SHOW TABLES" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Read from Delta Lake table" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+--------------------+-----+-------------+-------------------+\n", "|product_id| product_name|price| category| updated_at|\n", "+----------+--------------------+-----+-------------+-------------------+\n", "| 00007|Mercedes-Benz C-C...| 7341| Nissan|2023-02-15 01:26:27|\n", "| 00019| Cadillac De Ville| 5707| Nissan|2023-02-15 12:17:03|\n", "| 00015| Chevrolet Impala| 3128| Maybach|2023-02-15 00:41:04|\n", "| 00004| Chevrolet Impala| 4054|Mercedes-Benz|2023-02-15 06:05:01|\n", "| 00012| ZAZ Zaporozhets| 5665| Fiat|2023-02-15 11:11:35|\n", "| 00003| Imperial Crown| 4803| Mazda|2023-02-15 12:36:15|\n", "| 00010| Ford F-Series| 5070| Mazda|2023-02-15 05:10:21|\n", "| 00011| Metropolitan| 3622| Maybach|2023-02-15 06:09:05|\n", "| 00013| Alpine A310| 4923| Chevrolet|2023-02-15 01:30:13|\n", "| 00006| Fiat Uno| 6399| Lancia|2023-02-15 02:22:30|\n", "| 00016| Koenigsegg CCX| 2261| Peugeot|2023-02-15 06:10:31|\n", "| 00018| Jeep Cherokee (XJ)| 2601| Peugeot|2023-02-15 09:36:17|\n", "| 00008| Peugeot 206| 2855| Mazda|2023-02-15 06:28:07|\n", "| 00014| Toyota Corolla| 7866| Mazda|2023-02-15 00:04:17|\n", "| 00005| Mazda MX-5| 3700| Daihatsu|2023-02-15 06:37:42|\n", "| 00009| Toyota Corolla|12050| Daihatsu|2023-02-15 05:19:43|\n", "| 00017| SEAT Ibiza|11744| Mazda|2023-02-15 03:29:30|\n", "| 00020| Mazda Familia| 7647| Mazda|2023-02-15 10:22:04|\n", "+----------+--------------------+-----+-------------+-------------------+\n" ] } ], "source": [ "%%sql # Read table from metastore\n", "SELECT * FROM deltalake_db.products" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### View History" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------+-------------------+------+--------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+\n", "|version|timestamp |userId|userName|operation |operationParameters |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics |userMetadata|\n", "+-------+-------------------+------+--------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+\n", "|2 |2023-02-15 03:47:48|null |null |MERGE |{predicate -> (new.`product_id` = old.`product_id`), matchedPredicates -> [{\"actionType\":\"update\"}], notMatchedPredicates -> [{\"actionType\":\"insert\"}]}|null|null |null |1 |null |false |{numTargetRowsCopied -> 1, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 12, executionTimeMs -> 3196, numTargetRowsInserted -> 10, scanTimeMs -> 1814, numTargetRowsUpdated -> 4, numOutputRows -> 15, numSourceRows -> 14, numTargetFilesRemoved -> 4, rewriteTimeMs -> 1290}|null |\n", "|1 |2023-02-15 03:46:51|null |null |MERGE |{predicate -> (new.`product_id` = old.`product_id`), matchedPredicates -> [{\"actionType\":\"update\"}], notMatchedPredicates -> [{\"actionType\":\"insert\"}]}|null|null |null |0 |null |false |{numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 6, executionTimeMs -> 5235, numTargetRowsInserted -> 8, scanTimeMs -> 1994, numTargetRowsUpdated -> 0, numOutputRows -> 8, numSourceRows -> 8, numTargetFilesRemoved -> 0, rewriteTimeMs -> 3194} |null |\n", "|0 |2023-02-15 03:46:29|null |null |CREATE TABLE|{isManaged -> false, description -> null, partitionBy -> [\"category\"], properties -> {}} |null|null |null |null |null |true |{} |null |\n", "+-------+-------------------+------+--------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+----+--------+---------+-----------+--------------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+\n" ] } ], "source": [ "query = f\"\"\"DESCRIBE HISTORY delta.`{table_location}` \"\"\"\n", "spark.sql(query).show(truncate=False)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Query with time travel" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "trusted": true, "vscode": { "languageId": "python_glue_session" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----------+-----------------+-----+----------+-------------------+\n", "|product_id|product_name |price|category |updated_at |\n", "+----------+-----------------+-----+----------+-------------------+\n", "|00019 |Cadillac De Ville|5707 |Nissan |2023-02-15 12:17:03|\n", "|00009 |De Tomaso Pantera|7232 |Chevrolet |2023-02-15 04:20:14|\n", "|00020 |Volkswagen Beetle|2018 |Volkswagen|2023-02-15 03:18:10|\n", "|00016 |Metropolitan |8818 |Mazda |2023-02-15 03:10:14|\n", "|00013 |Ford Model T |2404 |Maybach |2023-02-15 05:16:19|\n", "|00015 |Chevrolet Impala |3128 |Maybach |2023-02-15 00:41:04|\n", "|00008 |Peugeot 206 |2855 |Mazda |2023-02-15 06:28:07|\n", "|00014 |Toyota Corolla |7866 |Mazda |2023-02-15 00:04:17|\n", "+----------+-----------------+-----+----------+-------------------+\n" ] } ], "source": [ "query = f\"\"\"SELECT * FROM delta.`{table_location}@v1`\"\"\" # Using a version number\n", "spark.sql(query).show(truncate=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "plaintext" } }, "outputs": [ { "name": "stdout", "text": "Stopping session: native-delta-sql--162db4ea-e072-44c3-ba6e-290bfddca0d4\nStopped session.\n", "output_type": "stream" } ], "source": [ "%stop_session" ] } ], "metadata": { "kernelspec": { "display_name": "Glue PySpark", "language": "python", "name": "glue_pyspark" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "Python_Glue_Session", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }