{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Query `Hudi` dataset using Spark SQL\n", "\n", "#### Topics covered in this example\n", "* Hudi operations like Insert, Upsert, Delete, Read and Incremental querying." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "## Prerequisites\n", "
\n", "NOTE : In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.
\n", "\n", "* To use Hudi with Amazon EMR Notebooks, you must first copy the Hudi jar files from the local file system to HDFS on the master node of the EMR cluster. You then use the notebook to configure your EMR notebook to use Hudi. Follow the `Setup` steps.\n", "* With Amazon EMR release version 5.28.0 and later, Amazon EMR installs Hudi components by default when Spark, Hive, or Presto is installed. The EMR cluster attached to this notebook should have the `Spark` and `Hive` applications installed.\n", "* This example uses a public dataset, hence the EMR cluster attached to this notebook must have internet connectivity.\n", "* This notebook uses the `Spark` kernel.\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "Hudi is a data management framework used to simplify incremental data processing and data pipeline development by providing record-level insert, update, upsert, and delete capabilities. By efficiently managing how data is laid out in Amazon S3, Hudi allows data to be ingested and updated in near real time. Hudi carefully maintains metadata of the actions performed on the dataset to help ensure that the actions are atomic and consistent.\n", "\n", "You can use Hive, Spark, or Presto to query a Hudi dataset interactively or build data processing pipelines using incremental pull. Incremental pull refers to the ability to pull only the data that changed between two actions.\n", "\n", "The post: Work with a Hudi Dataset provides detailed information.\n", "\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "1. Create an S3 bucket location to save your hudi dataset. For example: s3://EXAMPLE-BUCKET/my-hudi-dataset/\n", "\n", "2. Connect to the master node of the cluster using SSH and then copy the jar files from the local filesystem to HDFS as shown in the following examples. In the example, we create a directory in HDFS for clarity of file management. You can choose your own destination in HDFS, if desired.\n", "\n", "```\n", "hdfs dfs -mkdir -p /apps/hudi/lib hdfs dfs -copyFromLocal /usr/lib/hudi/hudi-spark-bundle.jar /apps/hudi/lib/hudi-spark-bundle.jar hdfs dfs -copyFromLocal /usr/lib/spark/external/lib/spark-avro.jar /apps/hudi/lib/spark-avro.jar\n", "```\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%configure\n", "{ \n", " \"conf\": \n", " {\n", " \"spark.jars\":\"hdfs:///apps/hudi/lib/hudi-spark-bundle.jar,hdfs:///apps/hudi/lib/spark-avro.jar\",\n", " \"spark.serializer\":\"org.apache.spark.serializer.KryoSerializer\",\n", " \"spark.sql.hive.convertMetastoreParquet\":\"false\"\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Initialize a Spark Session for Hudi. \n", "When using Scala, make sure you import the following classes in your Spark session. This needs to be done once per Spark session." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import org.apache.spark.sql.SaveMode\n", "import org.apache.spark.sql.functions._\n", "import org.apache.hudi.DataSourceWriteOptions\n", "import org.apache.hudi.config.HoodieWriteConfig\n", "import org.apache.hudi.hive.MultiPartKeysValueExtractor\n", "\n", "// Create a DataFrame\n", "inputDF = spark.createDataFrame(\n", " [\n", " (\"100\", \"2015-01-01\", \"2015-01-01T13:51:39.340396Z\"),\n", " (\"101\", \"2015-01-01\", \"2015-01-01T12:14:58.597216Z\"),\n", " (\"102\", \"2015-01-01\", \"2015-01-01T13:51:40.417052Z\"),\n", " (\"103\", \"2015-01-01\", \"2015-01-01T13:51:40.519832Z\"),\n", " (\"104\", \"2015-01-02\", \"2015-01-01T12:15:00.512679Z\"),\n", " (\"105\", \"2015-01-02\", \"2015-01-01T13:51:42.248818Z\"),\n", " ],\n", " [\"id\", \"creation_date\", \"last_update_time\"]\n", ")\n", "\n", "// Specify common DataSourceWriteOptions in the single hudiOptions variable\n", "hudiOptions = {\n", "\"hoodie.table.name\": \"my_hudi_table\",\n", "\"hoodie.datasource.write.recordkey.field\": \"id\",\n", "\"hoodie.datasource.write.partitionpath.field\": \"creation_date\",\n", "\"hoodie.datasource.write.precombine.field\": \"last_update_time\",\n", "\"hoodie.datasource.hive_sync.enable\": \"true\",\n", "\"hoodie.datasource.hive_sync.table\": \"my_hudi_table\",\n", "\"hoodie.datasource.hive_sync.partition_fields\": \"creation_date\",\n", "\"hoodie.datasource.hive_sync.partition_extractor_class\": \"org.apache.hudi.hive.MultiPartKeysValueExtractor\"\n", "}\n", "\n", "// Write a DataFrame as a Hudi dataset to the S3 location that you created in Step 1.\n", "inputDF.write\n", ".format(\"org.apache.hudi\")\n", ".option(\"hoodie.datasource.write.operation\", \"insert\")\n", ".options(**hudiOptions)\n", ".mode(\"overwrite\")\n", ".save(\"s3://EXAMPLE-BUCKET/my-hudi-dataset/\") // Change this to the S3 location that you created in Step 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Upsert Data \n", "Upsert refers to the ability to insert records into an existing dataset if they do not already exist or to update them if they do. \n", "The following example demonstrates how to upsert data by writing a DataFrame. \n", "Unlike the previous insert example, the `OPERATION_OPT_KEY` value is set to `UPSERT_OPERATION_OPT_VAL`. \n", "In addition, `.mode(SaveMode.Append)` is specified to indicate that the record should be appended." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Create a new DataFrame from the first row of inputDF with a different creation_date value\n", "updateDF = inputDF.limit(1).withColumn(\"creation_date\", lit(\"new_value\"))\n", "\n", "updateDF.write\n", ".format(\"org.apache.hudi\")\n", ".option(\"hoodie.datasource.write.operation\", \"upsert\")\n", ".options(**hudiOptions)\n", ".mode(\"append\")\n", ".save(\"s3://EXAMPLE-BUCKET/my-hudi-dataset/\") // Change this to the S3 location that you created in Step 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Delete a Record\n", "To hard delete a record, you can upsert an empty payload. In this case, the `PAYLOAD_CLASS_OPT_KEY` option specifies the `EmptyHoodieRecordPayload` class. \n", "The example uses the same DataFrame, updateDF, used in the upsert example to specify the same record." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "updateDF.write\n", ".format(\"org.apache.hudi\")\n", ".option(\"hoodie.datasource.write.operation\", \"upsert\")\n", ".option(\"hoodie.datasource.write.payload.class\", \"org.apache.hudi.common.model.EmptyHoodieRecordPayload\")\n", ".options(**hudiOptions)\n", ".mode(\"append\")\n", ".save(\"s3://EXAMPLE-BUCKET/my-hudi-dataset/\") // Change this to the S3 location that you created in Step 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read from a Hudi Dataset\n", "To retrieve data at the present point in time, Hudi performs snapshot queries by default. \n", "Following is an example of querying the dataset written to S3 in Write to a Hudi Dataset. \n", "Replace `s3://EXAMPLE-BUCKET/my-hudi-dataset` with your table path, and add wildcard asterisks for each partition level, plus one additional asterisk. \n", "In this example, there is one partition level, so we’ve added two wildcard symbols." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "snapshotQueryDF = spark.read\n", " .format(\"org.apache.hudi\")\n", " .load(\"s3://EXAMPLE-BUCKET/my-hudi-dataset\" + \"/*/*\") // Change this to the S3 location that you created in Step 1.\n", " \n", "snapshotQueryDF.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Incremental Queries\n", "You can also perform incremental queries with Hudi to get a stream of records that have changed since a given commit timestamp. \n", "To do so, set the `QUERY_TYPE_OPT_KEY` field to `QUERY_TYPE_INCREMENTAL_OPT_VAL`. Then, add a value for `BEGIN_INSTANTTIME_OPT_KEY` to obtain all records written since the specified time. \n", "Incremental queries are typically ten times more efficient than their batch counterparts since they only process changed records.\n", "\n", "When you perform incremental queries, use the root (base) table path without the wildcard asterisks used for Snapshot queries." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "readOptions = {\n", " \"hoodie.datasource.query.type\": \"incremental\",\n", " \"hoodie.datasource.read.begin.instanttime\": ,\n", "}\n", "\n", "incQueryDF = spark.read\n", " .format(\"org.apache.hudi\")\n", " .options(**readOptions)\n", " .load(\"s3://EXAMPLE-BUCKET/my-hudi-dataset\") // Change this to the S3 location that you created in Step 1.\n", " \n", "incQueryDF.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Spark", "language": "", "name": "sparkkernel" }, "language_info": { "codemirror_mode": "text/x-scala", "mimetype": "text/x-scala", "name": "scala", "pygments_lexer": "scala" } }, "nbformat": 4, "nbformat_minor": 4 }