{ "cells": [ { "cell_type": "markdown", "id": "b8e51ab3-f07b-49cd-be19-03fcd3aba754", "metadata": {}, "source": [ "\n", "\n", "\n", "## Optimizing Data Lake Storage with Apache Iceberg table format\n", "\n", "Amazon S3 uses object tagging to categorize storage where each tag is a key-value pair.\n", "From Apache Iceberg perspective, it supports custom S3 Object tags that can be added\n", "to S3 objects while writing and deleting into the table. Iceberg Users can also configure\n", "tag-based object lifecycle policy at bucket level to transition objects to different S3 tiers.\n", "With the s3.delete.tags config property in Iceberg, objects are tagged with the configured\n", "key-value pairs before deletion. When the catalog property s3.delete-enabled is set\n", "to false, the objects are not hard-deleted from S3. This is expected to be used in\n", "combination with S3 delete tagging, so objects are tagged and removed using S3 lifecycle\n", "policy. This property is set to true by default.\n", "The example notebook in this blog shows example implementation of S3 Object tagging\n", "and Lifecycle rules for Apache Iceberg Tables to optimize the storage cost." ] }, { "cell_type": "markdown", "id": "f58fa5c3-fa0e-4df8-a153-6bc4dca79607", "metadata": {}, "source": [ "\n", "## Prerequisites\n", "\n", "In this example we will use Iceberg’s S3 Tags feature with the write tag as write-tag-name=created and delete tag as delete-tag-name=deleted. This example is demonstrated on an EMR emr-6.10.0 cluster with installed applications Hadoop 3.3.3, JupyterEnterpriseGateway 2.6.0, and Spark 3.3.1. The examples are executed on a Jupyter Notebook environment attached to the EMR cluster. To know more about how to create an EMR Cluster with Iceberg and how to use EMR Studio, please refer to the following documents: \\\n", "i. [Create an Iceberg EMR Cluster](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-spark-cluster.html) \\\n", "ii.[EMR Studio Guide](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio.html)\n" ] }, { "cell_type": "markdown", "id": "c9b0391e-3fd2-44bd-ae92-91c2bdef8a0b", "metadata": {}, "source": [ "\n", "\n", "\n", "## Configuring Iceberg on Spark session\n", "\n", "Configure your Spark session using the %%configure magic command. We will be using Hive Catalog for Iceberg Tables. 
\n", "Before you run the following step, create a S3 bucket in your AWS account with following naming convemtion /iceberg/\n", "\n", "Update the your-iceberg-storage-blog in below configuration with the bucket which you created to test this example" ] }, { "cell_type": "code", "execution_count": null, "id": "9dfff449-2164-4d6d-ac23-787e10f71fe3", "metadata": {}, "outputs": [], "source": [ "%%configure -f\n", "{\n", "\"conf\":{\n", " \"spark.sql.extensions\":\"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions\",\n", " \"spark.sql.catalog.dev\":\"org.apache.iceberg.spark.SparkCatalog\",\n", " \"spark.sql.catalog.dev.catalog-impl\":\"org.apache.iceberg.hive.HiveCatalog\",\n", " \"spark.sql.catalog.dev.io-impl\":\"org.apache.iceberg.aws.s3.S3FileIO\",\n", " \"spark.sql.catalog.dev.warehouse\":\"s3:///iceberg/\",\n", " \"spark.sql.catalog.dev.s3.write.tags.write-tag-name\":\"created\",\n", " \"spark.sql.catalog.dev.s3.delete.tags.delete-tag-name\":\"deleted\",\n", " \"spark.sql.catalog.dev.s3.delete-enabled\":\"false\"\n", " }\n", "}" ] }, { "cell_type": "markdown", "id": "c448540d-1894-49e1-bc4e-e0199c0320f2", "metadata": {}, "source": [ "Create an Iceberg Table to be loaded with Amazon Reviews" ] }, { "cell_type": "code", "execution_count": null, "id": "7e0f156b-5f8a-4e17-8e72-2e80cfb1701f", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\" DROP TABLE if exists dev.db.amazon_reviews_iceberg\"\"\")\n", "\n", "spark.sql(\"\"\" CREATE TABLE dev.db.amazon_reviews_iceberg (\n", " marketplace string,\n", " customer_id string,\n", " review_id string,\n", " product_id string,\n", " product_parent string,\n", " product_title string,\n", " star_rating int,\n", " helpful_votes int,\n", " total_votes int,\n", " vine string,\n", " verified_purchase string,\n", " review_headline string,\n", " review_body string,\n", " review_date date,\n", " year int)\n", "USING iceberg \n", "location 's3:///iceberg/db/amazon_reviews_iceberg'\n", "PARTITIONED BY (years(review_date))\"\"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "79432edc-6805-4f87-8d9d-0f0f12ffd91f", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\" select * from dev.db.amazon_reviews_iceberg\"\"\").show()" ] }, { "cell_type": "code", "execution_count": null, "id": "0920a9dd-9fc2-4d9a-9581-46e223968878", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\" select * from dev.db.amazon_reviews_iceberg.snapshots\"\"\").show()" ] }, { "cell_type": "markdown", "id": "9a43a9fc-8603-4f83-ba6d-abc7bdf24b0b", "metadata": {}, "source": [ "\n", "### Inserts" ] }, { "cell_type": "markdown", "id": "200e87de-fb60-4492-be74-65bd39889961", "metadata": {}, "source": [ "We will be using Amazon Product Reviews Dataset dataset for our testing. While inserting the data, we will partition the data by review_date as per the table definition." 
] }, { "cell_type": "code", "execution_count": null, "id": "5f84e496-150a-4481-83d1-98e86913013e", "metadata": {}, "outputs": [], "source": [ "df = spark.read.parquet(\"s3://amazon-reviews-pds/parquet/product_category=Electronics/*.parquet\")" ] }, { "cell_type": "markdown", "id": "35911704-292d-4309-9c0a-a4a887f8578f", "metadata": {}, "source": [ "**Run below cell to write data into the Iceberg table, We are writing just one partition for sake of simplicity**" ] }, { "cell_type": "code", "execution_count": null, "id": "d7d431ee-c600-4c9b-868c-b649bd0d972b", "metadata": {}, "outputs": [], "source": [ "df.sortWithinPartitions(\"review_date\").writeTo(\"dev.db.amazon_reviews_iceberg\").append()" ] }, { "cell_type": "markdown", "id": "de9689ff-33a5-4282-adfe-a2936ee4ba3c", "metadata": {}, "source": [ "**Verify data is loaded into iceberg table successfully.**" ] }, { "cell_type": "code", "execution_count": null, "id": "2d7f84de-4269-4e7f-a9e2-678628bf21b7", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"select * from dev.db.amazon_reviews_iceberg limit 1\"\"\").show()" ] }, { "cell_type": "markdown", "id": "f46d242f-98f2-4a7b-9da0-9d0e75679802", "metadata": {}, "source": [ "**Verify the new snapshot created for this table after the data insert.**" ] }, { "cell_type": "code", "execution_count": null, "id": "48b6412e-7d8a-467b-b622-3878ed55e6ba", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots\"\"\").show(truncate=False)" ] }, { "cell_type": "markdown", "id": "3fbe029d-285d-470f-9e24-a7257da5750a", "metadata": {}, "source": [ "**Verify the S3 objects related to this table is having the specified tags.** You can do the same from AWS Console or going to the AWSCLI " ] }, { "cell_type": "markdown", "id": "0978bdbd-128e-4c54-9c7e-154f7579d3c9", "metadata": {}, "source": [ "**Insert a single record into the Iceberg table.**" ] }, { "cell_type": "code", "execution_count": null, "id": "aefa67b5-1b83-4719-a896-2b05d6c7e34c", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"insert into dev.db.amazon_reviews_iceberg values (\"US\", \"99999999\",\"R2RX7KLOQQ5VBG\",\"B00000JBAT\",\"738692522\",\"Diamond Rio Digital\",3,0,0,\"N\",\"N\",\"Why just 30 minutes?\",\"RIO is really great\",date(\"2023-04-06\"),2023)\"\"\")" ] }, { "cell_type": "markdown", "id": "b40a58f3-4c96-44aa-af70-807e6e9e0a0b", "metadata": {}, "source": [ "**Check a new snapshot created after the insert. You will now see two snapshots**" ] }, { "cell_type": "code", "execution_count": null, "id": "0efc522e-83c9-4056-b5c7-e16b0e7f53ec", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots\"\"\").show()" ] }, { "cell_type": "markdown", "id": "8b88af67-c2c6-4803-9385-c086e236e90d", "metadata": {}, "source": [ "**Now check the S3 Tag populations as below for the new data file created:**" ] }, { "cell_type": "markdown", "id": "50be64e3-d17f-425a-8508-304052c32d80", "metadata": {}, "source": [ "Go to AWS CLI or Console to check the tags populated for the new writes. Let's check the tag corresponding to the object created by single row insert. You can check the S3 folde s3://your-iceberg-storage-blog/iceberg/db/amazon_reviews_iceberg/data/ and point to the partition review_date_year=2023/. Then check the parquet file under this folder to check the tags. From CLI you can run the following command to see the same. 
\\\n", "xxxxxx@3c22fb1238d8 ~ % aws s3api get-object-tagging --bucket your-iceberg-storage-blog --key iceberg/db /amazon_reviews_iceberg/data/review_date_year=2023/00000-43-2fb892e3-0a3f-4821-a356-xxxxxxxxxx-00001.parquet \\\n", "{\\\n", " \"TagSet\": [\\\n", " {\\\n", " \"Key\": \"write-tag-name\",\\\n", " \"Value\": \"created\"\\\n", " }\\\n", " ]\\\n", "}" ] }, { "cell_type": "markdown", "id": "529964b9-701f-4c6a-9337-fc5167ce63c9", "metadata": {}, "source": [ "**Now Delete sample data and expire snapshots using Iceberg’s S3 Tags feature.** The objects in S3 will have the tag delete-tag-name=deleted associated when the relevant snapshot is expired (https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots)." ] }, { "cell_type": "code", "execution_count": null, "id": "03a49c12-d38a-4eb0-9ba3-13d61314513f", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"delete from dev.db.amazon_reviews_iceberg where review_date = '2023-04-06'\"\"\")" ] }, { "cell_type": "markdown", "id": "c68a328c-28f8-4a23-b201-a219fe8c141f", "metadata": {}, "source": [ "**Check the snapshots and you will find a new snapshot with operation value as delete**" ] }, { "cell_type": "code", "execution_count": null, "id": "16967801-4b06-4648-b85c-bdd930e6d9bd", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots\"\"\").show()" ] }, { "cell_type": "markdown", "id": "29458947-8622-498d-a828-de6fb4fe9e78", "metadata": {}, "source": [ "**Expire Snapshots and keep only last two snapshots**" ] }, { "cell_type": "code", "execution_count": null, "id": "078769eb-5a1a-431a-b484-ea66dd8f0f1f", "metadata": {}, "outputs": [], "source": [ "spark.sql (\"\"\"CALL dev.system.expire_snapshots(table => 'dev.db.amazon_reviews_iceberg', older_than => DATE '2024-01-01', retain_last => 2)\"\"\")" ] }, { "cell_type": "code", "execution_count": null, "id": "9beacaf5-7f74-4829-87f5-8d22502de225", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots\"\"\").show()" ] }, { "cell_type": "markdown", "id": "e8920d6c-339d-4f0e-885d-f62c6e81dd89", "metadata": {}, "source": [ "**View existing metadata files from the metadata log entries metatable after expiration of snapshots. Do note that the snapshots which have expired will show the latest snapshot id as null.**" ] }, { "cell_type": "code", "execution_count": null, "id": "ec23473f-fe87-40bf-8ebd-0b29f10ed09d", "metadata": {}, "outputs": [], "source": [ "spark.sql(\"\"\"SELECT * FROM dev.db.amazon_reviews_iceberg.metadata_log_entries\"\"\").show()" ] }, { "cell_type": "markdown", "id": "05085532-0082-4f49-82f1-9df0a2bb8ba1", "metadata": {}, "source": [ "**Create lifecycle configuration for the bucket to transition objects having delete-tag-name=deleted S3 tag to Glacier Instant Retrieval class.** Do note that Amazon S3 runs lifecycle rules once every day at midnight Universal Coordinated Time (UTC) and new lifecycle rules can take up to 48 hours to complete the first run. Amazon S3 Glacier is well suited to archive data that needs immediate access (with milliseconds retrieval). With S3 Glacier Instant Retrieval, customers can save up to 68% on storage costs compared to using the S3 Standard-Infrequent Access (S3 Standard-IA) storage class, when the data is accessed once per quarter." 
] }, { "cell_type": "markdown", "id": "a5ebd4df-47b1-48bd-a449-54dd27707afd", "metadata": {}, "source": [ "\n", "## Cleanups\n", "\n", "After you complete the test, please follow this cleanup step to avoid any recurring costs. \\\n", "1. Delete the S3 Buckets that you created for this test \n", "2. Terminate the EMR Cluster \n", "3. Stop and Delete the EMR Notebook instance\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9e4afe06-5157-44f1-9c86-fd77df7af82b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "PySpark", "language": "python", "name": "pysparkkernel" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "pyspark", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 5 }