{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"[](https://github.com/aws/aws-sdk-pandas)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 39 - Athena Iceberg"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore. More in [User Guide](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html)."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Create Iceberg table"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import getpass\n",
"bucket_name = getpass.getpass()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": "True"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import awswrangler as wr\n",
"\n",
"glue_database = \"aws_sdk_pandas\"\n",
"glue_table = \"iceberg_test\"\n",
"path = f\"s3://{bucket_name}/iceberg_test/\"\n",
"temp_path = f\"s3://{bucket_name}/iceberg_test_temp/\"\n",
"\n",
"# Cleanup table before create\n",
"wr.catalog.delete_table_if_exists(database=glue_database, table=glue_table)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Create table & insert data"
]
},
{
"cell_type": "markdown",
"source": [
"It is possible to insert Pandas data frame into Iceberg table using `wr.athena.to_iceberg`. If the table does not exist, it will be created:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\"id\": [1, 2, 3], \"name\": [\"John\", \"Lily\", \"Richard\"]})\n",
"\n",
"wr.athena.to_iceberg(\n",
" df=df,\n",
" database=glue_database,\n",
" table=glue_table,\n",
" table_location=path,\n",
" temp_path=temp_path,\n",
")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Alternatively, it is also possible to insert by directly running `INSERT INTO ... VALUES`:"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'QueryExecutionId': 'e339fcd2-9db1-43ac-bb9e-9730e6395b51',\n",
" 'Query': \"INSERT INTO iceberg_test VALUES (1,'John'), (2, 'Lily'), (3, 'Richard')\",\n",
" 'StatementType': 'DML',\n",
" 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/e339fcd2-9db1-43ac-bb9e-9730e6395b51'},\n",
" 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n",
" 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n",
" 'Status': {'State': 'SUCCEEDED',\n",
" 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 8, 612000, tzinfo=tzlocal()),\n",
" 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 11, 143000, tzinfo=tzlocal())},\n",
" 'Statistics': {'EngineExecutionTimeInMillis': 2242,\n",
" 'DataScannedInBytes': 0,\n",
" 'DataManifestLocation': 's3://aws-athena-query-results-...-us-east-1/e339fcd2-9db1-43ac-bb9e-9730e6395b51-manifest.csv',\n",
" 'TotalExecutionTimeInMillis': 2531,\n",
" 'QueryQueueTimeInMillis': 241,\n",
" 'QueryPlanningTimeInMillis': 179,\n",
" 'ServiceProcessingTimeInMillis': 48,\n",
" 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n",
" 'WorkGroup': 'primary',\n",
" 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n",
" 'EffectiveEngineVersion': 'Athena engine version 3'}}"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.start_query_execution(\n",
" sql=f\"INSERT INTO {glue_table} VALUES (1,'John'), (2, 'Lily'), (3, 'Richard')\",\n",
" database=glue_database,\n",
" wait=True,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'QueryExecutionId': '922c8f02-4c00-4050-b4a7-7016809efa2b',\n",
" 'Query': \"INSERT INTO iceberg_test VALUES (4,'Anne'), (5, 'Jacob'), (6, 'Leon')\",\n",
" 'StatementType': 'DML',\n",
" 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/922c8f02-4c00-4050-b4a7-7016809efa2b'},\n",
" 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n",
" 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n",
" 'Status': {'State': 'SUCCEEDED',\n",
" 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 24, 582000, tzinfo=tzlocal()),\n",
" 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 40, 27, 352000, tzinfo=tzlocal())},\n",
" 'Statistics': {'EngineExecutionTimeInMillis': 2414,\n",
" 'DataScannedInBytes': 0,\n",
" 'DataManifestLocation': 's3://aws-athena-query-results-...-us-east-1/922c8f02-4c00-4050-b4a7-7016809efa2b-manifest.csv',\n",
" 'TotalExecutionTimeInMillis': 2770,\n",
" 'QueryQueueTimeInMillis': 329,\n",
" 'QueryPlanningTimeInMillis': 189,\n",
" 'ServiceProcessingTimeInMillis': 27,\n",
" 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n",
" 'WorkGroup': 'primary',\n",
" 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n",
" 'EffectiveEngineVersion': 'Athena engine version 3'}}"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.start_query_execution(\n",
" sql=f\"INSERT INTO {glue_table} VALUES (4,'Anne'), (5, 'Jacob'), (6, 'Leon')\",\n",
" database=glue_database,\n",
" wait=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Query\n"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" John | \n",
"
\n",
" \n",
" 1 | \n",
" 4 | \n",
" Anne | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" Lily | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" Richard | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Jacob | \n",
"
\n",
" \n",
" 5 | \n",
" 6 | \n",
" Leon | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id name\n",
"0 1 John\n",
"1 4 Anne\n",
"2 2 Lily\n",
"3 3 Richard\n",
"4 5 Jacob\n",
"5 6 Leon"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.read_sql_query(\n",
" sql=f'SELECT * FROM \"{glue_table}\"',\n",
" database=glue_database,\n",
" ctas_approach=False,\n",
" unload_approach=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Read query metadata"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"In a SELECT query, you can use the following properties after `table_name` to query Iceberg table metadata:"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"- `$files` Shows a table's current data files"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"- `$manifests` Shows a table's current file manifests"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"- `$history` Shows a table's history"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"- `$partitions` Shows a table's current partitions"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" content | \n",
" file_path | \n",
" file_format | \n",
" record_count | \n",
" file_size_in_bytes | \n",
" column_sizes | \n",
" value_counts | \n",
" null_value_counts | \n",
" nan_value_counts | \n",
" lower_bounds | \n",
" upper_bounds | \n",
" key_metadata | \n",
" split_offsets | \n",
" equality_ids | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" s3://.../iceberg_test/data/089a... | \n",
" PARQUET | \n",
" 3 | \n",
" 360 | \n",
" {1=48, 2=63} | \n",
" {1=3, 2=3} | \n",
" {1=0, 2=0} | \n",
" {} | \n",
" {1=1, 2=John} | \n",
" {1=3, 2=Richard} | \n",
" <NA> | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" s3://.../iceberg_test/data/5736... | \n",
" PARQUET | \n",
" 3 | \n",
" 355 | \n",
" {1=48, 2=61} | \n",
" {1=3, 2=3} | \n",
" {1=0, 2=0} | \n",
" {} | \n",
" {1=4, 2=Anne} | \n",
" {1=6, 2=Leon} | \n",
" <NA> | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" content file_path file_format \\\n",
"0 0 s3://.../iceberg_test/data/089a... PARQUET \n",
"1 0 s3://.../iceberg_test/data/5736... PARQUET \n",
"\n",
" record_count file_size_in_bytes column_sizes value_counts \\\n",
"0 3 360 {1=48, 2=63} {1=3, 2=3} \n",
"1 3 355 {1=48, 2=61} {1=3, 2=3} \n",
"\n",
" null_value_counts nan_value_counts lower_bounds upper_bounds \\\n",
"0 {1=0, 2=0} {} {1=1, 2=John} {1=3, 2=Richard} \n",
"1 {1=0, 2=0} {} {1=4, 2=Anne} {1=6, 2=Leon} \n",
"\n",
" key_metadata split_offsets equality_ids \n",
"0 NaN NaN \n",
"1 NaN NaN "
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.read_sql_query(\n",
" sql=f'SELECT * FROM \"{glue_table}$files\"',\n",
" database=glue_database,\n",
" ctas_approach=False,\n",
" unload_approach=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" path | \n",
" length | \n",
" partition_spec_id | \n",
" added_snapshot_id | \n",
" added_data_files_count | \n",
" added_rows_count | \n",
" existing_data_files_count | \n",
" existing_rows_count | \n",
" deleted_data_files_count | \n",
" deleted_rows_count | \n",
" partitions | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" s3://.../iceberg_test/metadata/... | \n",
" 6538 | \n",
" 0 | \n",
" 4379263637983206651 | \n",
" 1 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" [] | \n",
"
\n",
" \n",
" 1 | \n",
" s3://.../iceberg_test/metadata/... | \n",
" 6548 | \n",
" 0 | \n",
" 2934717851675145063 | \n",
" 1 | \n",
" 3 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" [] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" path length \\\n",
"0 s3://.../iceberg_test/metadata/... 6538 \n",
"1 s3://.../iceberg_test/metadata/... 6548 \n",
"\n",
" partition_spec_id added_snapshot_id added_data_files_count \\\n",
"0 0 4379263637983206651 1 \n",
"1 0 2934717851675145063 1 \n",
"\n",
" added_rows_count existing_data_files_count existing_rows_count \\\n",
"0 3 0 0 \n",
"1 3 0 0 \n",
"\n",
" deleted_data_files_count deleted_rows_count partitions \n",
"0 0 0 [] \n",
"1 0 0 [] "
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.read_sql_query(\n",
" sql=f'SELECT * FROM \"{glue_table}$manifests\"',\n",
" database=glue_database,\n",
" ctas_approach=False,\n",
" unload_approach=False,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" made_current_at | \n",
" snapshot_id | \n",
" parent_id | \n",
" is_current_ancestor | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2023-03-16 09:40:10.438000+00:00 | \n",
" 2934717851675145063 | \n",
" <NA> | \n",
" True | \n",
"
\n",
" \n",
" 1 | \n",
" 2023-03-16 09:40:26.754000+00:00 | \n",
" 4379263637983206651 | \n",
" 2934717851675144704 | \n",
" True | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" made_current_at snapshot_id parent_id \\\n",
"0 2023-03-16 09:40:10.438000+00:00 2934717851675145063 \n",
"1 2023-03-16 09:40:26.754000+00:00 4379263637983206651 2934717851675144704 \n",
"\n",
" is_current_ancestor \n",
"0 True \n",
"1 True "
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = wr.athena.read_sql_query(\n",
" sql=f'SELECT * FROM \"{glue_table}$history\"',\n",
" database=glue_database,\n",
" ctas_approach=False,\n",
" unload_approach=False,\n",
")\n",
"\n",
"# Save snapshot id\n",
"snapshot_id = df.snapshot_id[0]\n",
"\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" record_count | \n",
" file_count | \n",
" total_size | \n",
" data | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 6 | \n",
" 2 | \n",
" 715 | \n",
" {id={min=1, max=6, null_count=0, nan_count=nul... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" record_count file_count total_size \\\n",
"0 6 2 715 \n",
"\n",
" data \n",
"0 {id={min=1, max=6, null_count=0, nan_count=nul... "
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.read_sql_query(\n",
" sql=f'SELECT * FROM \"{glue_table}$partitions\"',\n",
" database=glue_database,\n",
" ctas_approach=False,\n",
" unload_approach=False,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Time travel"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" John | \n",
"
\n",
" \n",
" 1 | \n",
" 4 | \n",
" Anne | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" Lily | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" Richard | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Jacob | \n",
"
\n",
" \n",
" 5 | \n",
" 6 | \n",
" Leon | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id name\n",
"0 1 John\n",
"1 4 Anne\n",
"2 2 Lily\n",
"3 3 Richard\n",
"4 5 Jacob\n",
"5 6 Leon"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.read_sql_query(\n",
" sql=f\"SELECT * FROM {glue_table} FOR TIMESTAMP AS OF (current_timestamp - interval '5' second)\",\n",
" database=glue_database,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Version travel\n"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" John | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Lily | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Richard | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id name\n",
"0 1 John\n",
"1 2 Lily\n",
"2 3 Richard"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.read_sql_query(\n",
" sql=f\"SELECT * FROM {glue_table} FOR VERSION AS OF {snapshot_id}\",\n",
" database=glue_database,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Optimize"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The `OPTIMIZE table REWRITE DATA` compaction action rewrites data files into a more optimized layout based on their size and number of associated delete files. For syntax and table property details, see [OPTIMIZE](https://docs.aws.amazon.com/athena/latest/ug/optimize-statement.html).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'QueryExecutionId': '94666790-03ae-42d7-850a-fae99fa79a68',\n",
" 'Query': 'OPTIMIZE iceberg_test REWRITE DATA USING BIN_PACK',\n",
" 'StatementType': 'DDL',\n",
" 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/tables/94666790-03ae-42d7-850a-fae99fa79a68'},\n",
" 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n",
" 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n",
" 'Status': {'State': 'SUCCEEDED',\n",
" 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 49, 42, 857000, tzinfo=tzlocal()),\n",
" 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 49, 45, 655000, tzinfo=tzlocal())},\n",
" 'Statistics': {'EngineExecutionTimeInMillis': 2622,\n",
" 'DataScannedInBytes': 220,\n",
" 'DataManifestLocation': 's3://aws-athena-query-results-...-us-east-1/tables/94666790-03ae-42d7-850a-fae99fa79a68-manifest.csv',\n",
" 'TotalExecutionTimeInMillis': 2798,\n",
" 'QueryQueueTimeInMillis': 124,\n",
" 'QueryPlanningTimeInMillis': 252,\n",
" 'ServiceProcessingTimeInMillis': 52,\n",
" 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n",
" 'WorkGroup': 'primary',\n",
" 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n",
" 'EffectiveEngineVersion': 'Athena engine version 3'}}"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.start_query_execution(\n",
" sql=f\"OPTIMIZE {glue_table} REWRITE DATA USING BIN_PACK\",\n",
" database=glue_database,\n",
" wait=True,\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Vacuum\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"`VACUUM` performs [snapshot expiration](https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots) and [orphan file removal](https://iceberg.apache.org/docs/latest/spark-procedures/#remove_orphan_files). These actions reduce metadata size and remove files not in the current table state that are also older than the retention period specified for the table. For syntax details, see [VACUUM](https://docs.aws.amazon.com/athena/latest/ug/vacuum-statement.html)."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'QueryExecutionId': '717a7de6-b873-49c7-b744-1b0b402f24c9',\n",
" 'Query': 'VACUUM iceberg_test',\n",
" 'StatementType': 'DML',\n",
" 'ResultConfiguration': {'OutputLocation': 's3://aws-athena-query-results-...-us-east-1/717a7de6-b873-49c7-b744-1b0b402f24c9.csv'},\n",
" 'ResultReuseConfiguration': {'ResultReuseByAgeConfiguration': {'Enabled': False}},\n",
" 'QueryExecutionContext': {'Database': 'aws_sdk_pandas'},\n",
" 'Status': {'State': 'SUCCEEDED',\n",
" 'SubmissionDateTime': datetime.datetime(2023, 3, 16, 10, 50, 41, 14000, tzinfo=tzlocal()),\n",
" 'CompletionDateTime': datetime.datetime(2023, 3, 16, 10, 50, 43, 441000, tzinfo=tzlocal())},\n",
" 'Statistics': {'EngineExecutionTimeInMillis': 2229,\n",
" 'DataScannedInBytes': 0,\n",
" 'TotalExecutionTimeInMillis': 2427,\n",
" 'QueryQueueTimeInMillis': 153,\n",
" 'QueryPlanningTimeInMillis': 30,\n",
" 'ServiceProcessingTimeInMillis': 45,\n",
" 'ResultReuseInformation': {'ReusedPreviousResult': False}},\n",
" 'WorkGroup': 'primary',\n",
" 'EngineVersion': {'SelectedEngineVersion': 'Athena engine version 3',\n",
" 'EffectiveEngineVersion': 'Athena engine version 3'}}"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.athena.start_query_execution(\n",
" sql=f\"VACUUM {glue_table}\",\n",
" database=glue_database,\n",
" wait=True,\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.14",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.14"
}
},
"nbformat": 4,
"nbformat_minor": 4
}