# **EMR Containers integration with AWS Glue**

#### **AWS Glue catalog in same account as EKS**

In the example below, a Spark application is configured to use the [AWS Glue data catalog](https://docs.aws.amazon.com/glue/latest/dg/components-overview.html) as the Hive metastore.

**gluequery.py**

```
cat > gluequery.py </trip-data.parquet/'")
spark.sql("SELECT count(*) FROM sparkemrnyc").show()
spark.stop()
EOF
```

```
LOCATION 's3:///trip-data.parquet/'
```

Configure the above property to point to the S3 location containing the data.

**Request**

```
cat > Spark-Python-in-s3-awsglue-log.json << EOF
{
  "name": "spark-python-in-s3-awsglue-log",
  "virtualClusterId": "",
  "executionRoleArn": "",
  "releaseLabel": "emr-6.2.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3:///gluequery.py",
      "sparkSubmitParameters": "--conf spark.driver.cores=3 --conf spark.executor.memory=8G --conf spark.driver.memory=6G --conf spark.executor.cores=3"
    }
  },
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs",
        "logStreamNamePrefix": "demo"
      },
      "s3MonitoringConfiguration": {
        "logUri": "s3://joblogs"
      }
    }
  }
}
EOF
aws emr-containers start-job-run --cli-input-json file://./Spark-Python-in-s3-awsglue-log.json
```

Output from driver logs, showing the number of rows:
```
+----------+
|  count(1)|
+----------+
|2716504499|
+----------+
```

#### **AWS Glue catalog in different account**

The Spark application is submitted to the EMR virtual cluster in Account A and is configured to connect to the [AWS Glue catalog in Account B](https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html).

The IAM policy attached to the job execution role (`"executionRoleArn": ""`) in Account A:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:*"
      ],
      "Resource": [
        "arn:aws:glue:::catalog",
        "arn:aws:glue:::database/default",
        "arn:aws:glue:::table/default/sparkemrnyc"
      ]
    }
  ]
}
```

The resource policy attached to the AWS Glue catalog in Account B:

```
{
  "Version" : "2012-10-17",
  "Statement" : [
    {
      "Effect" : "Allow",
      "Principal" : {
        "AWS" : ""
      },
      "Action" : "glue:*",
      "Resource" : [
        "arn:aws:glue:::catalog",
        "arn:aws:glue:::database/default",
        "arn:aws:glue:::table/default/sparkemrnyc"
      ]
    }
  ]
}
```

**Request**

```
cat > Spark-Python-in-s3-awsglue-crossaccount.json << EOF
{
  "name": "spark-python-in-s3-awsglue-crossaccount",
  "virtualClusterId": "",
  "executionRoleArn": "",
  "releaseLabel": "emr-6.2.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3:///gluequery.py",
      "sparkSubmitParameters": "--conf spark.driver.cores=5 --conf spark.executor.memory=20G --conf spark.driver.memory=15G --conf spark.executor.cores=6"
    }
  },
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.hadoop.hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
          "spark.hadoop.hive.metastore.glue.catalogid":""
        }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs",
        "logStreamNamePrefix": "demo"
      },
      "s3MonitoringConfiguration": {
        "logUri": "s3://joblogs"
      }
    }
  }
}
EOF
aws emr-containers start-job-run --cli-input-json file://./Spark-Python-in-s3-awsglue-crossaccount.json
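# Optional follow-up, a sketch that is not part of the original walkthrough:
# check the state of the submitted job run before reading the driver logs.
# VIRTUAL_CLUSTER_ID and JOB_RUN_ID are hypothetical shell variables; set them to
# your virtual cluster id and to the "id" value returned by start-job-run above.
aws emr-containers describe-job-run --virtual-cluster-id "$VIRTUAL_CLUSTER_ID" --id "$JOB_RUN_ID" --query 'jobRun.state'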
```

**Configuration of interest**

To specify the ID of the account where the AWS Glue catalog is defined, set the following property (see [Spark-Glue integration](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html)):

```
"spark.hadoop.hive.metastore.glue.catalogid":""
```

Output from driver logs, showing the number of rows:

```
+----------+
|  count(1)|
+----------+
|2716504499|
+----------+
```

#### **Sync Hudi table with AWS Glue catalog**

In this example, a Spark application is configured to use the [AWS Glue data catalog](https://docs.aws.amazon.com/glue/latest/dg/components-overview.html) as the Hive metastore.

Starting from Hudi 0.9.0, a Hudi table's latest schema can be synchronized to the Glue catalog via the Hive Metastore Service (HMS) in hive sync mode. This example runs a Hudi ETL job with EMR on EKS and interacts with the AWS Glue metastore to create a Hudi table. This gives you native, serverless capabilities for managing your technical metadata. You can also query Hudi tables in Athena straight away after the ETL job, which gives your end users easy data access and shortens the time to insight.

**HudiEMRonEKS.py**

```
cat > HudiEMRonEKS.py <