# Migration between the Hive Metastore and the AWS Glue Data Catalog

Note: This is a sample script, not officially supported by AWS. We recommend that you test and modify it for your data and use case.

### Introduction

The provided scripts migrate metadata between the Hive metastore and the AWS Glue Data Catalog. The following scenarios are supported.

#### Hive Metastore to an AWS Glue Data Catalog

- **Direct Migration**: Set up an AWS Glue ETL job that extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method requires an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metadata from the Hive metastore and write it to the AWS Glue Data Catalog.

- **Migration using Amazon S3 Objects**: Two ETL jobs are used. The first job extracts your database, table, and partition metadata from your Hive metastore into Amazon S3. It can be run either as an AWS Glue job or on a cluster with Spark installed. The second is an AWS Glue job that loads the metadata from S3 into the AWS Glue Data Catalog.

#### AWS Glue Data Catalog to Hive Metastore

- **Direct Migration**: An ETL job extracts metadata from specified databases in the AWS Glue Data Catalog and loads it into a Hive metastore. This job is run by AWS Glue and requires an AWS Glue connection to the Hive metastore as a JDBC source.

- **Migration using Amazon S3 Objects**: Two ETL jobs are used. The first is an AWS Glue job that extracts metadata from specified databases in the AWS Glue Data Catalog and writes it as S3 objects. The second job loads the S3 objects into a Hive metastore; it can be run either as an AWS Glue job or on a cluster with Spark installed.

#### AWS Glue Data Catalog to another AWS Glue Data Catalog

- **Migration using Amazon S3 Objects**: Two AWS Glue ETL jobs are run. The first extracts metadata from specified databases in an AWS Glue Data Catalog and loads it into S3. The second loads the data from S3 into another AWS Glue Data Catalog. This is the only way to migrate between Data Catalogs in different accounts. It combines the workflow for AWS Glue to Hive using S3 with the workflow for Hive to AWS Glue using S3.

#### Limitations

- Only databases, tables, and partitions can be migrated. Other entities such as column statistics, privileges, roles, functions, and transactions cannot be migrated.
- The script only supports a Hive metastore stored in a MySQL-compatible JDBC source. Other Hive metastore database engines, such as PostgreSQL, are not supported yet.
- There is no isolation guarantee: if Hive makes concurrent modifications to the metastore while the migration job is running, inconsistent metadata can be introduced into the AWS Glue Data Catalog.
- There is no streaming support. Hive metastore migration is done as a batch job.
- If there is a naming conflict with existing objects in the target Data Catalog, the existing data is overwritten by the new data.
- Your Hive metastore must reside in a MySQL database accessible to AWS Glue. Currently, AWS Glue can connect to JDBC data sources in a VPC subnet, such as RDS, an EMR local Hive metastore, or a self-managed database on EC2. If your Hive metastore is not directly accessible to AWS Glue, you must use Amazon S3 as an intermediate staging area for migration.

## Instructions

Below are instructions for using each of the migration workflows described above.

#### Migrate Directly from Hive to AWS Glue
1. Set up AWS Glue as described in the following steps:

   - [Set up IAM permissions for AWS Glue](http://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html)
   - [Set up VPC endpoints for Amazon S3](http://docs.aws.amazon.com/glue/latest/dg/vpc-endpoints-s3.html)
   - [Set up a VPC to connect to JDBC data stores](http://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html). If the Hive metastore is in a local database on the EMR master instance, just configure the EMR VPC subnet and EMR master security group, similar to the RDS configuration.
   - [Set up DNS in your VPC](http://docs.aws.amazon.com/glue/latest/dg/set-up-vpc-dns.html)

2. Gather your Hive metastore JDBC connection information. You can find the Hive metastore JDBC URL, username, and password in the file named `hive-site.xml`. In EMR, this file is located at `/etc/hive/conf/hive-site.xml`. For example:

   ```xml
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://ip-10-0-12-34.ec2.internal:3306/hive?createDatabaseIfNotExist=true</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>hive</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionPassword</name>
     <value>MySecretPassword</value>
   </property>
   ```

3. On the AWS Glue console, create a connection to the Hive metastore as a JDBC data source. Follow the [instructions in the Developer Guide](http://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html) to create a connection that references your Hive metastore. Use the JDBC URL, username, and password you gathered in the previous step. Specify the VPC, subnet, and security group associated with your Hive metastore. You can find these on the EMR console if the Hive metastore is on the EMR master node, or on the RDS console if the metastore is an RDS instance.

4. (Recommended) Test the connection. Choose "Test connection" on the AWS Glue connections page. If the connection test fails, your ETL job might not be able to connect to your Hive metastore data source. Fix the network configuration before moving on.

5. Upload these ETL job scripts to an S3 bucket:

   - `import_into_datacatalog.py`
   - `hive_metastore_migration.py`

   If you configured AWS Glue to access S3 from a VPC endpoint, you must upload the scripts to a bucket in the same AWS region where your job runs.

6. Create a job on the AWS Glue console to extract metadata from your Hive metastore and migrate it to the AWS Glue Data Catalog. Define the job with **An existing script that you provide** and the following properties:

   - **Script path where the script is stored** - S3 path to `import_into_datacatalog.py`
   - **Python library path** - S3 path to `hive_metastore_migration.py`

   Also add the following job parameters. These parameters are defined in the source code of `import_into_datacatalog.py`.

   - `--mode` set to `from-jdbc`, which means the migration is from a JDBC source into an AWS Glue Data Catalog.
   - `--connection-name` set to the name of the AWS Glue connection you created to point to the Hive metastore. It is used to extract the Hive JDBC connection information using the native Spark library.
   - `--region` the AWS region for the Glue Data Catalog, for example, `us-east-1`. You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region. If not provided, `us-east-1` is used as the default.
   - `--database-prefix` (optional) set to a string prefix that is applied to the database names created in the AWS Glue Data Catalog. You can use it as a way to track the origin of the metadata and avoid naming conflicts. The default is the empty string.
   - `--table-prefix` (optional) set to a string prefix that is applied to the table names created in the AWS Glue Data Catalog. Again, you can use it as a way to track the origin of the metadata and avoid naming conflicts. The default is the empty string.

   On the Connections page, add the Hive metastore connection you created. The same connection must be specified both here and in the `--connection-name` argument. Finally, review the job and create it. (If you prefer to create the job programmatically, see the boto3 sketch after these steps.)

7. Run the job on demand from the AWS Glue console. When the job is finished, the metadata from the Hive metastore is visible in the AWS Glue console. Check the databases and tables listed to verify that they were migrated correctly.
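The console steps above can also be scripted with the AWS SDK for Python (boto3). The following is a minimal sketch of the job defined in step 6; the job name, script bucket, IAM role, and connection name are placeholders, not values from this repository:

```python
import boto3

glue = boto3.client('glue', region_name='us-east-1')

# All names below are hypothetical -- substitute your own script bucket,
# Glue service IAM role, and the connection created in step 3.
glue.create_job(
    Name='hive-metastore-to-datacatalog',
    Role='MyGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-script-bucket/import_into_datacatalog.py',
    },
    Connections={'Connections': ['my-hive-metastore-connection']},
    DefaultArguments={
        # --extra-py-files plays the role of "Python library path" in the console.
        '--extra-py-files': 's3://my-script-bucket/hive_metastore_migration.py',
        '--mode': 'from-jdbc',
        '--connection-name': 'my-hive-metastore-connection',
        '--region': 'us-east-1',
    },
)

# Run the job on demand, as in step 7. Optional arguments such as
# --database-prefix and --table-prefix can be added to DefaultArguments above.
glue.start_job_run(JobName='hive-metastore-to-datacatalog')
```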
#### Migrate from Hive to AWS Glue using Amazon S3 Objects

If your Hive metastore cannot connect to the AWS Glue Data Catalog directly (for example, if it's on a private corporate network), you can use AWS Direct Connect to establish private connectivity to an AWS VPC, or use AWS Database Migration Service to migrate your Hive metastore to a database on AWS.

If the above solutions don't apply to your situation, you can choose to first migrate your Hive metastore to Amazon S3 objects as a staging area, then run an ETL job to import the metadata from S3 into the AWS Glue Data Catalog. To do this, you need a Spark 2.1.x cluster that can connect to your Hive metastore and export the metadata to plain files on S3. The Hive metastore to S3 migration can also run as an AWS Glue ETL job, if AWS Glue can directly connect to your Hive metastore.

1. Make the MySQL connector jar available to the Spark cluster on the master and all worker nodes. Include the jar in the Spark driver class path as well as in the `--jars` and `--driver-class-path` parameters of the `spark-submit` command. You can download the MySQL connector [here at MySQL.com](https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.42.tar.gz). If you use EMR, you can run the bootstrap script `emr_bootstrap_action.sh` included in the `shell` folder, and then provide `--jars /usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar` and `--driver-class-path :/usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar` in the spark-submit script. See the [EMR documentation](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html) for how to run a bootstrap script on EMR.

2. Submit the `hive_metastore_migration.py` Spark script to your Spark cluster with the following parameters:

   - Set `--mode` to `from-metastore`, or omit the argument, since `from-metastore` is the default.
   - Provide the JDBC connection information through these arguments: `--jdbc-url`, `--jdbc-username`, and `--jdbc-password`.
   - The argument `--output-path` is required. It is either a local file system location or an S3 location. If the output path is a local directory, you can upload the data to an S3 location manually. If it is an S3 path, you need to make sure that the Spark cluster has the EMRFS library in its class path. The script exports the metadata to a subdirectory of the output path you provided.
   - `--database-prefix` and `--table-prefix` (optional) set a string prefix that is applied to the database and table names. They are empty by default.
   Example spark-submit command to migrate the Hive metastore to S3, tested on EMR-4.7.1:

   ```bash
   MYSQL_JAR_PATH=/usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar
   DRIVER_CLASSPATH=/home/hadoop/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:$MYSQL_JAR_PATH
   spark-submit --driver-class-path $DRIVER_CLASSPATH \
     --jars $MYSQL_JAR_PATH \
     /home/hadoop/hive_metastore_migration.py \
     --mode from-metastore \
     --jdbc-url jdbc:mysql://metastore.foo.us-east-1.rds.amazonaws.com:3306 \
     --jdbc-username hive \
     --jdbc-password myJDBCPassword \
     --database-prefix myHiveMetastore_ \
     --table-prefix myHiveMetastore_ \
     --output-path s3://mybucket/myfolder/
   ```

   If the job finishes successfully, it creates three sub-folders in the S3 output path you specified, named `databases`, `tables`, and `partitions`. These paths are used in the next step.

3. Upload these ETL job scripts to an S3 bucket:

   - `import_into_datacatalog.py`
   - `hive_metastore_migration.py`

4. Create a job on the AWS Glue console to load the metadata exported to S3 into the AWS Glue Data Catalog. Define the job with **An existing script that you provide** and the following properties:

   - **Script path where the script is stored** - S3 path to `import_into_datacatalog.py`
   - **Python library path** - S3 path to `hive_metastore_migration.py`

   Add the following parameters:

   - `--mode` set to `from-s3`
   - `--region` the AWS region for the Glue Data Catalog, for example, `us-east-1`. You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region. If not provided, `us-east-1` is used as the default.
   - `--database-input-path` set to the S3 path containing only databases. For example: `s3://someBucket/output_path_from_previous_job/databases`
   - `--table-input-path` set to the S3 path containing only tables. For example: `s3://someBucket/output_path_from_previous_job/tables`
   - `--partition-input-path` set to the S3 path containing only partitions. For example: `s3://someBucket/output_path_from_previous_job/partitions`

   Also, because there is no need to connect to any JDBC source, the job doesn't require any connections.

Finally, if you don't have access to Spark from the on-premises network where your Hive metastore resides, you may implement your own export to S3 using any script and platform, as long as the intermediate data on S3 conforms to the pre-defined format and schema. The migration script requires a separate S3 folder for each entity type: database, table, and partition. Each folder contains one or more files, and each line of a file is a single JSON object. No comma or other separation character can appear at the end of a line. Each JSON object has a field `type` with a value of `database`, `table`, or `partition`, and a field `item` that contains the metadata payload. A table entity also contains a `database` field to represent the database it belongs to, and a partition entity contains both a `database` and a `table` field. The exact schemas of the entities are defined in the `hive_metastore_migration.py` script.

Below is a prettified example of a table entity; in practice, it must all appear on a single line with the extra whitespace removed:

    {
      "type" : "table",
      "database" : "my_metastore_default",
      "item" : {
        "createTime" : "Apr 04, 2017 08:58:43 AM",
        "lastAccessTime" : "Jan 01, 1970 00:00:00 AM",
        "owner" : "hadoop",
        "retention" : 0,
        "name" : "my_metastore_tbl_mytable",
        "tableType" : "EXTERNAL_TABLE",
        "parameters" : {
          "EXTERNAL" : "TRUE",
          "transient_lastDdlTime" : "1491296323"
        },
        "partitionKeys" : [
          { "name" : "key1", "type" : "int" },
          { "name" : "key2", "type" : "string" },
          { "name" : "key3", "type" : "double" }
        ],
        "storageDescriptor" : {
          "inputFormat" : "org.apache.hadoop.mapred.TextInputFormat",
          "compressed" : false,
          "storedAsSubDirectories" : false,
          "location" : "s3://mydefaultgluetest/test3keys",
          "numberOfBuckets" : -1,
          "outputFormat" : "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
          "columns" : [
            { "name" : "col1", "type" : "int" }
          ],
          "serdeInfo" : {
            "serializationLibrary" : "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            "parameters" : { "serialization.format" : "1" }
          }
        }
      }
    }
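If you do write your own exporter, the following Python sketch illustrates the staging layout only: one folder per entity type, one JSON object per line, no trailing separators. The entity payloads shown are truncated placeholders, not complete schemas; consult `hive_metastore_migration.py` for the exact field definitions.

```python
import json
import os

# Hypothetical, truncated entities -- real items must follow the schemas
# defined in hive_metastore_migration.py.
entities = {
    'databases': [
        {'type': 'database', 'item': {'name': 'my_metastore_default'}},
    ],
    'tables': [
        {'type': 'table', 'database': 'my_metastore_default',
         'item': {'name': 'mytable', 'tableType': 'EXTERNAL_TABLE'}},
    ],
    'partitions': [
        {'type': 'partition', 'database': 'my_metastore_default',
         'table': 'mytable', 'item': {'values': ['1']}},
    ],
}

for folder, records in entities.items():
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, 'part-00000'), 'w') as f:
        for record in records:
            # json.dumps emits a single line: exactly one entity per line.
            f.write(json.dumps(record) + '\n')
```

After generating the files, upload the `databases`, `tables`, and `partitions` folders to S3 and point the `--database-input-path`, `--table-input-path`, and `--partition-input-path` parameters from step 4 at them.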
#### Migrate Directly from AWS Glue to Hive

1. Follow steps 1 through 4 in the instructions for **Migrate Directly from Hive to AWS Glue**.

2. Upload the following job scripts to an S3 bucket:

   - `export_from_datacatalog.py`
   - `hive_metastore_migration.py`

   If you configured AWS Glue to access S3 from a VPC endpoint, you must upload the scripts to a bucket in the same region where your job runs. If you configured AWS Glue to access the public internet through a NAT gateway or NAT instance, cross-region S3 access is allowed.

3. Create a job on the AWS Glue console to extract metadata from the AWS Glue Data Catalog and load it into the JDBC Hive metastore. Define the job with **An existing script that you provide** and the following properties:

   - **Script path where the script is stored** - S3 path to `export_from_datacatalog.py`
   - **Python library path** - S3 path to `hive_metastore_migration.py`

   Add the following parameters:

   - `--mode` set to `to-jdbc`, which means the migration is directly into a JDBC Hive metastore.
   - `--connection-name` set to the name of the AWS Glue connection you created to point to the Hive metastore. It is the destination of the migration.
   - `--region` the AWS region for the Glue Data Catalog, for example, `us-east-1`. You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region. If not provided, `us-east-1` is used as the default.
   - `--database-names` set to a semicolon (;) separated list of database names to export from the Data Catalog.

   On the Connections page, add the Hive metastore connection you created.

4. Run the job on demand from the AWS Glue console. When the job is finished, run Hive queries on a cluster connected to the metastore to verify that the metadata was successfully migrated.

#### Migrate from AWS Glue to Hive through Amazon S3 Objects

1. Follow step 1 in **Migrate from Hive to AWS Glue using Amazon S3 Objects**.

2. Create an AWS Glue ETL job similar to the one described in the direct migration instructions above. Since the destination is now an S3 bucket instead of a Hive metastore, no connections are required. In the job, add the following parameters:

   - `--mode` set to `to-s3`, which means the migration is to S3.
   - `--region` the AWS region for the Glue Data Catalog, for example, `us-east-1`. You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region. If not provided, `us-east-1` is used as the default.
   - `--database-names` set to a semicolon (;) separated list of database names to export from the Data Catalog.
   - `--output-path` set to the S3 destination path.

3. Submit the `hive_metastore_migration.py` Spark script to your Spark cluster:

   - Set `--mode` to `to-metastore`.
   - Provide the JDBC connection information through these arguments: `--jdbc-url`, `--jdbc-username`, and `--jdbc-password`.
   - The argument `--input-path` is required. It can be a local directory or an S3 path, and must point to the directory containing the `databases`, `partitions`, and `tables` folders. This is the `--output-path` of the job in step 2, with date and time information appended at the end. For example, if `--output-path` is:

         s3://gluemigrationbucket/export_output/

     then the AWS Glue job in step 2 creates the `databases`, `partitions`, and `tables` folders in:

         s3://gluemigrationbucket/export_output/<year-month-day-hour-minute-seconds>/

     and `--input-path` is specified as:

         s3://gluemigrationbucket/export_output/<year-month-day-hour-minute-seconds>/

   (A small sketch for locating this timestamped folder programmatically follows these steps.)
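Because the export job appends date and time information to the output path, you may want to discover the exact `--input-path` programmatically. Below is a minimal boto3 sketch, assuming the hypothetical bucket and prefix from the example above:

```python
import boto3

s3 = boto3.client('s3')

# Hypothetical bucket and prefix from the example above.
bucket = 'gluemigrationbucket'
prefix = 'export_output/'

# List the timestamped sub-folders directly under the export prefix.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter='/')
folders = [p['Prefix'] for p in response.get('CommonPrefixes', [])]

# The folder names embed the export date and time, so with a sortable
# timestamp format the lexicographically greatest prefix is the most recent.
input_path = 's3://{}/{}'.format(bucket, max(folders))
print(input_path)  # pass this value as --input-path in step 3
```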
#### AWS Glue Data Catalog to another AWS Glue Data Catalog

You can migrate (copy) metadata from the Data Catalog in one account to another. The steps are:

1. Enable cross-account access for an S3 bucket so that both the source and target accounts can access it. See [the Amazon S3 documentation](http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html#example-bucket-policies-use-case-1) for S3 cross-account access configuration. (A sketch of a bucket policy granting this access follows these steps.)

2. Upload the following scripts to an S3 bucket accessible from the source AWS account:

   - `export_from_datacatalog.py`
   - `hive_metastore_migration.py`

3. Upload the following scripts to an S3 bucket accessible from the target AWS account to be updated:

   - `import_into_datacatalog.py`
   - `hive_metastore_migration.py`

4. In the source AWS account, create a job on the AWS Glue console to extract metadata from the AWS Glue Data Catalog to S3. Define the job with **An existing script that you provide** and the following properties:

   - **Script path where the script is stored** - S3 path to `export_from_datacatalog.py`
   - **Python library path** - S3 path to `hive_metastore_migration.py`

   Add the following parameters:

   - `--mode` set to `to-s3`, which means the migration is to S3.
   - `--region` the AWS region for the Glue Data Catalog, for example, `us-east-1`. You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region. If not provided, `us-east-1` is used as the default.
   - `--database-names` set to a semicolon (;) separated list of database names to export from the Data Catalog.
   - `--output-path` set to the S3 destination path that you configured with **cross-account access**.

   The job creates `databases`, `partitions`, and `tables` folders in the S3 output folder you chose.

5. In the target AWS account, create a job on the AWS Glue console to import the metadata into the target AWS Glue Data Catalog. Define the job with **An existing script that you provide** and the following properties:

   - **Script path where the script is stored** - S3 path to `import_into_datacatalog.py`
   - **Python library path** - S3 path to `hive_metastore_migration.py`

   Add the following parameters:

   - `--mode` set to `from-s3`
   - `--region` the AWS region for the Glue Data Catalog, for example, `us-east-1`. You can find a list of Glue supported regions here: http://docs.aws.amazon.com/general/latest/gr/rande.html#glue_region. If not provided, `us-east-1` is used as the default.
   - `--database-input-path` set to the S3 path containing only databases.
   - `--table-input-path` set to the S3 path containing only tables.
   - `--partition-input-path` set to the S3 path containing only partitions.

6. (Optional) Manually delete the temporary files generated in the S3 folder. Also, remember to revoke the cross-account access if it's no longer needed.
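As a reference for step 1, below is one possible shape of the cross-account bucket policy, applied from the source (bucket-owning) account with boto3. The bucket name and target account ID are placeholders; see the S3 documentation linked above for the authoritative configuration. Note that the Glue job's IAM role in the target account must also allow S3 access to this bucket.

```python
import json
import boto3

# Hypothetical staging bucket and target account ID -- replace with your own.
bucket = 'gluemigrationbucket'
target_account = '123456789012'

# Grant the target account read and list access to the staging bucket.
policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Sid': 'AllowTargetAccountReadAccess',
        'Effect': 'Allow',
        'Principal': {'AWS': 'arn:aws:iam::{}:root'.format(target_account)},
        'Action': ['s3:GetObject', 's3:ListBucket'],
        'Resource': [
            'arn:aws:s3:::{}'.format(bucket),
            'arn:aws:s3:::{}/*'.format(bucket),
        ],
    }],
}

boto3.client('s3').put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```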