# AWS Glue CDK baseline template

This is a baseline template for AWS CDK development with AWS Glue. It is built with [AWS CDK v2](https://docs.aws.amazon.com/cdk/v2/guide/home.html) and [AWS CDK Pipelines](https://docs.aws.amazon.com/cdk/v2/guide/cdk_pipeline.html).

Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this template, we assume the following three accounts:

* Pipeline account – This hosts the end-to-end pipeline
* Dev account – This hosts the data integration pipeline in the development environment
* Prod account – This hosts the data integration pipeline in the production environment

If you want, you can use the same account and the same Region for all three.

To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared this baseline template, [`aws-glue-cdk-baseline`](https://github.com/aws-samples/aws-glue-cdk-baseline). The template provisions two kinds of stacks:

* AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
* Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account

The AWS Glue app stack provisions the data integration pipeline, including the following resources:

* AWS Glue jobs
* AWS Glue job scripts

At the time of publishing this template, the AWS CDK has two versions of the AWS Glue module: [@aws-cdk/aws-glue](https://docs.aws.amazon.com/cdk/api/v1/docs/aws-glue-readme.html) and [@aws-cdk/aws-glue-alpha](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-glue-alpha-readme.html), containing [L1 constructs](https://docs.aws.amazon.com/cdk/v2/guide/constructs.html#constructs_l1_using) and [L2 constructs](https://docs.aws.amazon.com/cdk/v2/guide/constructs.html#constructs_using), respectively. The sample AWS Glue app stack is defined using [aws-glue-alpha](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-glue-alpha-readme.html), [the L2 construct](https://docs.aws.amazon.com/cdk/v2/guide/constructs.html#constructs_using) for AWS Glue, because it's straightforward to define and manage AWS Glue resources with it. If you want to use the L1 construct, refer to [Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines](https://aws.amazon.com/blogs/big-data/build-test-and-deploy-etl-solutions-using-aws-glue-and-aws-cdk-based-ci-cd-pipelines/).

The pipeline stack provisions the entire CI/CD pipeline, including the following resources:

* AWS IAM roles
* Amazon S3 bucket
* AWS CodeCommit
* AWS CodePipeline
* AWS CodeBuild

A minimal sketch of how such a pipeline can be wired together with CDK Pipelines appears at the end of this overview.

Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes to the AWS Glue app stack and re-provision the stack to reflect your changes. This is done by committing your changes in the AWS CDK template to the CodeCommit repository; CodePipeline then reflects the changes on AWS resources using [AWS CloudFormation change sets](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-changesets.html).

In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.
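Before diving into the setup, here is the sketch referenced above of how a pipeline stack like the one described in this overview could be defined with CDK Pipelines. It is a minimal illustration, not the template's exact code: the `GlueAppStage` class, the branch name, and the account IDs are placeholder assumptions; see [`aws_glue_cdk_baseline/pipeline_stack.py`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/aws_glue_cdk_baseline/pipeline_stack.py) for the actual definition.

```python
# Minimal sketch, assuming a Stage named GlueAppStage that wraps the AWS Glue app stack.
# Account IDs and the branch name are placeholders; the real definition lives in
# aws_glue_cdk_baseline/pipeline_stack.py.
from constructs import Construct
from aws_cdk import Environment, Stack, Stage
from aws_cdk import aws_codecommit as codecommit
from aws_cdk import pipelines


class GlueAppStage(Stage):
    """Deployment stage that would instantiate the AWS Glue app stack."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # GlueAppStack(self, "GlueAppStack")  # defined in glue_app_stack.py


class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Git repository that stores the CDK app and the AWS Glue job scripts
        repo = codecommit.Repository(
            self, "Repository", repository_name="aws-glue-cdk-baseline"
        )

        # CI/CD pipeline that synthesizes the CDK app on every commit
        pipeline = pipelines.CodePipeline(
            self, "GluePipeline",
            synth=pipelines.ShellStep(
                "Synth",
                input=pipelines.CodePipelineSource.code_commit(repo, "main"),
                commands=[
                    "pip install -r requirements.txt",
                    "npx cdk synth",
                ],
            ),
        )

        # Deploy the Glue app stack to dev automatically, then to prod behind a
        # manual approval step
        pipeline.add_stage(GlueAppStage(
            self, "DeployDev",
            env=Environment(account="123456789102", region="us-east-1"),
        ))
        pipeline.add_stage(
            GlueAppStage(
                self, "DeployProd",
                env=Environment(account="123456789103", region="us-east-1"),
            ),
            pre=[pipelines.ManualApprovalStep("Review")],
        )
```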
### Prerequisites

* Python 3.9 or later
* AWS accounts for the pipeline account, dev account, and prod account
* An [AWS named profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) for the pipeline account, dev account, and prod account
* The [AWS CDK Toolkit (`cdk` command)](https://docs.aws.amazon.com/cdk/v2/guide/cli.html) 2.87.0 or later
* Docker
* [Visual Studio Code](https://code.visualstudio.com/)
* [Visual Studio Code Dev Containers](https://code.visualstudio.com/docs/remote/containers)

### Initialize the project

To initialize the project, complete the following steps:

1. Clone [the baseline template](https://github.com/aws-samples/aws-glue-cdk-baseline) to your workspace:

```
$ git clone git@github.com:aws-samples/aws-glue-cdk-baseline.git
$ cd aws-glue-cdk-baseline
```

2. Create a Python [virtual environment](https://docs.python.org/3/library/venv.html) specific to the project on the client machine:

```
$ python3 -m venv .venv
```

We use a virtual environment in order to isolate the Python environment for this project and not install software globally.

3. Activate the virtual environment according to your OS:

* On macOS and Linux, use the following code:

```
$ source .venv/bin/activate
```

* On a Windows platform, use the following code:

```
% .venv\Scripts\activate.bat
```

After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.

4. Install the required dependencies described in [requirements.txt](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/requirements.txt) to the virtual environment:

```
$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt
```

5. Edit the configuration file `default-config.yaml` based on your environments (replace each account ID with your own). A sketch of how these values are consumed by the CDK app follows the bootstrap steps below.

```yaml
pipelineAccount:
  awsAccountId: 123456789101
  awsRegion: us-east-1

devAccount:
  awsAccountId: 123456789102
  awsRegion: us-east-1

prodAccount:
  awsAccountId: 123456789103
  awsRegion: us-east-1
```

6. Initialize the snapshot test files for `pytest` by running the following command:

```
$ python3 -m pytest --snapshot-update
```

### Bootstrap your AWS environments

Run the following commands to bootstrap your AWS environments:

1. In the pipeline account, replace `PIPELINE-ACCOUNT-NUMBER`, `REGION`, and `PIPELINE-PROFILE` with your own values:

```
$ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess
```

2. In the dev account, replace `DEV-ACCOUNT-NUMBER`, `REGION`, `DEV-PROFILE`, and `PIPELINE-ACCOUNT-NUMBER` with your own values:

```
$ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>
```

3. In the prod account, replace `PROD-ACCOUNT-NUMBER`, `REGION`, `PROD-PROFILE`, and `PIPELINE-ACCOUNT-NUMBER` with your own values:

```
$ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>
```

When you use only one account for all environments, you only need to run the `cdk bootstrap` command once.
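The account IDs and Regions you set in `default-config.yaml` determine where the pipeline stack and the AWS Glue app stacks are deployed. The following is a minimal sketch, assuming a PyYAML-based loader, of how such a configuration file could be read and mapped to CDK environments; the repository's actual entry point (`app.py`) may differ in detail.

```python
# Minimal sketch (assumed structure) of reading default-config.yaml and mapping it
# to CDK environments; see the repository's entry point for the actual code.
import yaml
import aws_cdk as cdk

with open("default-config.yaml") as f:
    config = yaml.safe_load(f)

app = cdk.App()

# Build one Environment per account/Region pair defined in the config file
pipeline_env = cdk.Environment(
    account=str(config["pipelineAccount"]["awsAccountId"]),
    region=config["pipelineAccount"]["awsRegion"],
)
dev_env = cdk.Environment(
    account=str(config["devAccount"]["awsAccountId"]),
    region=config["devAccount"]["awsRegion"],
)
prod_env = cdk.Environment(
    account=str(config["prodAccount"]["awsAccountId"]),
    region=config["prodAccount"]["awsRegion"],
)

# The pipeline stack lives in the pipeline account; the AWS Glue app stacks are
# deployed into the dev and prod environments through the pipeline's
# DeployDev/DeployProd stages.
# PipelineStack(app, "PipelineStack", env=pipeline_env, config=config)

app.synth()
```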
### Deploy your AWS resources

Run the following command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:

```
$ cdk deploy --profile <PIPELINE-PROFILE>
```

This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.

When the `cdk deploy` command is complete, let's verify the pipeline using the pipeline account:

1. Open the [AWS CodePipeline console](https://us-east-1.console.aws.amazon.com/codesuite/codepipeline/pipelines).
2. Choose `GluePipeline`. Verify that `GluePipeline` has the following stages: `Source`, `Build`, `UpdatePipeline`, `Assets`, `DeployDev`, and `DeployProd`. Also verify that the `Source`, `Build`, `UpdatePipeline`, `Assets`, and `DeployDev` stages have succeeded, and that `DeployProd` is in a pending state. This can take about 15 minutes.

Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resources on the AWS CloudFormation console in the dev account. At this step, the AWS Glue app stack is deployed only in the dev account. You can try running the AWS Glue job `ProcessLegislators` to see how it works.

### Configure your Git repository with AWS CodeCommit

In the earlier step, you cloned the Git repository from GitHub. Although it is possible to configure the CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, this time we use AWS CodeCommit. If you prefer those third-party Git providers, [configure connections](https://docs.aws.amazon.com/codepipeline/latest/userguide/connections-github.html) and edit [`pipeline_stack.py`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/aws_glue_cdk_baseline/pipeline_stack.py) to define the variable `source` for the target Git provider using [`CodePipelineSource`](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.pipelines.CodePipelineSource.html).

Because you already ran the `cdk deploy` command, the CodeCommit repository has already been created with all the required code and related files. The first step is to complete the [setup required for access to CodeCommit](https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-migrate-repository-existing.html#how-to-migrate-existing-setup). The next step is to clone the repository from CodeCommit to your local environment. Run the following commands:

```
$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/aws-glue-cdk-baseline
```

In the next step, we make changes in this local copy of the CodeCommit repository.

## End-to-end development lifecycle

Now that the environment has been successfully created, you're ready to start developing a data integration pipeline using this baseline template. Let's walk through the end-to-end development lifecycle.

When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this tutorial, let's assume the use case of adding a new AWS Glue job, with a new job script, that reads multiple S3 locations and joins them.

### Implement and test in your local environment

First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code. Set up your development environment by following the steps in [Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container](https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/).
The following steps are required in the context of this template:

1. Start Docker.

2. Pull the Docker image that has the local development environment with the AWS Glue ETL library:

```
$ docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01
```

3. Run the following command to define the AWS named profile name:

```
$ PROFILE_NAME="<DEV-PROFILE>"
```

4. Run the following commands to make the baseline template directory available as the workspace location:

```
$ cd aws-glue-cdk-baseline/
$ WORKSPACE_LOCATION=$(pwd)
```

5. Run the Docker container:

```
$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ \
    -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 \
    --name glue_pyspark public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark
```

6. Start Visual Studio Code.

7. Choose **Remote Explorer** in the navigation pane, then choose the arrow icon of the `workspace` folder in the container `public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01`. If the workspace folder is not shown, choose **Open Folder** and select `/home/glue_user/workspace`.

Now install the required dependencies described in [requirements.txt](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/requirements.txt) in the container environment.

8. Run the following commands in [the terminal in Visual Studio Code](https://code.visualstudio.com/docs/terminal/basics):

```
$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt
```

9. Implement the code. Let's make the required changes for the new AWS Glue job:

* [`aws_glue_cdk_baseline/glue_app_stack.py`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/aws_glue_cdk_baseline/glue_app_stack.py)

Add this new code block right after the existing job definition of `ProcessLegislators` in order to add the new AWS Glue job `JoinLegislators`:

```python
        self.new_glue_job = glue.Job(self, "JoinLegislators",
            executable=glue.JobExecutable.python_etl(
                glue_version=glue.GlueVersion.V4_0,
                python_version=glue.PythonVersion.THREE,
                script=glue.Code.from_asset(
                    path.join(path.dirname(__file__), "job_scripts/join_legislators.py")
                )
            ),
            description="a new example PySpark job",
            default_arguments={
                "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
                "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
                "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
            },
            tags={
                "environment": self.environment,
                "artifact_id": self.artifact_id,
                "stack_id": self.stack_id,
                "stack_name": self.stack_name
            }
        )
```

Here, you added three job parameters for different S3 locations. In the following steps, you provide those locations through the config files, and the job script reads them as AWS Glue job parameters.
Then, create a new job script and a new unit test script for the new AWS Glue job:

* `aws_glue_cdk_baseline/job_scripts/join_legislators.py`

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions


class JoinLegislators:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
            params.append('input_path_orgs')
            params.append('input_path_persons')
            params.append('input_path_memberships')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
            self.input_path_orgs = args['input_path_orgs']
            self.input_path_persons = args['input_path_persons']
            self.input_path_memberships = args['input_path_memberships']
        else:
            jobname = "test"
            self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
            self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
            self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"

        self.job.init(jobname, args)

    def run(self):
        dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
        df = dyf.toDF()
        df.printSchema()
        df.show()
        print(df.count())


def read_dynamic_frame_from_json(glue_context, path):
    return glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )


def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
    orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
    persons = read_dynamic_frame_from_json(glue_context, path_persons)
    memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
    orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
    dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
    return dynamicframe_joined


if __name__ == '__main__':
    JoinLegislators().run()
```

* `aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py`

```python
import pytest
import sys
import join_legislators
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield context


def test_counts(glue_context):
    dyf = join_legislators.join_legislators(
        glue_context,
        "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
        "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
    assert dyf.toDF().count() == 10439
```

* [`default-config.yaml`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/default-config.yaml)

Add the following under both `prod` and `dev`:

```yaml
JoinLegislators:
  inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
  inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
  inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
```

* [`tests/unit/test_glue_app_stack.py`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/unit/test_glue_app_stack.py)
* [`tests/unit/test_pipeline_stack.py`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/unit/test_pipeline_stack.py)
* [`tests/snapshot/test_snapshot_glue_app_stack.py`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/snapshot/test_snapshot_glue_app_stack.py)

Add the following under `"jobs"` in the variable `config` in the preceding three files (there is no need to replace the S3 locations):

```python
, "JoinLegislators": {
    "inputLocationOrgs": "s3://path_to_data_orgs",
    "inputLocationPersons": "s3://path_to_data_persons",
    "inputLocationMemberships": "s3://path_to_data_memberships"
}
```

10. Choose **Run** at the top right to run individual job scripts. If the **Run** button is not shown, install the **Python** extension into the container through **Extensions** in the navigation pane.

11. For local unit testing, run the following commands in [the terminal in Visual Studio Code](https://code.visualstudio.com/docs/terminal/basics):

```
$ cd aws_glue_cdk_baseline/job_scripts/
$ python3 -m pytest
```

Then you can verify that the newly added unit test passed successfully.

12. Update the snapshot test files for `pytest` by running the following commands:

```
$ cd ../../
$ python3 -m pytest --snapshot-update
```

### Deploy to development environment

Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:

1. Complete the [setup required for access to CodeCommit](https://docs.aws.amazon.com/codecommit/latest/userguide/how-to-migrate-repository-existing.html#how-to-migrate-existing-setup).
2. Commit and push your changes to the CodeCommit repository:

```
$ git add .
$ git commit -m "Add the second Glue job"
$ git push
```

You can see that the pipeline is successfully triggered.

### Integration test

Nothing additional is required to run the integration test for the newly added AWS Glue job. The integration test script [`integ_test_glue_app_stack.py`](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/integ/integ_test_glue_app_stack.py) runs all the jobs that have a specific tag, then verifies their state and duration. If you want to change the condition or the threshold, you can edit the assertions at [the end of the `integ_test_glue_job` method](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/integ/integ_test_glue_app_stack.py#L105-L106).

### Deploy to production environment

Complete the following steps to deploy the AWS Glue app stack to the production environment:

1. On the `GluePipeline` page in the [AWS CodePipeline console](https://us-east-1.console.aws.amazon.com/codesuite/codepipeline/pipelines), choose **Review** under the `DeployProd` stage.
2. Choose **Approve**.

Wait for the `DeployProd` stage to complete, then you can verify the AWS Glue app stack resources in the prod account.

## Clean up

To clean up your resources, complete the following steps:

1. Run the following command using the pipeline account:

```
$ cdk destroy --profile <PIPELINE-PROFILE>
```

2. Delete the AWS Glue app stack in the dev account and prod account.

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.