# aws-emr-cdk-atlas

A CDK stack to deploy Amazon EMR with Atlas.

# What does the code do?
1. Creates an AWS EMR cluster within a new VPC.
2. Creates an IAM service role that lets the EMR cluster read scripts from an S3 bucket.
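
To make the second point concrete, here is a minimal sketch of what a read-only S3 policy document for the script bucket could look like. It is an illustration only, not the policy this stack generates: the function name `s3_read_policy` and the bucket name `my-script-bucket` are hypothetical.

```python
import json


def s3_read_policy(bucket_name: str) -> dict:
    """Build an IAM policy document allowing read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-script-bucket" is a placeholder, not a name used by this repository.
    print(json.dumps(s3_read_policy("my-script-bucket"), indent=2))
```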


# How to use the code?
## Install AWS CDK
Please refer to the [AWS CDK getting started guide](https://docs.aws.amazon.com/cdk/latest/guide/getting_started.html).

## AWS CLI config
Please refer to the [AWS CLI configuration files guide](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).

## Install git

## Initialize the CDK directory
    
    git clone https://github.com/aws-samples/aws-cdk-emr-atlas
    cd aws-cdk-emr-atlas/aws-emr-cdk/

## Activate the virtualenv and install dependencies
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    pip3 install PyYAML
    pip3 install aws_cdk.core
    pip3 install aws_cdk.aws_emr
    pip3 install aws_cdk.aws_ec2


## Change the configurations
Update the configurations in the app-config.yml file.

Before deploying, note the following:

1. You need an EC2 key pair, configured in app-config.yaml under emr->ec2->key_pair.
2. You need to create two S3 buckets and set them in app-config.yaml as s3_log_bucket and s3_script_bucket.
3. Upload the file aws-cdk-emr-atlas/aws-emr-cdk/apache-atlas-emr.sh to the bucket named by the 's3_script_bucket' key in app-config.yaml.
4. The IAM service role and job flow role for EMR are created automatically.
5. A VPC with a public subnet is created automatically.
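
A quick pre-deployment check of the settings listed above can be sketched as follows. This is an assumption-laden example: the helper names `find_key` and `missing_settings` are hypothetical, and because the exact nesting of `s3_log_bucket` and `s3_script_bucket` in the config file is not described here, the sketch searches for each key at any depth.

```python
from typing import Any, Optional

# Keys this README says must be set before deploying.
REQUIRED_KEYS = ("key_pair", "s3_log_bucket", "s3_script_bucket")


def find_key(node: Any, key: str) -> Optional[Any]:
    """Return the first value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def missing_settings(config: dict) -> list:
    """List the required keys that are absent or empty in the parsed config."""
    return [key for key in REQUIRED_KEYS if not find_key(config, key)]


if __name__ == "__main__":
    # In practice the dict would come from PyYAML, for example:
    #   config = yaml.safe_load(open("app-config.yaml"))
    # The values below are placeholders.
    example = {
        "emr": {"ec2": {"key_pair": "my-key-pair"}},
        "s3_log_bucket": "my-log-bucket",
    }
    print(missing_settings(example))  # s3_script_bucket is not set here
```

An empty list means all three settings were found with non-empty values.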

## Now let's deploy!
    cdk synth  # To review the CloudFormation template
    cdk diff  # To review the change set
    cdk deploy  # To deploy the stack

## Test EMR
After you deploy the stack, you can find an EMR cluster in the console. Connect to the master node from a terminal
to run a job and test other services on it, or add a step in the console to test Hadoop and Spark.
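
As an alternative to adding a step in the console, a step can also be submitted programmatically. The sketch below assumes boto3 is installed (it is not among the dependencies listed above) and that Spark is available on the cluster; the function names, the step name, and the script URI are hypothetical placeholders.

```python
def spark_step(name: str, script_uri: str) -> dict:
    """Describe an EMR step that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def submit_step(cluster_id: str, step: dict) -> str:
    """Submit the step with boto3 and return the new step ID."""
    import boto3  # imported lazily so the step builder works without boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


if __name__ == "__main__":
    # Placeholder values; replace them with a real script location.
    step = spark_step("test-spark-job", "s3://my-script-bucket/job.py")
    print(step)
    # submit_step("<your-cluster-id>", step)  # requires AWS credentials
```

Keeping the step definition separate from the API call makes it possible to inspect the step before anything is sent to AWS.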

## Run a job on EMR cluster
# Additional Resources:
1. AWS Best Practice: [link](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html)
2. Resizing a cluster: [link](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-resize.html)
3. Submit a job on console: [link](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-add-steps-console.html)
4. Submit Hadoop jobs interactively: [link](https://docs.aws.amazon.com/emr/latest/ManagementGuide/interactive-jobs.html)
5. You can terminate an EMR cluster through the console, CLI, or API: [link](https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_TerminateJobFlow.html)