# Datamask EMR launch

*Deploy datamask-pyutil using EMR launch*

This project is a CDK stack that orchestrates the data-masking framework.

![](datamask-emr-stack.png)

- [AWS Step Functions](https://aws.amazon.com/step-functions): orchestrates the data pipeline
- [AWS Lambda](https://aws.amazon.com/lambda): parses and processes the S3 events
- [Amazon SQS](https://aws.amazon.com/sqs/): receives the incoming events from S3
- [Amazon SNS](https://aws.amazon.com/sns/): notifies on error or success
- [Amazon EMR](https://aws.amazon.com/emr/): runs the masking with Apache Spark
- [Amazon DynamoDB](https://aws.amazon.com/dynamodb/): stores the processing history
- [Amazon S3](https://aws.amazon.com/s3/): stores the incoming files, output files, and parameters

A minimal sketch of the S3-to-SQS wiring is included at the end of this README.

Deploy datamask-pyutil to an S3 bucket to run with PySpark. The deployment is configured with the following environment variables:

- PREFIX: prefix used to name all the resources created.
- ENV: the environment: dev, uat, or prod.
- ARTF_PARAM_FILE: filename of the JSON parameter file.
- VPC: VPC id in which to create the EMR cluster.
- SUBNET: subnet id in which to create the EMR cluster.
- AZ: availability zone id of the subnet.

The deploy.sh script gets the version of [emr-launch](https://github.com/awslabs/aws-emr-launch), a Python package that helps create the EMR cluster, and deploys the datamask-launch CDK stack:

```
$ git clone [REPOSITORY PATH]
$ cd stacks/datamask-emr-launch
$ export PREFIX="org-prefix"
$ export ENV="dev"
$ export ARTF_PARAM_FILE="parm.json"
$ export VPC="vpc_id"
$ export SUBNET="subnet_id"
$ export AZ="az_id"
$ bash scripts/deploy.sh
```
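For reference, the snippet below is a minimal, hypothetical CDK (Python) sketch of the S3-to-SQS event wiring described in the service list above. The construct names (`DatamaskEventsStack`, `IncomingEventsQueue`, `IncomingFilesBucket`) are illustrative assumptions and do not correspond to the stack's actual resources.

```
from aws_cdk import Stack, aws_s3 as s3, aws_s3_notifications as s3n, aws_sqs as sqs
from constructs import Construct


class DatamaskEventsStack(Stack):
    """Illustrative sketch: route S3 'object created' events into SQS."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Queue that receives the incoming events from S3 (name is an assumption).
        queue = sqs.Queue(self, "IncomingEventsQueue")

        # Bucket that stores the incoming files to be masked (name is an assumption).
        bucket = s3.Bucket(self, "IncomingFilesBucket")

        # Publish a message to the queue whenever a new object lands in the bucket;
        # downstream, a Lambda would parse these events and start the Step Functions
        # pipeline that runs the EMR masking job.
        bucket.add_event_notification(
            s3.EventType.OBJECT_CREATED,
            s3n.SqsDestination(queue),
        )
```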