# Data mask python utilities

*Data mask python utilities for mask processing.*

## Table of contents

- [Quick Start](#quick-start)

## Quick Start

### Pyspark + EMR/GLUE

Deploy datamask-pyutil to an S3 bucket to run it with pyspark:

```
$ git clone [REPOSITORY PATH]
$ cd datamask-pyutil
$ bash scripts/deploy.sh [BUCKET TO DEPLOY]
```

The folder ./artifacts is created during the build process and its contents are uploaded to [BUCKET TO DEPLOY].

Then use the artifacts to run an EMR step or a Glue job (a sketch of an EMR step submission appears at the end of this README). The script entry point will be at "s3://[BUCKET]/datamask/datamask-pyutil.sh".

### Custom Pyspark

Create your own pyspark module with the conf_process class. First, clone the repository and build the artifact files to be included in the spark-submit execution:

```
$ git clone [REPOSITORY PATH]
$ cd datamask-pyutil
$ bash scripts/build.sh [BUCKET TO DEPLOY]
```

Create your own pyspark script following the example below:

```
from pyspark.sql import SparkSession
from datamask_pyutil import conf_process

# Initialize the Spark session
spark = SparkSession.builder.appName('JobName').getOrCreate()

jobName = 'JobName'
configPath = '[PATH TO THE PARAMETERS JSON]'  # see the Parameters section
part_vet = ['ano=2020', 'month=12']           # partitions to process

cf = conf_process.DatamaskConfProcess(configPath, jobName)

if not cf.is_job_active():
    print("Jobname[{}] is not active".format(jobName))
else:
    cf.process_spark(part_vet, spark)
```

Then call spark-submit with all files in "./artifacts/spark-dist/*" passed to "--py-files" (see the spark-submit sketch at the end of this README).

### Local test

You can also use "artifact/datamask-pyutil.sh" to run pyspark in a local environment. Use scripts/run-test.sh as an example (see the sketch at the end of this README).

Test case:

- ./test/test_data/input: test input data
- ./test/test_data/salts: test salts
- ./test/test_parms/test_parms.json: test parameters

### Parameters

All the parameters are stored in a JSON file; see the [Parameters schema](./schemas/datamask.md) for its structure.
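For the EMR step mentioned under "Pyspark + EMR/GLUE", one common way to submit the deployed entry point is an add-steps call with EMR's script-runner. This is only a sketch: the cluster id and region are placeholders, the entry script may expect arguments that are not shown here, and a Glue job would instead be configured through the Glue console or CLI.

```
# Sketch only: cluster id, region and script arguments are placeholders.
$ aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps 'Type=CUSTOM_JAR,Name=datamask,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://[BUCKET]/datamask/datamask-pyutil.sh]'
```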
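For the "Custom Pyspark" section, note that "--py-files" expects a comma-separated list, so the files under ./artifacts/spark-dist/ have to be joined before being passed in. A minimal sketch, assuming your script is named my_datamask_job.py (a placeholder):

```
# Sketch only: my_datamask_job.py is a placeholder for your own script.
$ spark-submit \
    --py-files "$(ls ./artifacts/spark-dist/* | paste -sd, -)" \
    my_datamask_job.py
```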
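For the "Local test" section, the run is just the helper script from the repository root, presumably picking up the test data, salts and parameters listed above (a local pyspark installation is assumed):

```
$ cd datamask-pyutil
$ bash scripts/run-test.sh
```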