# Data Type Validation With Apache Spark
The delimited data type and structure validation framework validates input data against an expected schema, including the number of columns, data types, and nullability, and appends an error flag and an error message column that label specific error details. The purpose of this framework is to keep valid and invalid data together while providing a mechanism to query specific erroneous records. The framework takes schema details in JSON format and an input data path, and returns a Spark DataFrame that contains the input data labelled with error flag and error message details.
The framework uses Spark native APIs such as RDD and DataFrame, together with the Univocity parser library, which is also used by Apache Spark and various other Apache open-source projects.
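The exact Scala entry point is defined in the framework's sources rather than in this README; the sketch below uses a hypothetical `validate` signature and a hypothetical `schema.json` file only to illustrate the contract: schema JSON and input path in, labelled DataFrame out.
```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical signature; the real class and method names live in the
// framework's sources. Shown only to illustrate the input/output contract.
def validate(spark: SparkSession, schemaJson: String, inputPath: String): DataFrame = ???

val spark = SparkSession.builder().appName("datatype-validation").getOrCreate()
val schemaJson = scala.io.Source.fromFile("schema.json").mkString // hypothetical schema file
val validated = validate(spark, schemaJson, "s3://my-bucket/input/") // hypothetical path

// `validated` holds all input rows plus the error_flag and error_message columns.
```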
# Error Columns
The framework adds two columns to the actual data set that describe data structure error details: `error_flag` and `error_message`.
| `Column` | `Type` | `Description` |
|---|---|---|
| error_flag | INT | Indicates the error as an integer value from 0 to 3:<br>0 = No error<br>1 = Data type error(s)<br>2 = Number of columns less than expected<br>3 = Number of columns more than expected |
| error_message | Array[Map[String, String]] | Contains error messages with specific data components.<br>If `error_flag` is 0, it is an empty array.<br>If `error_flag` is 1, it is an array of maps, one per column with a data type validation error:<br>`"field" -> "{Field name}"`<br>`"value" -> "{Field value that caused the error}"`<br>`"dataType" -> "{Field data type}"`<br>`"dataFormat" -> "{Field data format}"`<br>`"nullability" -> "{Field nullability}"`<br>If `error_flag` is 2 or 3, it is an array of maps describing the column count mismatch:<br>`"record" -> "{Entire record as a single column}"`<br>`"expectedNumberOfColumns" -> "{Expected number of columns}"`<br>`"actualNumberOfColumns" -> "{Actual number of columns}"` |
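With the returned DataFrame (here called `validated`, an assumed variable name), erroneous records can be isolated and the error maps unpacked with standard Spark SQL functions, for example:
```scala
import org.apache.spark.sql.functions.{col, explode}

// Rows that failed data type validation (error_flag = 1), with one output row
// per error map so individual keys such as "field" and "value" can be read.
val typeErrors = validated
  .filter(col("error_flag") === 1)
  .select(col("*"), explode(col("error_message")).as("error"))

// Records whose column count did not match the schema (error_flag 2 or 3).
val columnCountErrors = validated.filter(col("error_flag").isin(2, 3))
```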
# Schema Structure
The expected data structure is passed as a JSON string with specific details that describe the input data. It is defined as follows:

| `Element Name` | `Type` | `Description` | `Mandatory Field/Default Value` | `Example` |
|---|---|---|---|---|
| schemaVersion | String | Schema version | Mandatory | `1.0.0` |
| delimiter | String | Field delimiter as a single character | Mandatory | `","` |
| fileHeader | Boolean | true if the first line of the data is a header, else false | Mandatory | `true` |
| quoteChar | String | Field quote character | Optional; default is double quote | `"\""` |
| escapeChar | String | Escape character | Optional; default is backslash | `"\\"` |
| lineSeparator | String | Line separator | Optional; default is newline | `"\n"` |
| dataStructure | Array | Data structure of each column | Mandatory | `{"fieldName": "some_date_column", "dataType": "date", "nullable": false, "dataFormat": "yyyy-MM-dd"}` |
| dataStructure.fieldName | String | Field name | Mandatory | `sales_date` |
| dataStructure.dataType | String | Field data type | Mandatory | Accepted Hive/Spark data types: `STRING`, `INT`, `BIGINT`, `DOUBLE`, `DECIMAL`, `DATE`, `TIMESTAMP` |
| dataStructure.nullable | Boolean | true if nullable, else false | Mandatory | `false` |
| dataStructure.dataFormat | String | Data type format | Mandatory for DECIMAL, DATE and TIMESTAMP data types | For DECIMAL: `precision,scale`, e.g. `5,2`.<br>For DATE: [Java Date Format](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html), e.g. `"yyyy-MM-dd"`.<br>For TIMESTAMP: [Java Date Time Format](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html), e.g. `"yyyy-MM-dd HH:mm:ss"`, or Zulu format `"yyyy-MM-dd'T'HH:mm:ss'Z'"` |
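DATE and TIMESTAMP formats follow Java SimpleDateFormat semantics. The sketch below is a hypothetical helper, not the framework's own code; it only illustrates how strict parsing constrains a value to its `dataFormat`:
```scala
import java.text.SimpleDateFormat

// Strict parsing mirrors how a dataFormat such as "yyyy-MM-dd" constrains
// a DATE value; setLenient(false) rejects impossible dates like 2021-02-30.
def matchesDateFormat(value: String, dataFormat: String): Boolean = {
  val fmt = new SimpleDateFormat(dataFormat)
  fmt.setLenient(false)
  try { fmt.parse(value); true }
  catch { case _: java.text.ParseException => false }
}

matchesDateFormat("2021-02-28", "yyyy-MM-dd") // true
matchesDateFormat("2021-02-30", "yyyy-MM-dd") // false: February has no 30th
```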
### Structure Template
```json
{
  "schemaVersion": "",
  "delimiter": "",
  "fileHeader": "",
  "quoteChar": "",
  "escapeChar": "",
  "lineSeparator": "",
  "dataStructure": [
    {
      "fieldName": "",
      "dataType": "",
      "nullable": "",
      "dataFormat": ""
    },
    ...
  ]
}
```
### Example
```json
{
  "schemaVersion": "0.0.1",
  "delimiter": ",",
  "fileHeader": true,
  "quoteChar": "\"",
  "escapeChar": "\\",
  "lineSeparator": "\n",
  "dataStructure": [
    {
      "fieldName": "int1",
      "dataType": "int",
      "nullable": true
    },
    {
      "fieldName": "int2",
      "dataType": "int",
      "nullable": false
    },
    {
      "fieldName": "string1",
      "dataType": "string",
      "nullable": true
    },
    {
      "fieldName": "string2",
      "dataType": "string",
      "nullable": false
    },
    {
      "fieldName": "decimal1",
      "dataType": "decimal",
      "dataFormat": "38,0",
      "nullable": true
    },
    {
      "fieldName": "decimal2",
      "dataType": "decimal",
      "dataFormat": "5,2",
      "nullable": false
    },
    {
      "fieldName": "date1",
      "dataType": "date",
      "nullable": true,
      "dataFormat": "yyyy-MM-dd"
    },
    {
      "fieldName": "date2",
      "dataType": "date",
      "nullable": false,
      "dataFormat": "yyyy-MM-dd"
    },
    {
      "fieldName": "timestamp1",
      "dataType": "timestamp",
      "nullable": true,
      "dataFormat": "yyyy/MM/dd'T'HH:mm:ss'Z'"
    },
    {
      "fieldName": "timestamp2",
      "dataType": "timestamp",
      "nullable": false,
      "dataFormat": "yyyy/MM/dd'T'HH:mm:ss'Z'"
    }
  ]
}
```
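As a minimal sketch of working with such a schema programmatically, the example below parses it with play-json, one of the libraries packaged with the framework (see Build Package below). `example-schema.json` is a hypothetical file holding the JSON shown above.
```scala
import play.api.libs.json.{JsValue, Json}

// Parse the example schema and pull out a few elements by name.
val schema: JsValue = Json.parse(scala.io.Source.fromFile("example-schema.json").mkString)

val delimiter = (schema \ "delimiter").as[String]  // ","
val fieldNames = (schema \ "dataStructure").as[Seq[JsValue]]
  .map(f => (f \ "fieldName").as[String])          // int1, int2, string1, ...
```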
# Clone Repository
```bash
git clone {GIT Repo URL}
cd datatype-validation-with-apache-spark
```
# Open JDK 8 setup
```bash
brew install openjdk@8
export JAVA_HOME=/usr/local/opt/openjdk@8
```
> For Windows and Linux, the corresponding package managers or installers can be used to download and install OpenJDK 8.
# Install sbt
```bash
brew install sbt
```
# Run Tests
```bash
sbt test
```
# Build Package
The sbt `pack` plugin compiles the source code and creates a `pack/lib` directory under `datatype-validation-with-apache-spark/target` that contains the individual JAR files required:
```bash
sbt pack
```
```
spark-datatype-com-awsproserve-validation_2.11-1.0.jar
play-functional_2.11-2.7.4.jar
play-json_2.11-2.7.4.jar
```
The JAR files listed above must be added to the Apache Spark setup (AWS Glue, Amazon EMR, or any other Spark system) as:
```
--jars {path}/play-functional_2.11-2.7.4.jar,{path}/play-json_2.11-2.7.4.jar,{path}/spark-datatype-com-awsproserve-validation_2.11-1.0.jar
```
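Equivalently, when the session is built in code, the same JARs can be registered through the standard `spark.jars` configuration (paths remain placeholders, as in the `--jars` example above):
```scala
import org.apache.spark.sql.SparkSession

// Same three JARs as in the --jars example; {path} is a placeholder.
val spark = SparkSession.builder()
  .appName("datatype-validation")
  .config("spark.jars", Seq(
    "{path}/play-functional_2.11-2.7.4.jar",
    "{path}/play-json_2.11-2.7.4.jar",
    "{path}/spark-datatype-com-awsproserve-validation_2.11-1.0.jar"
  ).mkString(","))
  .getOrCreate()
```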
# Data Analysis
The document available at [Data Analysis](docs/how_to_analyze.asciidoc) contains code examples and AWS Athena queries for analyzing data using the error information derived by the validation framework.
# Security
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
# License
This library is licensed under the MIT-0 License. See the LICENSE file.