{
"cells": [
{
"cell_type": "markdown",
"id": "b0ee9a28",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"[](https://github.com/aws/aws-sdk-pandas)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3a2a7b51",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# 37 - Glue Data Quality"
]
},
{
"cell_type": "markdown",
"id": "42724a76",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
" AWS Glue Data Quality helps you evaluate and monitor the quality of your data."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Create test data\n",
"\n",
"First, let's start by creating test data, writing it to S3, and registering it in the Glue Data Catalog."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5237eed8",
"metadata": {
"collapsed": false,
"pycharm": {
"is_executing": true,
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import awswrangler as wr\n",
"import pandas as pd\n",
"\n",
"glue_database = \"aws_sdk_pandas\"\n",
"glue_table = \"my_glue_table\"\n",
"path = \"s3://BUCKET_NAME/my_glue_table/\"\n",
"\n",
"df = pd.DataFrame({\"c0\": [0, 1, 2], \"c1\": [0, 1, 2], \"c2\": [0, 0, 0]})\n",
"wr.s3.to_parquet(df, path, dataset=True, database=glue_database, table=glue_table, partition_cols=[\"c2\"])"
]
},
{
"cell_type": "markdown",
"id": "13ba6d9d",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Start with recommended data quality rules"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "81fa434e",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"AWS Glue Data Quality can recommend a set of data quality rules so you can get started quickly.\n",
"\n",
"Note: Running Glue Data Quality recommendation and evaluation tasks requires an IAM role. This role must trust the Glue principal and allow permissions to various resources including the Glue table and the S3 bucket where your data is stored. Moreover, data quality IAM actions must be granted. To find out more, check [Authorization](https://docs.aws.amazon.com/glue/latest/dg/data-quality-authorization.html)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "a1983e63",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rule_type | \n",
" parameter | \n",
" expression | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" RowCount | \n",
" None | \n",
" between 1 and 6 | \n",
"
\n",
" \n",
" 1 | \n",
" IsComplete | \n",
" \"c0\" | \n",
" None | \n",
"
\n",
" \n",
" 2 | \n",
" Uniqueness | \n",
" \"c0\" | \n",
" > 0.95 | \n",
"
\n",
" \n",
" 3 | \n",
" ColumnValues | \n",
" \"c0\" | \n",
" <= 2 | \n",
"
\n",
" \n",
" 4 | \n",
" IsComplete | \n",
" \"c1\" | \n",
" None | \n",
"
\n",
" \n",
" 5 | \n",
" Uniqueness | \n",
" \"c1\" | \n",
" > 0.95 | \n",
"
\n",
" \n",
" 6 | \n",
" ColumnValues | \n",
" \"c1\" | \n",
" <= 2 | \n",
"
\n",
" \n",
" 7 | \n",
" IsComplete | \n",
" \"c2\" | \n",
" None | \n",
"
\n",
" \n",
" 8 | \n",
" ColumnValues | \n",
" \"c2\" | \n",
" in [\"0\"] | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rule_type parameter expression\n",
"0 RowCount None between 1 and 6\n",
"1 IsComplete \"c0\" None\n",
"2 Uniqueness \"c0\" > 0.95\n",
"3 ColumnValues \"c0\" <= 2\n",
"4 IsComplete \"c1\" None\n",
"5 Uniqueness \"c1\" > 0.95\n",
"6 ColumnValues \"c1\" <= 2\n",
"7 IsComplete \"c2\" None\n",
"8 ColumnValues \"c2\" in [\"0\"]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"first_ruleset = \"ruleset_1\"\n",
"iam_role_arn = \"arn:aws:iam::...\" # IAM role assumed by the Glue Data Quality job to access resources\n",
"\n",
"df_recommended_ruleset = wr.data_quality.create_recommendation_ruleset( # Creates a recommended ruleset\n",
" name=first_ruleset,\n",
" database=glue_database,\n",
" table=glue_table,\n",
" iam_role_arn=iam_role_arn,\n",
" number_of_workers=2,\n",
")\n",
"\n",
"df_recommended_ruleset"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a34270e9",
"metadata": {},
"source": [
"## Update the recommended rules\n",
"\n",
"Recommended rulesets are not perfect and you are likely to modify them or create your own."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "78f01dba",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rule_type | \n",
" parameter | \n",
" expression | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" RowCount | \n",
" None | \n",
" between 1 and 6 | \n",
"
\n",
" \n",
" 1 | \n",
" IsComplete | \n",
" \"c0\" | \n",
" None | \n",
"
\n",
" \n",
" 2 | \n",
" Uniqueness | \n",
" \"c0\" | \n",
" > 0.95 | \n",
"
\n",
" \n",
" 3 | \n",
" ColumnValues | \n",
" \"c0\" | \n",
" <= 2 | \n",
"
\n",
" \n",
" 4 | \n",
" IsComplete | \n",
" \"c1\" | \n",
" None | \n",
"
\n",
" \n",
" 5 | \n",
" Uniqueness | \n",
" \"c1\" | \n",
" > 0.95 | \n",
"
\n",
" \n",
" 6 | \n",
" ColumnValues | \n",
" \"c1\" | \n",
" <= 2 | \n",
"
\n",
" \n",
" 7 | \n",
" IsComplete | \n",
" \"c2\" | \n",
" None | \n",
"
\n",
" \n",
" 8 | \n",
" ColumnValues | \n",
" \"c2\" | \n",
" in [0, 1, 2] | \n",
"
\n",
" \n",
" 9 | \n",
" Uniqueness | \n",
" \"c2\" | \n",
" > 0.95 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rule_type parameter expression\n",
"0 RowCount None between 1 and 6\n",
"1 IsComplete \"c0\" None\n",
"2 Uniqueness \"c0\" > 0.95\n",
"3 ColumnValues \"c0\" <= 2\n",
"4 IsComplete \"c1\" None\n",
"5 Uniqueness \"c1\" > 0.95\n",
"6 ColumnValues \"c1\" <= 2\n",
"7 IsComplete \"c2\" None\n",
"8 ColumnValues \"c2\" in [0, 1, 2]\n",
"9 Uniqueness \"c2\" > 0.95"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Append and update rules\n",
"df_updated_ruleset = df_recommended_ruleset.append(\n",
" {\"rule_type\": \"Uniqueness\", \"parameter\": '\"c2\"', \"expression\": \"> 0.95\"}, ignore_index=True\n",
")\n",
"\n",
"df_updated_ruleset.at[8, \"expression\"] = \"in [0, 1, 2]\"\n",
"\n",
"# Update the existing ruleset (upsert)\n",
"wr.data_quality.update_ruleset(\n",
" name=first_ruleset,\n",
" df_rules=df_updated_ruleset,\n",
" mode=\"upsert\", # update existing or insert new rules to the ruleset\n",
")\n",
"\n",
"wr.data_quality.get_ruleset(name=first_ruleset)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "52ee5851",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Run a data quality task\n",
"\n",
"The ruleset can now be evaluated against the data. A cluster with 2 workers is used for the run. It returns a report with `PASS`/`FAIL` results for each rule."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "a3aeec39",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Description | \n",
" Result | \n",
" ResultId | \n",
" EvaluationMessage | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Rule_1 | \n",
" RowCount between 1 and 6 | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" Rule_2 | \n",
" IsComplete \"c0\" | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 2 | \n",
" Rule_3 | \n",
" Uniqueness \"c0\" > 0.95 | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" Rule_4 | \n",
" ColumnValues \"c0\" <= 2 | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 4 | \n",
" Rule_5 | \n",
" IsComplete \"c1\" | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 5 | \n",
" Rule_6 | \n",
" Uniqueness \"c1\" > 0.95 | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 6 | \n",
" Rule_7 | \n",
" ColumnValues \"c1\" <= 2 | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 7 | \n",
" Rule_8 | \n",
" IsComplete \"c2\" | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 8 | \n",
" Rule_9 | \n",
" ColumnValues \"c2\" in [0,1,2] | \n",
" PASS | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" NaN | \n",
"
\n",
" \n",
" 9 | \n",
" Rule_10 | \n",
" Uniqueness \"c2\" > 0.95 | \n",
" FAIL | \n",
" dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 | \n",
" Value: 0.0 does not meet the constraint requir... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Description Result \\\n",
"0 Rule_1 RowCount between 1 and 6 PASS \n",
"1 Rule_2 IsComplete \"c0\" PASS \n",
"2 Rule_3 Uniqueness \"c0\" > 0.95 PASS \n",
"3 Rule_4 ColumnValues \"c0\" <= 2 PASS \n",
"4 Rule_5 IsComplete \"c1\" PASS \n",
"5 Rule_6 Uniqueness \"c1\" > 0.95 PASS \n",
"6 Rule_7 ColumnValues \"c1\" <= 2 PASS \n",
"7 Rule_8 IsComplete \"c2\" PASS \n",
"8 Rule_9 ColumnValues \"c2\" in [0,1,2] PASS \n",
"9 Rule_10 Uniqueness \"c2\" > 0.95 FAIL \n",
"\n",
" ResultId \\\n",
"0 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"1 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"2 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"3 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"4 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"5 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"6 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"7 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"8 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"9 dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5 \n",
"\n",
" EvaluationMessage \n",
"0 NaN \n",
"1 NaN \n",
"2 NaN \n",
"3 NaN \n",
"4 NaN \n",
"5 NaN \n",
"6 NaN \n",
"7 NaN \n",
"8 NaN \n",
"9 Value: 0.0 does not meet the constraint requir... "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.data_quality.evaluate_ruleset(\n",
" name=first_ruleset,\n",
" iam_role_arn=iam_role_arn,\n",
" number_of_workers=2,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "170cd16d",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Create ruleset from Data Quality Definition Language definition"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e973a4fd",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"The Data Quality Definition Language (DQDL) is a domain specific language that you can use to define Data Quality rules. For the full syntax reference, see [DQDL](https://docs.aws.amazon.com/glue/latest/dg/dqdl.html)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "f5209459",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"second_ruleset = \"ruleset_2\"\n",
"\n",
"dqdl_rules = (\n",
" \"Rules = [\"\n",
" \"RowCount between 1 and 6,\"\n",
" 'IsComplete \"c0\",'\n",
" 'Uniqueness \"c0\" > 0.95,'\n",
" 'ColumnValues \"c0\" <= 2,'\n",
" 'IsComplete \"c1\",'\n",
" 'Uniqueness \"c1\" > 0.95,'\n",
" 'ColumnValues \"c1\" <= 2,'\n",
" 'IsComplete \"c2\",'\n",
" 'ColumnValues \"c2\" <= 1'\n",
" \"]\"\n",
")\n",
"\n",
"wr.data_quality.create_ruleset(\n",
" name=second_ruleset,\n",
" database=glue_database,\n",
" table=glue_table,\n",
" dqdl_rules=dqdl_rules,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9b791ace",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Create or update a ruleset from a data frame"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "74fddded",
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"AWS SDK for pandas also enables you to create or update a ruleset from a pandas data frame."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "f1d1905c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rule_type | \n",
" parameter | \n",
" expression | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" RowCount | \n",
" None | \n",
" between 2 and 8 | \n",
"
\n",
" \n",
" 1 | \n",
" ColumnCorrelation | \n",
" \"c0\" \"c1\" | \n",
" > 0.8 | \n",
"
\n",
" \n",
" 2 | \n",
" Uniqueness | \n",
" \"c0\" | \n",
" > 0.95 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rule_type parameter expression\n",
"0 RowCount None between 2 and 8\n",
"1 ColumnCorrelation \"c0\" \"c1\" > 0.8\n",
"2 Uniqueness \"c0\" > 0.95"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"third_ruleset = \"ruleset_3\"\n",
"\n",
"df_rules = pd.DataFrame({\n",
" \"rule_type\": [\"RowCount\", \"ColumnCorrelation\", \"Uniqueness\"],\n",
" \"parameter\": [None, '\"c0\" \"c1\"', '\"c0\"'],\n",
" \"expression\": [\"between 2 and 8\", \"> 0.8\", \"> 0.95\"],\n",
"})\n",
"\n",
"wr.data_quality.create_ruleset(\n",
" name=third_ruleset,\n",
" df_rules=df_rules,\n",
" database=glue_database,\n",
" table=glue_table,\n",
")\n",
"\n",
"wr.data_quality.get_ruleset(name=third_ruleset)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c846421e",
"metadata": {},
"source": [
"## Get multiple rulesets"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "b8b1d0b0",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rule_type | \n",
" parameter | \n",
" expression | \n",
" ruleset | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" RowCount | \n",
" None | \n",
" between 1 and 6 | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 1 | \n",
" IsComplete | \n",
" \"c0\" | \n",
" None | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 2 | \n",
" Uniqueness | \n",
" \"c0\" | \n",
" > 0.95 | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 3 | \n",
" ColumnValues | \n",
" \"c0\" | \n",
" <= 2 | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 4 | \n",
" IsComplete | \n",
" \"c1\" | \n",
" None | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 5 | \n",
" Uniqueness | \n",
" \"c1\" | \n",
" > 0.95 | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 6 | \n",
" ColumnValues | \n",
" \"c1\" | \n",
" <= 2 | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 7 | \n",
" IsComplete | \n",
" \"c2\" | \n",
" None | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 8 | \n",
" ColumnValues | \n",
" \"c2\" | \n",
" in [0, 1, 2] | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 9 | \n",
" Uniqueness | \n",
" \"c2\" | \n",
" > 0.95 | \n",
" ruleset_1 | \n",
"
\n",
" \n",
" 0 | \n",
" RowCount | \n",
" None | \n",
" between 1 and 6 | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 1 | \n",
" IsComplete | \n",
" \"c0\" | \n",
" None | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 2 | \n",
" Uniqueness | \n",
" \"c0\" | \n",
" > 0.95 | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 3 | \n",
" ColumnValues | \n",
" \"c0\" | \n",
" <= 2 | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 4 | \n",
" IsComplete | \n",
" \"c1\" | \n",
" None | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 5 | \n",
" Uniqueness | \n",
" \"c1\" | \n",
" > 0.95 | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 6 | \n",
" ColumnValues | \n",
" \"c1\" | \n",
" <= 2 | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 7 | \n",
" IsComplete | \n",
" \"c2\" | \n",
" None | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 8 | \n",
" ColumnValues | \n",
" \"c2\" | \n",
" <= 1 | \n",
" ruleset_2 | \n",
"
\n",
" \n",
" 0 | \n",
" RowCount | \n",
" None | \n",
" between 2 and 8 | \n",
" ruleset_3 | \n",
"
\n",
" \n",
" 1 | \n",
" ColumnCorrelation | \n",
" \"c0\" \"c1\" | \n",
" > 0.8 | \n",
" ruleset_3 | \n",
"
\n",
" \n",
" 2 | \n",
" Uniqueness | \n",
" \"c0\" | \n",
" > 0.95 | \n",
" ruleset_3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rule_type parameter expression ruleset\n",
"0 RowCount None between 1 and 6 ruleset_1\n",
"1 IsComplete \"c0\" None ruleset_1\n",
"2 Uniqueness \"c0\" > 0.95 ruleset_1\n",
"3 ColumnValues \"c0\" <= 2 ruleset_1\n",
"4 IsComplete \"c1\" None ruleset_1\n",
"5 Uniqueness \"c1\" > 0.95 ruleset_1\n",
"6 ColumnValues \"c1\" <= 2 ruleset_1\n",
"7 IsComplete \"c2\" None ruleset_1\n",
"8 ColumnValues \"c2\" in [0, 1, 2] ruleset_1\n",
"9 Uniqueness \"c2\" > 0.95 ruleset_1\n",
"0 RowCount None between 1 and 6 ruleset_2\n",
"1 IsComplete \"c0\" None ruleset_2\n",
"2 Uniqueness \"c0\" > 0.95 ruleset_2\n",
"3 ColumnValues \"c0\" <= 2 ruleset_2\n",
"4 IsComplete \"c1\" None ruleset_2\n",
"5 Uniqueness \"c1\" > 0.95 ruleset_2\n",
"6 ColumnValues \"c1\" <= 2 ruleset_2\n",
"7 IsComplete \"c2\" None ruleset_2\n",
"8 ColumnValues \"c2\" <= 1 ruleset_2\n",
"0 RowCount None between 2 and 8 ruleset_3\n",
"1 ColumnCorrelation \"c0\" \"c1\" > 0.8 ruleset_3\n",
"2 Uniqueness \"c0\" > 0.95 ruleset_3"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wr.data_quality.get_ruleset(name=[first_ruleset, second_ruleset, third_ruleset])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e8024a77",
"metadata": {},
"source": [
"## Evaluate Data Quality for a given partition\n",
"\n",
"A data quality evaluation run can be limited to specific partition(s) by leveraging the `pushDownPredicate` expression in the `additional_options` argument "
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "332d2e95",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Description | \n",
" Result | \n",
" ResultId | \n",
" EvaluationMessage | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" Rule_1 | \n",
" RowCount between 2 and 8 | \n",
" PASS | \n",
" dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6 | \n",
" NaN | \n",
"
\n",
" \n",
" 1 | \n",
" Rule_2 | \n",
" ColumnCorrelation \"c0\" \"c1\" > 0.8 | \n",
" FAIL | \n",
" dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6 | \n",
" Value: 0.5 does not meet the constraint requir... | \n",
"
\n",
" \n",
" 2 | \n",
" Rule_3 | \n",
" Uniqueness \"c0\" > 0.95 | \n",
" PASS | \n",
" dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Description Result \\\n",
"0 Rule_1 RowCount between 2 and 8 PASS \n",
"1 Rule_2 ColumnCorrelation \"c0\" \"c1\" > 0.8 FAIL \n",
"2 Rule_3 Uniqueness \"c0\" > 0.95 PASS \n",
"\n",
" ResultId \\\n",
"0 dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6 \n",
"1 dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6 \n",
"2 dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6 \n",
"\n",
" EvaluationMessage \n",
"0 NaN \n",
"1 Value: 0.5 does not meet the constraint requir... \n",
"2 NaN "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({\"c0\": [2, 0, 1], \"c1\": [1, 0, 2], \"c2\": [1, 1, 1]})\n",
"wr.s3.to_parquet(df, path, dataset=True, database=glue_database, table=glue_table, partition_cols=[\"c2\"])\n",
"\n",
"wr.data_quality.evaluate_ruleset(\n",
" name=third_ruleset,\n",
" iam_role_arn=iam_role_arn,\n",
" number_of_workers=2,\n",
" additional_options={\n",
" \"pushDownPredicate\": \"(c2 == '1')\",\n",
" },\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "awswrangler-v9JnknIF-py3.8",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"vscode": {
"interpreter": {
"hash": "83297b058d59ee0acd247586c837429190a8258f15c0eea6234359f5557dde51"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}