{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b0ee9a28",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3a2a7b51",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# 37 - Glue Data Quality"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42724a76",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    " AWS Glue Data Quality helps you evaluate and monitor the quality of your data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Create test data\n",
    "\n",
    "First, let's start by creating test data, writing it to S3, and registering it in the Glue Data Catalog."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5237eed8",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "is_executing": true,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import awswrangler as wr\n",
    "import pandas as pd\n",
    "\n",
    "glue_database = \"aws_sdk_pandas\"\n",
    "glue_table = \"my_glue_table\"\n",
    "path = \"s3://BUCKET_NAME/my_glue_table/\"\n",
    "\n",
    "df = pd.DataFrame({\"c0\": [0, 1, 2], \"c1\": [0, 1, 2], \"c2\": [0, 0, 0]})\n",
    "wr.s3.to_parquet(df, path, dataset=True, database=glue_database, table=glue_table, partition_cols=[\"c2\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13ba6d9d",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Start with recommended data quality rules"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "81fa434e",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "AWS Glue Data Quality can recommend a set of data quality rules so you can get started quickly.\n",
    "\n",
    "Note: Running Glue Data Quality recommendation and evaluation tasks requires an IAM role. This role must trust the Glue principal and allow permissions to various resources including the Glue table and the S3 bucket where your data is stored. Moreover, data quality IAM actions must be granted. To find out more, check [Authorization](https://docs.aws.amazon.com/glue/latest/dg/data-quality-authorization.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a1983e63",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rule_type</th>\n",
       "      <th>parameter</th>\n",
       "      <th>expression</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RowCount</td>\n",
       "      <td>None</td>\n",
       "      <td>between 1 and 6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>in [\"0\"]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      rule_type parameter       expression\n",
       "0      RowCount      None  between 1 and 6\n",
       "1    IsComplete      \"c0\"             None\n",
       "2    Uniqueness      \"c0\"           > 0.95\n",
       "3  ColumnValues      \"c0\"             <= 2\n",
       "4    IsComplete      \"c1\"             None\n",
       "5    Uniqueness      \"c1\"           > 0.95\n",
       "6  ColumnValues      \"c1\"             <= 2\n",
       "7    IsComplete      \"c2\"             None\n",
       "8  ColumnValues      \"c2\"         in [\"0\"]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_ruleset = \"ruleset_1\"\n",
    "iam_role_arn = \"arn:aws:iam::...\" # IAM role assumed by the Glue Data Quality job to access resources\n",
    "\n",
    "df_recommended_ruleset = wr.data_quality.create_recommendation_ruleset(  # Creates a recommended ruleset\n",
    "    name=first_ruleset,\n",
    "    database=glue_database,\n",
    "    table=glue_table,\n",
    "    iam_role_arn=iam_role_arn,\n",
    "    number_of_workers=2,\n",
    ")\n",
    "\n",
    "df_recommended_ruleset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a34270e9",
   "metadata": {},
   "source": [
    "## Update the recommended rules\n",
    "\n",
    "Recommended rulesets are not perfect and you are likely to modify them or create your own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "78f01dba",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rule_type</th>\n",
       "      <th>parameter</th>\n",
       "      <th>expression</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RowCount</td>\n",
       "      <td>None</td>\n",
       "      <td>between 1 and 6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>in [0, 1, 2]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      rule_type parameter       expression\n",
       "0      RowCount      None  between 1 and 6\n",
       "1    IsComplete      \"c0\"             None\n",
       "2    Uniqueness      \"c0\"           > 0.95\n",
       "3  ColumnValues      \"c0\"             <= 2\n",
       "4    IsComplete      \"c1\"             None\n",
       "5    Uniqueness      \"c1\"           > 0.95\n",
       "6  ColumnValues      \"c1\"             <= 2\n",
       "7    IsComplete      \"c2\"             None\n",
       "8  ColumnValues      \"c2\"     in [0, 1, 2]\n",
       "9    Uniqueness      \"c2\"           > 0.95"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Append and update rules\n",
    "df_updated_ruleset = df_recommended_ruleset.append(\n",
    "    {\"rule_type\": \"Uniqueness\", \"parameter\": '\"c2\"', \"expression\": \"> 0.95\"}, ignore_index=True\n",
    ")\n",
    "\n",
    "df_updated_ruleset.at[8, \"expression\"] = \"in [0, 1, 2]\"\n",
    "\n",
    "# Update the existing ruleset (upsert)\n",
    "wr.data_quality.update_ruleset(\n",
    "    name=first_ruleset,\n",
    "    df_rules=df_updated_ruleset,\n",
    "    mode=\"upsert\",  # update existing or insert new rules to the ruleset\n",
    ")\n",
    "\n",
    "wr.data_quality.get_ruleset(name=first_ruleset)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "52ee5851",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Run a data quality task\n",
    "\n",
    "The ruleset can now be evaluated against the data. A cluster with 2 workers is used for the run. It returns a report with `PASS`/`FAIL` results for each rule."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "a3aeec39",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Description</th>\n",
       "      <th>Result</th>\n",
       "      <th>ResultId</th>\n",
       "      <th>EvaluationMessage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Rule_1</td>\n",
       "      <td>RowCount between 1 and 6</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Rule_2</td>\n",
       "      <td>IsComplete \"c0\"</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Rule_3</td>\n",
       "      <td>Uniqueness \"c0\" &gt; 0.95</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Rule_4</td>\n",
       "      <td>ColumnValues \"c0\" &lt;= 2</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Rule_5</td>\n",
       "      <td>IsComplete \"c1\"</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Rule_6</td>\n",
       "      <td>Uniqueness \"c1\" &gt; 0.95</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Rule_7</td>\n",
       "      <td>ColumnValues \"c1\" &lt;= 2</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Rule_8</td>\n",
       "      <td>IsComplete \"c2\"</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Rule_9</td>\n",
       "      <td>ColumnValues \"c2\" in [0,1,2]</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Rule_10</td>\n",
       "      <td>Uniqueness \"c2\" &gt; 0.95</td>\n",
       "      <td>FAIL</td>\n",
       "      <td>dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5</td>\n",
       "      <td>Value: 0.0 does not meet the constraint requir...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Name                   Description Result  \\\n",
       "0   Rule_1      RowCount between 1 and 6   PASS   \n",
       "1   Rule_2               IsComplete \"c0\"   PASS   \n",
       "2   Rule_3        Uniqueness \"c0\" > 0.95   PASS   \n",
       "3   Rule_4        ColumnValues \"c0\" <= 2   PASS   \n",
       "4   Rule_5               IsComplete \"c1\"   PASS   \n",
       "5   Rule_6        Uniqueness \"c1\" > 0.95   PASS   \n",
       "6   Rule_7        ColumnValues \"c1\" <= 2   PASS   \n",
       "7   Rule_8               IsComplete \"c2\"   PASS   \n",
       "8   Rule_9  ColumnValues \"c2\" in [0,1,2]   PASS   \n",
       "9  Rule_10        Uniqueness \"c2\" > 0.95   FAIL   \n",
       "\n",
       "                                            ResultId  \\\n",
       "0  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "1  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "2  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "3  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "4  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "5  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "6  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "7  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "8  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "9  dqresult-be413b527c0e5520ad843323fecd9cf2e2edbdd5   \n",
       "\n",
       "                                   EvaluationMessage  \n",
       "0                                                NaN  \n",
       "1                                                NaN  \n",
       "2                                                NaN  \n",
       "3                                                NaN  \n",
       "4                                                NaN  \n",
       "5                                                NaN  \n",
       "6                                                NaN  \n",
       "7                                                NaN  \n",
       "8                                                NaN  \n",
       "9  Value: 0.0 does not meet the constraint requir...  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wr.data_quality.evaluate_ruleset(\n",
    "    name=first_ruleset,\n",
    "    iam_role_arn=iam_role_arn,\n",
    "    number_of_workers=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "170cd16d",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Create ruleset from Data Quality Definition Language definition"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e973a4fd",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The Data Quality Definition Language (DQDL) is a domain specific language that you can use to define Data Quality rules. For the full syntax reference, see [DQDL](https://docs.aws.amazon.com/glue/latest/dg/dqdl.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "f5209459",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "second_ruleset = \"ruleset_2\"\n",
    "\n",
    "dqdl_rules = (\n",
    "    \"Rules = [\"\n",
    "    \"RowCount between 1 and 6,\"\n",
    "    'IsComplete \"c0\",'\n",
    "    'Uniqueness \"c0\" > 0.95,'\n",
    "    'ColumnValues \"c0\" <= 2,'\n",
    "    'IsComplete \"c1\",'\n",
    "    'Uniqueness \"c1\" > 0.95,'\n",
    "    'ColumnValues \"c1\" <= 2,'\n",
    "    'IsComplete \"c2\",'\n",
    "    'ColumnValues \"c2\" <= 1'\n",
    "    \"]\"\n",
    ")\n",
    "\n",
    "wr.data_quality.create_ruleset(\n",
    "    name=second_ruleset,\n",
    "    database=glue_database,\n",
    "    table=glue_table,\n",
    "    dqdl_rules=dqdl_rules,\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9b791ace",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Create or update a ruleset from a data frame"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "74fddded",
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "AWS SDK for pandas also enables you to create or update a ruleset from a pandas data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "f1d1905c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rule_type</th>\n",
       "      <th>parameter</th>\n",
       "      <th>expression</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RowCount</td>\n",
       "      <td>None</td>\n",
       "      <td>between 2 and 8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ColumnCorrelation</td>\n",
       "      <td>\"c0\" \"c1\"</td>\n",
       "      <td>&gt; 0.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           rule_type  parameter       expression\n",
       "0           RowCount       None  between 2 and 8\n",
       "1  ColumnCorrelation  \"c0\" \"c1\"            > 0.8\n",
       "2         Uniqueness       \"c0\"           > 0.95"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "third_ruleset = \"ruleset_3\"\n",
    "\n",
    "df_rules = pd.DataFrame({\n",
    "    \"rule_type\": [\"RowCount\", \"ColumnCorrelation\", \"Uniqueness\"],\n",
    "    \"parameter\": [None, '\"c0\" \"c1\"', '\"c0\"'],\n",
    "    \"expression\": [\"between 2 and 8\", \"> 0.8\", \"> 0.95\"],\n",
    "})\n",
    "\n",
    "wr.data_quality.create_ruleset(\n",
    "    name=third_ruleset,\n",
    "    df_rules=df_rules,\n",
    "    database=glue_database,\n",
    "    table=glue_table,\n",
    ")\n",
    "\n",
    "wr.data_quality.get_ruleset(name=third_ruleset)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c846421e",
   "metadata": {},
   "source": [
    "## Get multiple rulesets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "b8b1d0b0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rule_type</th>\n",
       "      <th>parameter</th>\n",
       "      <th>expression</th>\n",
       "      <th>ruleset</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RowCount</td>\n",
       "      <td>None</td>\n",
       "      <td>between 1 and 6</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>None</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>None</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>None</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>in [0, 1, 2]</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "      <td>ruleset_1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RowCount</td>\n",
       "      <td>None</td>\n",
       "      <td>between 1 and 6</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>None</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>None</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c1\"</td>\n",
       "      <td>&lt;= 2</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>IsComplete</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>None</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>ColumnValues</td>\n",
       "      <td>\"c2\"</td>\n",
       "      <td>&lt;= 1</td>\n",
       "      <td>ruleset_2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RowCount</td>\n",
       "      <td>None</td>\n",
       "      <td>between 2 and 8</td>\n",
       "      <td>ruleset_3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ColumnCorrelation</td>\n",
       "      <td>\"c0\" \"c1\"</td>\n",
       "      <td>&gt; 0.8</td>\n",
       "      <td>ruleset_3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Uniqueness</td>\n",
       "      <td>\"c0\"</td>\n",
       "      <td>&gt; 0.95</td>\n",
       "      <td>ruleset_3</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           rule_type  parameter       expression    ruleset\n",
       "0           RowCount       None  between 1 and 6  ruleset_1\n",
       "1         IsComplete       \"c0\"             None  ruleset_1\n",
       "2         Uniqueness       \"c0\"           > 0.95  ruleset_1\n",
       "3       ColumnValues       \"c0\"             <= 2  ruleset_1\n",
       "4         IsComplete       \"c1\"             None  ruleset_1\n",
       "5         Uniqueness       \"c1\"           > 0.95  ruleset_1\n",
       "6       ColumnValues       \"c1\"             <= 2  ruleset_1\n",
       "7         IsComplete       \"c2\"             None  ruleset_1\n",
       "8       ColumnValues       \"c2\"     in [0, 1, 2]  ruleset_1\n",
       "9         Uniqueness       \"c2\"           > 0.95  ruleset_1\n",
       "0           RowCount       None  between 1 and 6  ruleset_2\n",
       "1         IsComplete       \"c0\"             None  ruleset_2\n",
       "2         Uniqueness       \"c0\"           > 0.95  ruleset_2\n",
       "3       ColumnValues       \"c0\"             <= 2  ruleset_2\n",
       "4         IsComplete       \"c1\"             None  ruleset_2\n",
       "5         Uniqueness       \"c1\"           > 0.95  ruleset_2\n",
       "6       ColumnValues       \"c1\"             <= 2  ruleset_2\n",
       "7         IsComplete       \"c2\"             None  ruleset_2\n",
       "8       ColumnValues       \"c2\"             <= 1  ruleset_2\n",
       "0           RowCount       None  between 2 and 8  ruleset_3\n",
       "1  ColumnCorrelation  \"c0\" \"c1\"            > 0.8  ruleset_3\n",
       "2         Uniqueness       \"c0\"           > 0.95  ruleset_3"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wr.data_quality.get_ruleset(name=[first_ruleset, second_ruleset, third_ruleset])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e8024a77",
   "metadata": {},
   "source": [
    "## Evaluate Data Quality for a given partition\n",
    "\n",
    "A data quality evaluation run can be limited to specific partition(s) by leveraging the `pushDownPredicate` expression in the `additional_options` argument "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "332d2e95",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Description</th>\n",
       "      <th>Result</th>\n",
       "      <th>ResultId</th>\n",
       "      <th>EvaluationMessage</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Rule_1</td>\n",
       "      <td>RowCount between 2 and 8</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Rule_2</td>\n",
       "      <td>ColumnCorrelation \"c0\" \"c1\" &gt; 0.8</td>\n",
       "      <td>FAIL</td>\n",
       "      <td>dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6</td>\n",
       "      <td>Value: 0.5 does not meet the constraint requir...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Rule_3</td>\n",
       "      <td>Uniqueness \"c0\" &gt; 0.95</td>\n",
       "      <td>PASS</td>\n",
       "      <td>dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name                        Description Result  \\\n",
       "0  Rule_1           RowCount between 2 and 8   PASS   \n",
       "1  Rule_2  ColumnCorrelation \"c0\" \"c1\" > 0.8   FAIL   \n",
       "2  Rule_3             Uniqueness \"c0\" > 0.95   PASS   \n",
       "\n",
       "                                            ResultId  \\\n",
       "0  dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6   \n",
       "1  dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6   \n",
       "2  dqresult-f676cfe0345aa93f492e3e3c3d6cf1ad99b84dc6   \n",
       "\n",
       "                                   EvaluationMessage  \n",
       "0                                                NaN  \n",
       "1  Value: 0.5 does not meet the constraint requir...  \n",
       "2                                                NaN  "
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({\"c0\": [2, 0, 1], \"c1\": [1, 0, 2], \"c2\": [1, 1, 1]})\n",
    "wr.s3.to_parquet(df, path, dataset=True, database=glue_database, table=glue_table, partition_cols=[\"c2\"])\n",
    "\n",
    "wr.data_quality.evaluate_ruleset(\n",
    "    name=third_ruleset,\n",
    "    iam_role_arn=iam_role_arn,\n",
    "    number_of_workers=2,\n",
    "    additional_options={\n",
    "        \"pushDownPredicate\": \"(c2 == '1')\",\n",
    "    },\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "awswrangler-v9JnknIF-py3.8",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "vscode": {
   "interpreter": {
    "hash": "83297b058d59ee0acd247586c837429190a8258f15c0eea6234359f5557dde51"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}