[Back to main guide](../README.md) | [Next](activity8.md) ___ ## 7. Observe the data pattern and duplicates in data You can now query data in Data Lake using Amazon Athena. Perform the below steps: a) Login as **dlanalyst** if not already b) Navigate to **Lake Formation Console → Data Catalog → Tables** c) Select the **rawdata** table → **Actions → View data** d) In the **Athena console** → select database as **patientdb → tables rawdata → options Preview** e) You can download the csv by removing the **limit** clause in the select statement and clicking on the **Download the results** icon in the Results section

NOTE: After landing on Athena console, if you get an error or query doesn’t run, click on the **Set up a query result location in Amazon S3** link and enter the value as **s3://\<\\>/query/** **Note the trailing slash in the above path!**

f) Click on **Save** and run the query. You can download entire dataset by removing the **“limit 10”** clause from the SQL, running the query again and by clicking on the **download** icon as highlighted below.

g) Open the downloaded file in excel, sort by patient_id and observe the duplicates

As highlighted with different colors in the above table with identifying different groups that includes the original patient record grouped with its duplicates. The patient_id values are generated in a specific format that helps us identify the such groups. The format is “rec-\-org/dup-\” followed by FEBRL data gen tool. As a next step, we will create, teach and tune AWS Lake Formation FindMatches ML Transform and then use it in the Glue ETL job to find matches and/or remove the duplicates. ___ [Back to main guide](../README.md) | [Next](activity8.md)