+++
title = "Optional - Text Analytics using Athena UDFs "
date = 2020-10-22T16:16:01+01:00
weight = 46
chapter = true
pre = "5.8. "
+++
## Introduction
You already know how to use Amazon Athena to transform data in Amazon S3 using simple SQL commands and the built-in functions in Athena. Now you can also use Athena to translate and analyze text fields, thanks to [Amazon Translate](https://aws.amazon.com/translate/), [Amazon Comprehend](https://aws.amazon.com/comprehend/), and the power of [Athena User Defined Functions (UDFs)](https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html).
If your dataset does not have text fields such as customer reviews, [you can skip this lab](/en/visualization.html). However, if your dataset does have text fields, you can continue with this lab to further enrich your dataset.

In this lab, you will see how to use Athena UDFs to do the following:
- Detect the dominant language of a text field
- Detect the prevailing sentiment expressed (positive, negative, neutral, or mixed)
- Detect or redact entities (such as items, places, or quantities)

You can also use the UDFs for other text analytics use cases, including:
- Detect or redact PII
- Translate text from one language to another

For more information on different use cases, check this blog post on [Athena Text Analytics](https://aws.amazon.com/blogs/machine-learning/translate-and-analyze-text-using-sql-functions-with-amazon-athena-amazon-translate-and-amazon-comprehend/).
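For example, the Lambda function deployed below also exposes a translation function backed by Amazon Translate. The sketch below uses the `translate_text` signature shown in that blog post; verify it against the function's documentation before use:

```sql
-- Translate a sample string to French; 'auto' asks Amazon Translate to
-- detect the source language, and NULL means no custom terminology.
USING EXTERNAL FUNCTION translate_text(text_col VARCHAR, sourcelang VARCHAR, targetlang VARCHAR, terminologyname VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT translate_text('It is a beautiful day in the neighborhood', 'auto', 'fr', NULL) AS translated_text
```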
## Architecture
Athena UDFs use AWS Lambda to implement the function logic:

The Lambda function will invoke Amazon Comprehend APIs to detect the dominant [language](https://docs.aws.amazon.com/comprehend/latest/dg/API_BatchDetectDominantLanguage.html), detect [sentiment](https://docs.aws.amazon.com/comprehend/latest/dg/API_BatchDetectSentiment.html), and [extract entities](https://docs.aws.amazon.com/comprehend/latest/dg/API_BatchDetectEntities.html) from text fields.
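An Athena UDF is declared inline in the query itself and bound to the Lambda function by name. Below is a generic illustration of the syntax; the function, column, and table names are placeholders:

```sql
-- Declare a UDF backed by a Lambda function, then call it like any SQL function.
USING EXTERNAL FUNCTION my_udf(input VARCHAR) RETURNS VARCHAR LAMBDA 'my-lambda-function'
SELECT my_udf(some_column) AS result
FROM some_table
```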
## Installing the Text Analytics UDF
We will be using a pre-built Lambda function that is available on [GitHub](https://github.com/aws-samples/aws-athena-udfs-textanalytics). Follow the steps below to deploy the function:
1. Navigate to the [TextAnalyticsUDFHandler](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/create/app?applicationId=arn:aws:serverlessrepo:us-east-1:912625584728:applications/TextAnalyticsUDFHandler) application in the AWS Serverless Application Repository.
2. Switch to the region you are using.
3. In the **Application settings** section, keep the settings at their defaults.
4. Select **I acknowledge that this app creates custom IAM roles**.
5. Choose **Deploy**.

And that’s it! You now have a new Lambda function called **_textanalytics-udf_** and are ready to try some text analytics queries in Athena!
## Analyzing Text
{{% notice warning %}}
In addition to Athena query costs, the text analytics UDF incurs usage costs from Lambda and Amazon Comprehend. The amount you pay depends on the total number of records and characters that you process with the UDF. For more information, see [AWS Lambda pricing](https://aws.amazon.com/lambda/pricing/) and [Amazon Comprehend pricing](https://aws.amazon.com/comprehend/pricing/). **Please limit the number of rows that are passed to the UDF.**
{{% /notice %}}
Open the [AWS Management Console for Athena](https://console.aws.amazon.com/athena/home) in the same region you are using.
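To verify that the function works, you can first run a quick standalone query (this example appears in the blog post linked above and should return `POSITIVE`):

```sql
USING EXTERNAL FUNCTION detect_sentiment(text_col VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT detect_sentiment('I am very happy', 'en') AS sentiment
```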
### Detecting the language
In this sub-section, we will create a new table with one new column added: **_language_**. The LIMIT clause caps the query at 2,000 records to limit the number of rows passed to the UDF.
1. In the Athena **Query Editor**, modify the below query by replacing:
- `<language_table>`: The name of the new table with the _language_ column
- `<text_column>`: The name of the column that contains the text
- `<source_table>`: The name of the table that contains the text field
```sql
CREATE TABLE <language_table> WITH (format='parquet') AS
USING EXTERNAL FUNCTION detect_dominant_language(col1 VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT *, detect_dominant_language(<text_column>) AS language
FROM <source_table>
LIMIT 2000
```
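Below is a hypothetical example. It assumes a source table named _amazon_reviews_ with a text column _review_body_ (your table and column names will differ):

```sql
CREATE TABLE amazon_reviews_with_language WITH (format='parquet') AS
USING EXTERNAL FUNCTION detect_dominant_language(col1 VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT *, detect_dominant_language(review_body) AS language
FROM amazon_reviews
LIMIT 2000
```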
2. Choose **Run Query**
3. Run the below query to list all the languages detected, sorted by the number of reviews:
```sql
SELECT language, count(*) AS count FROM <language_table> GROUP BY language ORDER BY count DESC
```

#### What just happened
We invoked the **_textanalytics-udf_** Lambda function and used the _detect_dominant_language_ method inside the function to detect the language of the text in 2,000 rows of `<source_table>`, writing the results into a new column named _language_.
The Lambda function invoked the Comprehend API to detect the dominant language.
### Detecting Sentiment and Entities
In this sub-section, we will create a new table with two new columns added: **_sentiment_** and **_entities_**. We will not be using the LIMIT clause because the input is the table [created in the previous section](#detecting-the-language), which is already limited to 2,000 rows.
1. In the Athena **Query Editor**, modify the below query by replacing:
- `<analysis_table>`: The name of the new table that will be created with the 2 new columns
- `<text_column>`: The name of the column that contains the text
- `<language_table>`: The name of the table that was created in the previous section
```sql
CREATE TABLE <analysis_table> WITH (format='parquet') AS
USING
EXTERNAL FUNCTION detect_sentiment_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf',
EXTERNAL FUNCTION detect_entities_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT *,
detect_sentiment_all(<text_column>, language) AS sentiment,
detect_entities_all(<text_column>, language) AS entities
FROM <language_table>
WHERE language IN ('ar', 'hi', 'ko', 'zh-TW', 'ja', 'zh', 'de', 'pt', 'en', 'it', 'fr', 'es')
```
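Below is a hypothetical example, continuing the Amazon reviews scenario; _amazon_reviews_with_text_analysis_ is the output table name assumed by the examples later in this lab, and _amazon_reviews_with_language_ is the table created in the previous section:

```sql
CREATE TABLE amazon_reviews_with_text_analysis WITH (format='parquet') AS
USING
EXTERNAL FUNCTION detect_sentiment_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf',
EXTERNAL FUNCTION detect_entities_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT *,
detect_sentiment_all(review_body, language) AS sentiment,
detect_entities_all(review_body, language) AS entities
FROM amazon_reviews_with_language
WHERE language IN ('ar', 'hi', 'ko', 'zh-TW', 'ja', 'zh', 'de', 'pt', 'en', 'it', 'fr', 'es')
```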
2. Choose **Run Query**
3. Run the below query to preview the new table:
```sql
SELECT * FROM <analysis_table> LIMIT 10;
```
4. Scroll sideways to the two new columns: _sentiment_ and _entities_.
- Sentiment

- Entities

You will see that both columns contain nested JSON that is not well suited for analysis, so let's fix that.
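For reference, the nested JSON has roughly the following shape. The field names are taken from the `JSON_EXTRACT` paths used in the next two sub-sections; the values shown here are made up:

```sql
-- sentiment: a single JSON object per row, e.g.:
--   {"sentiment":"POSITIVE","sentimentScore":{"positive":0.98,"negative":0.01,"neutral":0.01,"mixed":0.0}}
-- entities: a JSON array of entity objects per row, e.g.:
--   [{"text":"Kindle","type":"PRODUCT","score":0.92,"beginOffset":0,"endOffset":6}, ...]
SELECT sentiment, entities FROM <analysis_table> LIMIT 10
```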
#### Preparing sentiment for analysis
1. Run the following SQL query to create a new table containing sentiment and sentiment scores expanded into separate columns:
```sql
CREATE TABLE <sentiment_final_table> WITH (format='parquet') AS
SELECT
<columns_to_keep>, language,
CAST(JSON_EXTRACT(sentiment,'$.sentiment') AS VARCHAR) AS sentiment,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.positive') AS DOUBLE ) AS positive_score,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.negative') AS DOUBLE ) AS negative_score,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.neutral') AS DOUBLE ) AS neutral_score,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.mixed') AS DOUBLE ) AS mixed_score,
<text_columns>
FROM <analysis_table>
```
Below is an example:
```sql
CREATE TABLE sentiment_results_final WITH (format='parquet') AS
SELECT
review_date, year, product_title, star_rating, language,
CAST(JSON_EXTRACT(sentiment,'$.sentiment') AS VARCHAR) AS sentiment,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.positive') AS DOUBLE ) AS positive_score,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.negative') AS DOUBLE ) AS negative_score,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.neutral') AS DOUBLE ) AS neutral_score,
CAST(JSON_EXTRACT(sentiment,'$.sentimentScore.mixed') AS DOUBLE ) AS mixed_score,
review_headline, review_body
FROM amazon_reviews_with_text_analysis
```
2. Preview the new `<sentiment_final_table>` table (e.g., _sentiment_results_final_)
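Now that sentiment is a regular column, it is easy to aggregate. For example, a sketch of a sentiment distribution query, assuming the example table name above:

```sql
SELECT sentiment, count(*) AS count
FROM sentiment_results_final
GROUP BY sentiment
ORDER BY count DESC
```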

#### Preparing entities for analysis
1. Run the following SQL query to create a new table containing detected entities unnested into separate rows (inner subquery), with each field in a separate column (outer query):
```sql
CREATE TABLE <entities_final_table> WITH (format='parquet') AS
SELECT
<columns_to_keep>, language,
CAST(JSON_EXTRACT(entity_element, '$.text') AS VARCHAR ) AS entity,
CAST(JSON_EXTRACT(entity_element, '$.type') AS VARCHAR ) AS category,
CAST(JSON_EXTRACT(entity_element, '$.score') AS DOUBLE ) AS score,
CAST(JSON_EXTRACT(entity_element, '$.beginOffset') AS INTEGER ) AS beginoffset,
CAST(JSON_EXTRACT(entity_element, '$.endOffset') AS INTEGER ) AS endoffset,
<text_columns>
FROM
(
SELECT *
FROM
(
SELECT *,
CAST(JSON_PARSE(entities) AS ARRAY(json)) AS entities_array
FROM <analysis_table>
)
CROSS JOIN UNNEST(entities_array) AS t(entity_element)
)
```
Below is an example:
```sql
CREATE TABLE entities_results_final WITH (format='parquet') AS
SELECT
review_date, year, product_title, star_rating, language,
CAST(JSON_EXTRACT(entity_element, '$.text') AS VARCHAR ) AS entity,
CAST(JSON_EXTRACT(entity_element, '$.type') AS VARCHAR ) AS category,
CAST(JSON_EXTRACT(entity_element, '$.score') AS DOUBLE ) AS score,
CAST(JSON_EXTRACT(entity_element, '$.beginOffset') AS INTEGER ) AS beginoffset,
CAST(JSON_EXTRACT(entity_element, '$.endOffset') AS INTEGER ) AS endoffset,
review_headline, review_body
FROM
(
SELECT *
FROM
(
SELECT *,
CAST(JSON_PARSE(entities) AS ARRAY(json)) AS entities_array
FROM amazon_reviews_with_text_analysis
)
CROSS JOIN UNNEST(entities_array) AS t(entity_element)
)
```
2. Preview the new `<entities_final_table>` table (e.g., _entities_results_final_)
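As with sentiment, the unnested entities are now easy to aggregate. For example, a sketch of a query listing the most frequently detected entities, assuming the example table name above:

```sql
SELECT entity, category, count(*) AS count
FROM entities_results_final
GROUP BY entity, category
ORDER BY count DESC
LIMIT 20
```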

#### What just happened
We invoked the **_textanalytics-udf_** Lambda function and used the _detect_sentiment_all_ and _detect_entities_all_ methods inside the function to detect sentiment and extract entities, respectively. The results were written into 2 new columns: _sentiment_ and _entities_. Both columns were nested JSON, so to prepare them for analysis we unnested both columns into 2 new tables: `<sentiment_final_table>` and `<entities_final_table>` (in the examples, _sentiment_results_final_ and _entities_results_final_).
Both tables can be used as inputs to the QuickSight dataset that we will create in the [next lab](/en/visualization.html).