+++ title = "Optional - Text Analytics using Athena UDFs " date = 2020-10-22T16:16:01+01:00 weight = 46 chapter = true pre = "5.8. " +++ ## Introduction You already know how to use Amazon Athena to transform data in Amazon S3 using simple SQL commands and the built-in functions in Athena. Now you can also use Athena to translate and analyze text fields, thanks to [Amazon Translate](https://aws.amazon.com/translate/), [Amazon Comprehend](https://aws.amazon.com/comprehend/), and the power of [Athena User Defined Functions (UDFs)](https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html). If your dataset does not have text fields such as customer reviews, [you can skip this lab](/en/visualization.html). However, if your dataset has text fields (similar to the below) you can continue with this lab to further enrich your dataset. ![image](/athena_img/athena_udf_0.png) In this lab, you will see how to use Athena UDFs to do the following: - Detect the dominant language of a text field - Detect the prevailing sentiment expressed—positive, negative, neither, or both - Detect or redact entities (such as items, places, or quantities) You can also use the UDFs for other Text Analytics use cases, such as but not limited to: - Detect or redact PII - Translate text from one language to another For more information on different use-cases, check this blog post on [Athena Text Analytics](https://aws.amazon.com/blogs/machine-learning/translate-and-analyze-text-using-sql-functions-with-amazon-athena-amazon-translate-and-amazon-comprehend/) ## Architecture Athena UDF uses AWS Lambda to implement the function capability: ![image](/athena_img/athena_udf_1.png) The Lambda function used will invoke Comprehend APIs to detect [language](https://docs.aws.amazon.com/comprehend/latest/dg/API_BatchDetectDominantLanguage.html), [sentiment](https://docs.aws.amazon.com/comprehend/latest/dg/API_BatchDetectSentiment.html) and [extract entities](https://docs.aws.amazon.com/comprehend/latest/dg/API_BatchDetectEntities.html) from text fields. ## Installing the Text Analytics UDF We will be using a pre-build Lambda Function that is available on [GitHub](https://github.com/aws-samples/aws-athena-udfs-textanalytics). Follow the steps below to deploy the function: 1. Navigate to the [TextAnalyticsUDFHandler](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/create/app?applicationId=arn:aws:serverlessrepo:us-east-1:912625584728:applications/TextAnalyticsUDFHandler) application in the AWS Serverless Application Repository. 2. Switch to the region you are using. 3. In the Application settings section, keep the settings at their defaults. 4. Select I acknowledge that this app creates custom IAM roles. 5. Choose Deploy. ![image](/athena_img/athena_udf_2.png) And that’s it! Now you have a new Lambda function called **_textanalytics-udf_**. We are ready to try some text analytics queries in Athena! ## Analyzing Text {{% notice warning %}} In addition to Athena query costs, the text analytics UDF incurs usage costs from Lambda and Amazon Comprehend. The amount you pay is a factor of the total number of records and characters that you process with the UDF. For more information, see [AWS Lambda pricing](https://aws.amazon.com/lambda/pricing/), and[Amazon Comprehend pricing](https://aws.amazon.com/comprehend/pricing/). 
### Detecting the language

In this sub-section, we will create a new table with one new column added: **_language_**. The `LIMIT` clause restricts the query to 2,000 records, to limit the number of rows passed to the UDF.

1. In the Athena **Query Editor**, modify the below query by replacing:

   - `<table_with_language>`: the name of the new table that will be created with the _language_ column
   - `<text_column>`: the name of the column containing text
   - `<source_table>`: the name of the table containing the text field

   ```sql
   CREATE TABLE <table_with_language> WITH (format='parquet') AS
   USING EXTERNAL FUNCTION detect_dominant_language(col1 VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
   SELECT *, detect_dominant_language(<text_column>) AS language
   FROM <source_table>
   LIMIT 2000
   ```

2. Choose **Run Query**.
3. Run the below query to list all the languages detected, sorted by the number of reviews:

   ```sql
   SELECT language, count(*) AS count
   FROM <table_with_language>
   GROUP BY language
   ORDER BY count DESC
   ```

![image](/athena_img/athena_udf_3.png)

#### What just happened

We invoked the **_textanalytics-udf_** Lambda function and used the _detect_dominant_language_ method inside the function to detect the language of the text in 2,000 rows of `<source_table>`, writing the results into a new column named _language_. The Lambda function invoked the Comprehend API to detect the dominant language.

### Detecting Sentiment and Entities

In this sub-section, we will create a new table with two new columns added: **_sentiment_** and **_entities_**. We will not be using a `LIMIT` clause, because we will be using the table [created in the previous section](#detecting-the-language) as the input, and it is already limited to 2,000 rows.

1. In the Athena **Query Editor**, modify the below query by replacing:

   - `<table_with_text_analysis>`: the name of the new table that will be created with the two new columns
   - `<text_column>`: the name of the column containing text
   - `<table_with_language>`: the name of the table that was created in the previous section

   ```sql
   CREATE TABLE <table_with_text_analysis> WITH (format='parquet') AS
   USING
       EXTERNAL FUNCTION detect_sentiment_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf',
       EXTERNAL FUNCTION detect_entities_all(col1 VARCHAR, lang VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
   SELECT *,
       detect_sentiment_all(<text_column>, language) AS sentiment,
       detect_entities_all(<text_column>, language) AS entities
   FROM <table_with_language>
   WHERE language IN ('ar', 'hi', 'ko', 'zh-TW', 'ja', 'zh', 'de', 'pt', 'en', 'it', 'fr', 'es')
   ```

2. Choose **Run Query**.
3. Run the below query to preview the new table:

   ```sql
   SELECT * FROM <table_with_text_analysis> LIMIT 10;
   ```

4. Scroll sideways to the two new columns: _sentiment_ and _entities_.

   - Sentiment

     ![image](/athena_img/athena_udf_4.png)

   - Entities

     ![image](/athena_img/athena_udf_5.png)

You will see that both columns contain nested JSON and are not well prepared for analysis, so let's fix that.
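Before flattening the real data, you can experiment with the extraction pattern on an inline literal shaped like the UDF output (a minimal sketch; the sample JSON values are hypothetical, and the field names are inferred from the JSON paths used in the queries below):

```sql
-- Hypothetical sentiment JSON inlined via VALUES, so the query needs no table
SELECT
    CAST(JSON_EXTRACT(s, '$.sentiment') AS VARCHAR) AS sentiment,
    CAST(JSON_EXTRACT(s, '$.sentimentScore.positive') AS DOUBLE) AS positive_score
FROM (VALUES
    ('{"sentiment":"POSITIVE","sentimentScore":{"positive":0.98,"negative":0.01,"neutral":0.01,"mixed":0.0}}')
) AS t(s)
```

The same `JSON_EXTRACT`-and-`CAST` pattern is what the next two queries apply, column by column, to the _sentiment_ and _entities_ fields.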
#### Preparing sentiment for analysis

1. Run the following SQL query to create a new table containing sentiment and sentiment scores expanded into separate columns (replace `<sentiment_results_table>` with a name for the new table, and the `<column_n>` and `<text_column_n>` placeholders with the columns you want to keep):

   ```sql
   CREATE TABLE <sentiment_results_table> WITH (format='parquet') AS
   SELECT
       <column_1>, <column_2>, <column_3>, <column_4>, language,
       CAST(JSON_EXTRACT(sentiment, '$.sentiment') AS VARCHAR) AS sentiment,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.positive') AS DOUBLE) AS positive_score,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.negative') AS DOUBLE) AS negative_score,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.neutral') AS DOUBLE) AS neutral_score,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.mixed') AS DOUBLE) AS mixed_score,
       <text_column_1>, <text_column_2>
   FROM <table_with_text_analysis>
   ```

   Below is an example:

   ```sql
   CREATE TABLE sentiment_results_final WITH (format='parquet') AS
   SELECT
       review_date, year, product_title, star_rating, language,
       CAST(JSON_EXTRACT(sentiment, '$.sentiment') AS VARCHAR) AS sentiment,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.positive') AS DOUBLE) AS positive_score,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.negative') AS DOUBLE) AS negative_score,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.neutral') AS DOUBLE) AS neutral_score,
       CAST(JSON_EXTRACT(sentiment, '$.sentimentScore.mixed') AS DOUBLE) AS mixed_score,
       review_headline, review_body
   FROM amazon_reviews_with_text_analysis
   ```

2. Preview the new `<sentiment_results_table>` table (in the example above, _sentiment_results_final_):

   ![image](/athena_img/athena_udf_6.png)

#### Preparing entities for analysis

1. Run the following SQL query to create a new table containing the detected entities unnested into separate rows (inner subquery), with each field in a separate column (outer query):

   ```sql
   CREATE TABLE <entities_results_table> WITH (format='parquet') AS
   SELECT
       <column_1>, <column_2>, <column_3>, <column_4>, language,
       CAST(JSON_EXTRACT(entity_element, '$.text') AS VARCHAR) AS entity,
       CAST(JSON_EXTRACT(entity_element, '$.type') AS VARCHAR) AS category,
       CAST(JSON_EXTRACT(entity_element, '$.score') AS DOUBLE) AS score,
       CAST(JSON_EXTRACT(entity_element, '$.beginOffset') AS INTEGER) AS beginoffset,
       CAST(JSON_EXTRACT(entity_element, '$.endOffset') AS INTEGER) AS endoffset,
       <text_column_1>, <text_column_2>
   FROM (
       SELECT *
       FROM (
           SELECT *, CAST(JSON_PARSE(entities) AS ARRAY(json)) AS entities_array
           FROM <table_with_text_analysis>
       )
       CROSS JOIN UNNEST(entities_array) AS t(entity_element)
   )
   ```

   Below is an example:

   ```sql
   CREATE TABLE entities_results_final WITH (format='parquet') AS
   SELECT
       review_date, year, product_title, star_rating, language,
       CAST(JSON_EXTRACT(entity_element, '$.text') AS VARCHAR) AS entity,
       CAST(JSON_EXTRACT(entity_element, '$.type') AS VARCHAR) AS category,
       CAST(JSON_EXTRACT(entity_element, '$.score') AS DOUBLE) AS score,
       CAST(JSON_EXTRACT(entity_element, '$.beginOffset') AS INTEGER) AS beginoffset,
       CAST(JSON_EXTRACT(entity_element, '$.endOffset') AS INTEGER) AS endoffset,
       review_headline, review_body
   FROM (
       SELECT *
       FROM (
           SELECT *, CAST(JSON_PARSE(entities) AS ARRAY(json)) AS entities_array
           FROM amazon_reviews_with_text_analysis
       )
       CROSS JOIN UNNEST(entities_array) AS t(entity_element)
   )
   ```

2. Preview the new `<entities_results_table>` table (in the example above, _entities_results_final_):

   ![image](/athena_img/athena_udf_7.png)

#### What just happened

We invoked the **_textanalytics-udf_** Lambda function and used the _detect_sentiment_all_ and _detect_entities_all_ methods inside the function to detect sentiment and extract entities, respectively. The results were written into two new columns: _sentiment_ and _entities_. Both columns were nested JSON, so to prepare them for analysis we unnested both columns into two new tables: `<sentiment_results_table>` and `<entities_results_table>`. Both tables can be used as input to the QuickSight dataset that we will create in the [next lab](/en/visualization.html).
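(Optional) Before moving on, you can explore the flattened tables with ordinary SQL. The queries below are a sketch that assumes the example table names used above (substitute your own); run each statement separately in the Athena Query Editor:

```sql
-- Distribution of detected sentiment per language
SELECT language, sentiment, count(*) AS review_count
FROM sentiment_results_final
GROUP BY language, sentiment
ORDER BY review_count DESC
```

```sql
-- Most frequently mentioned entities in each category
SELECT category, entity, count(*) AS mention_count
FROM entities_results_final
GROUP BY category, entity
ORDER BY mention_count DESC
LIMIT 20
```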
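The same `USING EXTERNAL FUNCTION` pattern extends to the other use cases mentioned in the introduction, such as translation. As a sketch, the following assumes the `translate_text` signature documented in the [aws-athena-udfs-textanalytics](https://github.com/aws-samples/aws-athena-udfs-textanalytics) README; verify the signature there before running:

```sql
-- Translate a literal string to French; 'auto' lets Amazon Translate detect the
-- source language, and NULL means no custom terminology is applied
USING EXTERNAL FUNCTION translate_text(text_col VARCHAR, sourcelang VARCHAR, targetlang VARCHAR, customterminologyname VARCHAR) RETURNS VARCHAR LAMBDA 'textanalytics-udf'
SELECT translate_text('It is a beautiful day in the neighborhood', 'auto', 'fr', NULL) AS translated_text
```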