## Boost transcription accuracy with Amazon Transcribe custom vocabulary and localize transcripts with Amazon Translate custom terminology

This is the accompanying notebook for the re:Invent 2021 workshop AIM317 - Uncover insights from your customer conversations. Please run this notebook after reviewing the **[prerequisites and instructions](https://studio.us-east-1.prod.workshops.aws/preview/076e45e5-760d-41cf-bd22-a86c46ee462c/builds/83c4ddb7-fbc6-4e72-b5da-967f8fe7cfcb/en-US/1-transcribe-translate-calls)** from the workshop. 

## Prerequisites for this notebook

In [None]:
# First, let's install dependencies for the transcript word utility we will use in this notebook
!pip install python-docx --quiet
!pip install matplotlib --quiet
!pip install scipy --quiet

### Import libraries and initialize variables

In [None]:
import io
import os
import re
import uuid
import json
import time
import boto3
import pprint
import botocore
import sagemaker
import subprocess
from sagemaker import get_execution_role
from datetime import datetime, timezone

In [None]:
bucket = '' # Add your bucket name here

region = boto3.session.Session().region_name

# Amazon S3 (S3) client
s3 = boto3.client('s3', region)
s3_resource = boto3.resource('s3')
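try:
    s3.head_bucket(Bucket=bucket)
except (botocore.exceptions.ClientError, botocore.exceptions.ParamValidationError):
    # ClientError: bucket missing or not accessible; ParamValidationError: empty or malformed name
    print("The S3 bucket name {} you entered seems to be incorrect, please try again".format(bucket))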

In [None]:
INPUT_PATH_TRANSCRIBE = 'transcribe/input'
OUTPUT_PATH_TRANSCRIBE = 'transcribe/output'
INPUT_PATH_TRANSLATE = 'translate/input'
OUTPUT_PATH_TRANSLATE = 'translate/output'

In [None]:
region = boto3.session.Session().region_name
bucket_region = s3.head_bucket(Bucket=bucket)['ResponseMetadata']['HTTPHeaders']['x-amz-bucket-region']
assert bucket_region == region, "Your S3 bucket {} and this notebook need to be in the same region.".format(bucket)
# Amazon Transcribe client
transcribe_client = boto3.client("transcribe")
# Amazon Translate client
translate_client = boto3.client("translate")

In [None]:
# This is the execution role that will be used to call Amazon Transcribe and Amazon Translate
role = get_execution_role()
display(role)

## Amazon Transcribe Custom Vocabulary

### Create custom vocabulary

You can give Amazon Transcribe more information about how to process speech in your input file by creating a custom vocabulary in text file format. A custom vocabulary is a list of specific words that you want Amazon Transcribe to recognize in your audio input. These are generally domain-specific words and phrases, words that Amazon Transcribe isn't recognizing, or proper nouns.

In [None]:
# First, let's view our English vocabulary file
!pygmentize 'input/custom-vocabulary-EN.txt'

In [None]:
# Next, the Spanish vocabulary file - uncomment the line below to view it if you like
#!pygmentize 'input/custom-vocabulary-ES.txt'

#### Custom vocabularies can be in table or list formats

Each vocabulary file can be in either table or list format; table format is strongly recommended because it gives you more options for, and more control over, the input and output of words within your custom vocabulary. As you saw above, we used the table format for this workshop. The table format has 4 columns, as explained below:

1. **Phrase**
The word or phrase that should be recognized. If the entry is a phrase, separate the words with a hyphen (-). For example, you type Los Angeles as Los-Angeles. The Phrase field is required.

1. **IPA**
The pronunciation of your word or phrase, written with characters from the International Phonetic Alphabet (IPA).

1. **SoundsLike**
The pronunciation of your word or phrase using the standard orthography of the language to mimic the way that the word sounds.

1. **DisplayAs**
Defines how the word or phrase looks when it's output. For example, if the word or phrase is Los-Angeles, you can specify the display form as "Los Angeles" so that the hyphen is not present in the output.
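
To make the four columns concrete, here is a minimal sketch that builds a tab-separated vocabulary table in memory. The helper name `build_vocabulary_table` and the sample phrases are hypothetical and are not part of the workshop files. For simplicity the sketch assumes the IPA and SoundsLike columns are left empty, so only Phrase and DisplayAs are filled in.

```python
import csv
import io

# Column order of the table format described above
COLUMNS = ["Phrase", "IPA", "SoundsLike", "DisplayAs"]


def build_vocabulary_table(display_forms):
    """Return a tab-separated vocabulary table for the given display forms.

    Assumption for this sketch: IPA and SoundsLike are left empty, and the
    Phrase is derived by replacing spaces with hyphens.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for display_as in display_forms:
        phrase = display_as.replace(" ", "-")
        writer.writerow([phrase, "", "", display_as])
    return buffer.getvalue()


# Hypothetical sample entries
table = build_vocabulary_table(["Los Angeles", "Amazon Transcribe"])
print(table)
```

The resulting text could be written to a `.txt` file and uploaded in the same way as the workshop's vocabulary files.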

In [None]:
# Next we will upload our vocabulary files to our S3 bucket
cust_vocab_en = 'custom-vocabulary-EN.txt'
cust_vocab_es = 'custom-vocabulary-ES.txt'
s3.upload_file('input/' + cust_vocab_en,bucket,INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_en)
s3.upload_file('input/' + cust_vocab_es,bucket,INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_es)

In [None]:
# Create the custom vocabulary in Transcribe
# The name of your custom vocabulary must be unique!
vocab_EN = 'custom-vocab-EN-' + str(uuid.uuid4())
vocab_ES = 'custom-vocab-ES-' + str(uuid.uuid4())

In [None]:
vocab_response_EN = transcribe_client.create_vocabulary(
 VocabularyName=vocab_EN,
 LanguageCode='en-US',
 VocabularyFileUri='s3://' + bucket + '/'+ INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_en
)

In [None]:
vocab_response_ES = transcribe_client.create_vocabulary(
 VocabularyName=vocab_ES,
 LanguageCode='es-US',
 VocabularyFileUri='s3://' + bucket + '/'+ INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_es
)

### Check Vocabulary status in Amazon Transcribe console

[Go to Amazon Transcribe Console](https://console.aws.amazon.com/transcribe/home?region=us-east-1#vocabulary)

This will take 3 to 5 minutes. Go to the **Perform Transcription** step below once the vocabulary has been created and is ready for use.
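
If you prefer to wait from the notebook instead of the console, a polling loop along these lines is one option. This is a sketch, not part of the workshop code: the helper name `wait_for_vocabulary`, the poll interval, and the attempt limit are assumptions. It relies on the `get_vocabulary` call returning a `VocabularyState` of `PENDING`, `READY`, or `FAILED`.

```python
import time


def wait_for_vocabulary(client, vocabulary_name, poll_seconds=15,
                        max_attempts=40, sleep=time.sleep):
    """Poll a Transcribe client until the vocabulary leaves the PENDING state.

    Returns 'READY' or 'FAILED' as soon as the service reports it, or
    'PENDING' if max_attempts polls pass without a final state.
    """
    for _ in range(max_attempts):
        response = client.get_vocabulary(VocabularyName=vocabulary_name)
        state = response["VocabularyState"]
        if state in ("READY", "FAILED"):
            return state
        sleep(poll_seconds)
    return "PENDING"


# Usage in this notebook (uncomment to run against your account):
# for name in (vocab_EN, vocab_ES):
#     print(name, wait_for_vocabulary(transcribe_client, name))
```

Passing the client and the `sleep` function as arguments keeps the helper easy to exercise without calling AWS.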
 

### Perform Transcription

In [None]:
# First let us list our audio files and then upload them to the S3 bucket
audio_dir = 'input/audio-recordings'

for subdir, dirs, files in os.walk(audio_dir):
    for file in files:
        s3.upload_file(os.path.join(subdir, file), bucket, 'transcribe/' + os.path.join(subdir, file))
        print("Uploaded to: " + "s3://" + bucket + '/transcribe/' + os.path.join(subdir, file))

In [None]:
# Define the method that will perform transcription

def transcribe(job_name, job_uri, lang_code, vocab_name):
    """Transcribe an audio file to text.
    Args:
        job_name (str): the name of the job that you specify;
                        the output json will be job_name.json
        job_uri (str): input path (in s3) to the file being transcribed
        lang_code (str): language code of the audio, for example 'en-US' or 'es-US'
        vocab_name (str): name of the custom vocabulary to use
    """
    try:
        transcribe_client.start_transcription_job(
            TranscriptionJobName=job_name,
            LanguageCode=lang_code,
            Media={"MediaFileUri": job_uri},
            Settings={'VocabularyName': vocab_name, 'MaxSpeakerLabels': 2, 'ShowSpeakerLabels': True}
        )

        # Give the job a moment to register, then print its initial status
        time.sleep(2)

        print(transcribe_client.get_transcription_job(TranscriptionJobName=job_name)['TranscriptionJob']['TranscriptionJobStatus'])

    except Exception as e:
        print(e)


**Note:** As you can see in the code below, we determine the language code to send to Amazon Transcribe ourselves. This is not required if you set [IdentifyLanguage to True](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/transcribe.html#TranscribeService.Client.start_transcription_job). In our case we needed to select either the English or the Spanish custom vocabulary for each audio file, so we went with specific language codes.

In [None]:
# Now we will loop through the recordings in our bucket to submit the transcription jobs
now = datetime.now()
time_now = now.strftime("%H.%M.%S")

paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix='transcribe/input/audio-recordings')
job_name_list = []

for page in pages:
    # 'Contents' is absent when the prefix matches no objects
    for obj in page.get('Contents', []):
        audio_name = obj['Key'].split('/')[3].split('.')[0]
        job_name = audio_name + '-' + time_now
        job_name_list.append(job_name)
        job_uri = f"s3://{bucket}/{obj['Key']}"
        print('Submitting transcription for audio: ' + job_name)
        vocab = ''
        lang_code = ''
        # The third hyphen-separated token of the file name is the language tag
        if audio_name.split('-')[2] == 'EN':
            vocab = vocab_EN
            lang_code = 'en-US'
        elif audio_name.split('-')[2] == 'ES':
            vocab = vocab_ES
            lang_code = 'es-US'
        # Submit the transcription job now; since we do not set an output bucket,
        # the transcript is stored in a bucket managed by Amazon Transcribe
        transcribe(job_name, job_uri, lang_code, vocab)
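
The language selection in the loop above can also be isolated into a small helper, which makes the file-naming assumption explicit and fails loudly on unexpected names. This is a sketch only: `language_for_audio`, `LANGUAGE_CODES`, and the sample file names are hypothetical. It assumes, as the loop does, that the third hyphen-separated token of the audio name is `EN` or `ES`.

```python
# Map the language tag embedded in the file name to a Transcribe language code
LANGUAGE_CODES = {"EN": "en-US", "ES": "es-US"}


def language_for_audio(audio_name):
    """Return the Transcribe language code for an audio file name.

    Assumes the third hyphen-separated token is the language tag.
    Raises ValueError for names that do not follow that convention.
    """
    parts = audio_name.split("-")
    if len(parts) < 3:
        raise ValueError(f"No language tag found in '{audio_name}'")
    tag = parts[2]
    if tag not in LANGUAGE_CODES:
        raise ValueError(f"Unsupported language tag '{tag}' in '{audio_name}'")
    return LANGUAGE_CODES[tag]


# Hypothetical file names that follow the convention
print(language_for_audio("demo-call-EN-01"))  # en-US
print(language_for_audio("demo-call-ES-02"))  # es-US
```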

### Check Transcription job status in Amazon Transcribe console

[Go to Amazon Transcribe Console](https://console.aws.amazon.com/transcribe/home?region=us-east-1#jobs)

All the jobs should complete in about 5 to 8 minutes in total. Go to the **Process Transcription output** step below once every transcription job shows a status of complete; if you continue earlier, you will get an error. No worries, just try again in a minute or so.
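
As an alternative to the console, the job statuses can be summarized from the notebook. The sketch below groups job names by the `TranscriptionJobStatus` value returned by `get_transcription_job` (`QUEUED`, `IN_PROGRESS`, `FAILED`, or `COMPLETED`). The helper name `group_jobs_by_status` is hypothetical.

```python
from collections import defaultdict


def group_jobs_by_status(client, job_names):
    """Return a dict mapping each job status to the list of job names in it."""
    grouped = defaultdict(list)
    for name in job_names:
        job = client.get_transcription_job(TranscriptionJobName=name)["TranscriptionJob"]
        grouped[job["TranscriptionJobStatus"]].append(name)
    return dict(grouped)


# Usage in this notebook (uncomment to run against your account):
# statuses = group_jobs_by_status(transcribe_client, job_name_list)
# print(statuses)
# ready = set(statuses) == {"COMPLETED"}
```

When every job is listed under `COMPLETED`, it is safe to continue with the next step.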
 

### Process Transcription output

#### Clone Transcribe helper repo

From a terminal window in your notebook instance, navigate to the current directory where this notebook resides, and execute the command `git clone https://github.com/aws-samples/amazon-transcribe-output-word-document` before executing the cell below. The steps you will have to follow are:

1. From Jupyter notebook home page on the right, select New --> Terminal
1. In the terminal window, type `cd SageMaker`
1. Now type `cd aim317-uncover-insights-customer-conversations`
1. Now type `cd notebooks`
1. Now type `cd 1-Transcribe-Translate-Calls`
1. Finally type the command `git clone https://github.com/aws-samples/amazon-transcribe-output-word-document`


#### Create a word document from call transcript

We will generate a Word document from the Amazon Transcribe response JSON so we can review the transcript. Once you execute the code in the next cell, go to your notebook folder and **you will see the Word document created with the Transcribe job name. Select this document, click download, and open it to review the transcript**.

In [None]:
!python amazon-transcribe-output-word-document/python/ts-to-word.py --inputJob {job_name_list[0]}

### Get Call Segments
We will get the call segments and speaker information to derive additional insights we can visualize in QuickSight. 

In [None]:
import pandas as pd

def upload_segments(transcript):
    # Get the speaker segments
    cols = ['transcript_name', 'start_time', 'end_time', 'speaker_label']
    spk_df = pd.DataFrame(columns=cols)
    for seg in transcript['results']['speaker_labels']['segments']:
        for item in seg['items']:
            spk_df.loc[len(spk_df.index)] = [transcript['jobName'], item['start_time'], item['end_time'], item['speaker_label']]
    # Get the speaker content (punctuation items have no start_time and are skipped)
    icols = ['transcript_name', 'start_time', 'end_time', 'confidence', 'content']
    item_df = pd.DataFrame(columns=icols)
    for itms in transcript['results']['items']:
        if itms.get('start_time') is not None:
            item_df.loc[len(item_df.index)] = [transcript['jobName'], itms['start_time'], itms['end_time'], itms['alternatives'][0]['confidence'], itms['alternatives'][0]['content']]

    # Merge the two on transcript name, start time and end time
    full_df = pd.merge(spk_df, item_df, how='left', on=['transcript_name', 'start_time', 'end_time'])
    # We will use the Transcribe Job Name for the CSV file name
    csv_file = transcript['jobName'] + '.csv'
    full_df.to_csv(csv_file, index=False)
    s3.upload_file(csv_file, bucket, 'quicksight/data/transcripts/' + csv_file)
    # The print below is too verbose so commenting for now - feel free to uncomment if needed
    #print("CSV file with speaker segments created and uploaded for visualization input to: " + "s3://" + bucket + "/" + "quicksight/data/transcripts/" + csv_file)
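
To see what the merge in `upload_segments` produces, here is a dependency-free sketch of the same join on a tiny hand-written transcript. The `sample` dictionary is made-up data that only mimics the fields the function reads, and `label_words` is a hypothetical helper. It iterates over the word items and looks up the speaker for each, which yields the same pairing as the pandas merge when every labelled timestamp has a matching word.

```python
def label_words(transcript):
    """Attach a speaker label to each spoken word in a Transcribe-style result."""
    # Index speaker labels by their (start_time, end_time) pair
    speaker_at = {}
    for segment in transcript["results"]["speaker_labels"]["segments"]:
        for item in segment["items"]:
            speaker_at[(item["start_time"], item["end_time"])] = item["speaker_label"]

    rows = []
    for item in transcript["results"]["items"]:
        # Punctuation items carry no timestamps, so they are skipped
        if "start_time" not in item:
            continue
        best = item["alternatives"][0]
        rows.append({
            "start_time": item["start_time"],
            "end_time": item["end_time"],
            "speaker_label": speaker_at.get((item["start_time"], item["end_time"])),
            "confidence": best["confidence"],
            "content": best["content"],
        })
    return rows


# Made-up sample that mimics the structure read by upload_segments
sample = {
    "jobName": "demo-job",
    "results": {
        "speaker_labels": {"segments": [{"items": [
            {"start_time": "0.00", "end_time": "0.45", "speaker_label": "spk_0"},
            {"start_time": "0.45", "end_time": "0.90", "speaker_label": "spk_1"},
        ]}]},
        "items": [
            {"start_time": "0.00", "end_time": "0.45",
             "alternatives": [{"confidence": "0.99", "content": "Hello"}]},
            {"alternatives": [{"confidence": "0.0", "content": ","}]},
            {"start_time": "0.45", "end_time": "0.90",
             "alternatives": [{"confidence": "0.97", "content": "hi"}]},
        ],
    },
}

for row in label_words(sample):
    print(row["speaker_label"], row["content"], row["confidence"])
```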

#### Upload transcript text files to S3 bucket

We will now get the full transcript from all the calls and send them to our S3 bucket in preparation for our translation tasks


In [None]:
# First we need an output directory
output_dir = os.path.join(os.getcwd(), 'output')
# exist_ok avoids an error if the directory is already there
os.makedirs(output_dir, exist_ok=True)

In [None]:
# Our transcript is in a presigned URL in Transcribe's S3 bucket, let us download it and get the text we need
import urllib3

# Reuse one connection pool for all downloads
http = urllib3.PoolManager()

for job in job_name_list:
    response = transcribe_client.get_transcription_job(
        TranscriptionJobName=job
    )
    file_name = response['TranscriptionJob']['Transcript']['TranscriptFileUri']
    transcribed_data = http.request('GET', file_name)
    original = json.loads(transcribed_data.data.decode('utf-8'))
    # Extract the speaker segments, confidence scores for each call
    # Send it to the QuickSight folder in the S3 bucket
    # We will use this during visualization
    upload_segments(original)
    entire_transcript = original["results"]["transcripts"]
    outfile = 'output/'+job+'.txt'
    # Write as UTF-8 so accented characters in the Spanish transcripts are preserved
    with open(outfile, 'w', encoding='utf-8') as out:
        out.write(entire_transcript[0]['transcript'])
    s3.upload_file(outfile,bucket,OUTPUT_PATH_TRANSCRIBE+'/'+job+'.txt')
    print("Transcript uploaded to: " + f's3://{bucket}/{OUTPUT_PATH_TRANSCRIBE}/{job}.txt')

## Amazon Translate with Custom Terminology

[Amazon Translate](https://aws.amazon.com/translate/) is a fully managed neural machine translation service that delivers high-quality, affordable language translation in seventy-one languages. Using [custom terminology](https://docs.aws.amazon.com/translate/latest/dg/how-custom-terminology.html) with your translation requests lets you make sure that your brand names, character names, model names, and other unique content are translated exactly the way you need, regardless of context and the Amazon Translate algorithm's decision. It's easy to set up a terminology file and attach it to your Amazon Translate account. When you translate text, you simply choose to use the custom terminology as well, and every occurrence of a source term is translated as you specified.

### Translate the Spanish transcripts

We will first create a custom terminology file containing examples that show how you want words to be translated. In our case we use the CSV format, but [TMX is supported as well](https://docs.aws.amazon.com/translate/latest/dg/creating-custom-terminology.html). The file lists terms in a source language and, for each term, the desired translation in one or more target languages. We created a sample custom terminology file for our use case, `translate-custom-terminology.txt`, available in the input folder of this notebook, to translate our Spanish transcripts. We will now review this file and proceed with setting up the translation with custom terminology.
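
As an illustration of the CSV layout, the sketch below builds a two-column terminology file in memory: the header row holds the language codes (source first), and each following row holds a source term and its desired translation. The helper name `build_terminology_csv` and the term pairs are hypothetical and are not taken from the workshop's terminology file.

```python
import csv
import io


def build_terminology_csv(source_lang, target_lang, term_pairs):
    """Return CSV text with a language-code header and one row per term pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([source_lang, target_lang])
    writer.writerows(term_pairs)
    return buffer.getvalue()


# Hypothetical term pairs, Spanish source and English target
csv_text = build_terminology_csv("es", "en", [
    ("Amazon Transcribe", "Amazon Transcribe"),  # keep the product name unchanged
    ("nube", "cloud"),
])
print(csv_text)

# The bytes that could be passed as TerminologyData['File'] with Format='CSV'
csv_bytes = csv_text.encode("utf-8")
```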

#### Review custom terminology file

In [None]:
# Let's first review our custom terminology file
# We created a sample file for this workshop that we can use - uncomment below to check
#!pygmentize 'input/translate-custom-terminology.txt'

In [None]:
# Change extension to CSV and upload to S3 bucket
# No trailing slash here: the cells below add the '/' separator themselves
term_prefix = 'translate/custom-terminology'
pd_filename = 'translate-custom-terminology'
s3.upload_file('input/' + pd_filename + '.txt', bucket, term_prefix + '/' + pd_filename + '.csv')

#### Import custom terminology to Amazon Translate

In [None]:
# read the custom terminology csv file we uploaded
temp = s3_resource.Object(bucket, term_prefix + '/' + pd_filename + '.csv')
term_file = temp.get()['Body'].read().decode('utf-8')

In [None]:
# import the custom terminology file to Translate
term_name = 'aim317-custom-terminology'
response = translate_client.import_terminology(
 Name=term_name,
 MergeStrategy='OVERWRITE',
 TerminologyData={
 'File': term_file,
 'Format': 'CSV'
 }
)

#### Get the Spanish transcripts

In [None]:
# Review the list of transcripts to pick Spanish transcripts
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix=OUTPUT_PATH_TRANSCRIBE)

# Now copy the Spanish transcripts to Translate Input folder
for page in pages:
    for obj in page.get('Contents', []):
        lang = ''
        ts_file = obj['Key'].split('/')[2]
        tscript = ts_file.split('-')
        # The language tag is the third hyphen-separated token, so we need at least 3 tokens
        if len(tscript) > 2:
            lang = tscript[2]
        if lang == 'ES':
            copy_source = {'Bucket': bucket,'Key': obj['Key']}
            s3_resource.meta.client.copy(copy_source, bucket, INPUT_PATH_TRANSLATE + '/' + ts_file)

#### Run translation synchronously

**Note:** For the purposes of this workshop we run the translation synchronously because we have only 2 call transcripts to translate. For large-scale translation requirements, use [start_text_translation_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.start_text_translation_job); for batch custom translation requirements, use [parallel data with Active Custom Translation](https://docs.aws.amazon.com/translate/latest/dg/customizing-translations-parallel-data.html).
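
Synchronous translation requests are subject to a size limit on the input text, so a longer transcript may need to be split before it is sent. The sketch below shows one way to do that on sentence boundaries. The helper name `split_for_translate` and the default of 5,000 bytes are assumptions; check the current Amazon Translate quotas for the actual limit.

```python
import re


def split_for_translate(text, max_bytes=5000):
    """Split text into chunks of whole sentences, each at most max_bytes in UTF-8.

    A single sentence longer than max_bytes is returned as its own
    oversized chunk; this sketch does not split inside a sentence.
    """
    chunks = []
    current = ""
    # Split after sentence-ending punctuation followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        candidate = f"{current} {sentence}".strip()
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Tiny limit used here only to make the splitting visible
demo_chunks = split_for_translate("One. Two. Three.", max_bytes=10)
print(demo_chunks)
```

Each chunk could then be passed to `translate_text` in turn and the translated pieces joined with a space.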

In [None]:
# Read the Spanish transcripts from the Translate input folder in S3 bucket
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix=INPUT_PATH_TRANSLATE)
for page in pages:
    for obj in page.get('Contents', []):
        temp = s3_resource.Object(bucket, obj['Key'])
        trans_input = temp.get()['Body'].read().decode('utf-8')
        if len(trans_input) > 0:
            # Translate the Spanish transcripts
            trans_response = translate_client.translate_text(
                Text=trans_input,
                TerminologyNames=[term_name],
                SourceLanguageCode='es',
                TargetLanguageCode='en'
            )
            # Write the translated text to a temporary file
            with open('temp_translate.txt', 'w', encoding='utf-8') as outfile:
                outfile.write(trans_response['TranslatedText'])
            # Upload the translated text to S3 bucket
            s3.upload_file('temp_translate.txt', bucket, OUTPUT_PATH_TRANSLATE + '/en-' + obj['Key'].split('/')[2])
            print("Translated text file uploaded to: " + 's3://' + bucket + '/' + OUTPUT_PATH_TRANSLATE + '/en-' + obj['Key'].split('/')[2])
 

### Prepare Comprehend inputs

We will now collect the original English transcripts and the translated Spanish language transcripts and move them to the Comprehend input folder in our S3 bucket in preparation for next steps in the workshop.

In [None]:
# First copy the English transcripts
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix=OUTPUT_PATH_TRANSCRIBE)

for page in pages:
    for obj in page.get('Contents', []):
        lang = ''
        ts_file1 = obj['Key'].split('/')[2]
        tscript = ts_file1.split('-')
        # The language tag is the third hyphen-separated token, so we need at least 3 tokens
        if len(tscript) > 2:
            lang = tscript[2]
        if lang == 'EN':
            copy_source = {'Bucket': bucket,'Key': obj['Key']}
            s3_resource.meta.client.copy(copy_source, bucket, 'comprehend/input/' + ts_file1)

# Now copy the Spanish transcripts that were translated to English
pages = paginator.paginate(Bucket=bucket, Prefix=OUTPUT_PATH_TRANSLATE)

for page in pages:
    for obj in page.get('Contents', []):
        ts_file2 = obj['Key'].split('/')[2]
        if ts_file2.endswith('.txt'):
            copy_source = {'Bucket': bucket,'Key': obj['Key']}
            s3_resource.meta.client.copy(copy_source, bucket, 'comprehend/input/' + ts_file2)

Let us check that all the text files are ready for Comprehend custom inference. We should have 7 files in total: 2 calls that were transcribed in Spanish and translated to English, and 5 English calls that we transcribed.

In [None]:
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucket, Prefix='comprehend/input')
for page in pages:
    for obj in page.get('Contents', []):
        print(obj['Key'])

## End of notebook, go back to your workshop instructions