# Audio Transcribe With Custom Library for Content Moderation

Audio and video files can contain profanity. When transcribing such files, for example to produce closed captions, you often need to mask or remove words that are unsuitable for the audience. Amazon Transcribe provides custom vocabulary filters for this purpose: you supply a list of words, and Transcribe masks, removes, or tags them in the transcript.
Learn more about Amazon Transcribe [here](https://aws.amazon.com/transcribe/).

In this tutorial, we will learn how to create and use a custom vocabulary filter to mask unsuitable words in a transcription.
You can learn more [here](https://docs.aws.amazon.com/transcribe/latest/dg/vocabulary-filtering.html).

![audio-moderation-arch](../images/TranscribeAudioModeration.png)

- [Step 1: Setup Notebook](#step1)
- [Step 2: Setup Variables and import packages](#step2)
- [Step 3: Setup input audio file and run Transcribe without moderation](#step3)
- [Step 4: Create a custom vocabulary file](#step4)
- [Step 5: Run transcribe Job with Custom Vocabulary Filter](#step5)

# Step 1: Setup Notebook 
Run the cell below to install or update the Python dependencies if you are running the lab in a local IDE. This step is optional in a SageMaker Studio Jupyter notebook, whose kernel already includes the dependencies.

In [None]:
# install the below packages if you are using a local IDE and they are not installed in your env
%pip install -qU pip
%pip install boto3 -qU

# Step 2: Setup Variables and import packages 

In [None]:
import boto3
import sagemaker as sm
import os
import datetime
import time
import json

# variables
data_bucket = sm.Session().default_bucket()
region = boto3.session.Session().region_name

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sm.get_execution_role()
# The role needs SageMaker and Transcribe permissions, e.g. the AmazonSageMakerFullAccess and AmazonTranscribeFullAccess managed policies
print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3=boto3.client('s3')
transcribe_client=boto3.client('transcribe', region_name=region)

# Step 3: Setup input audio file and run Transcribe without moderation 
Run the below cell to upload a sample audio file (mp3) to the default S3 bucket for Transcribe to access.

In [None]:
#Here we use a file from the datasets directory and upload it to S3. You can modify the name/path for s3_key, location of audio file to upload. 
#Also change the name of job_name if you are rerunning the job.
s3_key = 'content-moderation-im/audio-moderation/moderation-audio-speech.mp3'
s3.upload_file('../datasets/moderation-audio-speech.mp3', data_bucket, s3_key)
file_uri = 's3://'+data_bucket+'/'+s3_key
print(file_uri)

Call the Transcribe **StartTranscriptionJob** API to transcribe the audio to text. StartTranscriptionJob is asynchronous: it starts a job managed by Transcribe and returns immediately. We then call the **GetTranscriptionJob** API repeatedly to check the job status until the job completes or fails.

In [None]:
import uuid
job_name = f'audio_moderation_{str(uuid.uuid1())[0:4]}'

transcribe_client.start_transcription_job(
    TranscriptionJobName = job_name,
    Media = {
        'MediaFileUri': file_uri
    },
    OutputBucketName = data_bucket,
    OutputKey = 'content-moderation-im/audio-moderation/my-output-files/',
    MediaFormat = 'mp3',
    LanguageCode = 'en-US'
)
max_tries = 60
while max_tries > 0:
    max_tries -= 1
    job = transcribe_client.get_transcription_job(TranscriptionJobName = job_name)
    job_status = job['TranscriptionJob']['TranscriptionJobStatus']
    if job_status in ['COMPLETED', 'FAILED']:
        print(f"Job {job_name} is {job_status}.")
        if job_status == 'COMPLETED':
            print(
                f"Download the transcript from\n"
                f"\t{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}.")
        break
    else:
        print(f"Waiting for {job_name}. Current status is {job_status}.")
    time.sleep(10)
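
The polling loop above can be factored into a small reusable helper. The following is a minimal sketch, not part of the original lab: `wait_for_job` and its parameters are hypothetical names, and the status lookup is passed in as a callable so the helper can be exercised without calling AWS.

```python
import time

# Statuses after which a transcription job will not change any further.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def wait_for_job(get_status, max_tries=60, delay_seconds=10, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status.

    get_status is any zero-argument callable returning the job status string.
    Raises TimeoutError if no terminal status is seen within max_tries polls.
    """
    for attempt in range(max_tries):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        # Do not sleep after the final attempt; we are about to give up.
        if attempt < max_tries - 1:
            sleep(delay_seconds)
    raise TimeoutError(f"Job did not finish after {max_tries} polls")
```

With the notebook's client, the callable could be `lambda: transcribe_client.get_transcription_job(TranscriptionJobName=job_name)['TranscriptionJob']['TranscriptionJobStatus']`. Unlike the loop above, this sketch raises an error on timeout instead of falling through silently.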

Once the job completes, Transcribe stores the output as a JSON file under the S3 prefix given in the `OutputKey` parameter, using the job name as the file name. We now open the transcript JSON from S3 and check the transcription accuracy. Some of the words may be inappropriate for your audience; identify those words and make a list.

In [None]:
filename = 'content-moderation-im/audio-moderation/my-output-files/'+job_name+'.json'
s3_clientobj = s3.get_object(Bucket=data_bucket, Key=filename)
s3_clientdata = s3_clientobj["Body"].read().decode("utf-8")
original = json.loads(s3_clientdata)
output_transcript = original["results"]["transcripts"]
print(output_transcript)
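
`results.transcripts` is a list of objects, so the `print` above shows a list containing a dictionary rather than plain text. A minimal sketch of pulling out just the text, assuming the standard Transcribe output layout (`results.transcripts[].transcript`); the helper name and the sample document are made up for illustration:

```python
import json


def transcript_text(transcribe_output):
    """Join the text of every entry under results.transcripts."""
    entries = transcribe_output["results"]["transcripts"]
    return " ".join(entry["transcript"] for entry in entries)


# Hand-written sample that mimics the shape of a Transcribe output file.
sample = json.loads("""
{
  "jobName": "example-job",
  "results": {
    "transcripts": [{"transcript": "Hello and welcome to the show."}],
    "items": []
  },
  "status": "COMPLETED"
}
""")

print(transcript_text(sample))
```

In the notebook, the same helper could be applied to the `original` dictionary loaded above.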

# Step 4: Create a custom vocabulary file 
Amazon Transcribe allows you to create [Custom Vocabulary Filters](https://docs.aws.amazon.com/transcribe/latest/dg/vocabulary-filtering.html) to delete, mask, or flag words. This section will customize a profanity filter to redact offensive terms while transcribing the audio.

Open the "my-vocabulary-filter.txt" file in the datasets directory and review the list of words entered there. If you wish to add terms, do so and save the file, then execute the cell below, which uploads the file to the default S3 bucket for Transcribe to access.

In [None]:
# Upload the text file with profane terms to the default S3 bucket
s3_vocabfile_key = 'content-moderation-im/audio-moderation/my-vocabulary-filters/my-vocabulary-filter.txt'
s3.upload_file('../datasets/my-vocabulary-filter.txt', data_bucket, s3_vocabfile_key)
vocab_file_uri = 's3://'+data_bucket+'/'+s3_vocabfile_key
print(vocab_file_uri)
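
The filter file is plain text with one word per line. As a sketch of generating it from a Python list rather than editing it by hand (the function name, the placeholder terms, and the normalization choices are assumptions, not part of the original lab):

```python
from pathlib import Path
import tempfile


def write_filter_file(words, path):
    """Write words to path, one per line, trimmed, lowercased and de-duplicated."""
    seen = set()
    cleaned = []
    for word in words:
        normalized = word.strip().lower()
        # Skip blank entries and repeats while keeping the original order.
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


# Placeholder terms; substitute the words you want to filter.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "my-vocabulary-filter.txt"
    written = write_filter_file(["Darn", "heck ", "darn", ""], target)
    print(written)
    print(target.read_text(encoding="utf-8"))
```

Writing to a temporary directory keeps the example from overwriting the lab's own file; in practice you would write to `../datasets/my-vocabulary-filter.txt` before running the upload cell.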

Now, let's call the Transcribe **CreateVocabularyFilter** API to create a new vocabulary filter from the uploaded file.

In [None]:
# Change vocab_name if a vocabulary filter with this name already exists
vocab_name = "audio-moderation-filter"
response = transcribe_client.create_vocabulary_filter(
    LanguageCode = 'en-US',
    VocabularyFilterName = vocab_name,
    VocabularyFilterFileUri = vocab_file_uri
)
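
`CreateVocabularyFilter` also accepts the terms inline through a `Words` list instead of `VocabularyFilterFileUri`; a request can use one or the other, not both. A minimal sketch of building the request arguments under that rule (the `build_filter_request` helper, the filter name, and the terms are hypothetical):

```python
def build_filter_request(name, language_code="en-US", words=None, file_uri=None):
    """Return keyword arguments for create_vocabulary_filter.

    Exactly one of words or file_uri must be supplied.
    """
    if (words is None) == (file_uri is None):
        raise ValueError("Provide exactly one of words or file_uri")
    request = {"VocabularyFilterName": name, "LanguageCode": language_code}
    if words is not None:
        request["Words"] = list(words)
    else:
        request["VocabularyFilterFileUri"] = file_uri
    return request


# Placeholder name and terms, for illustration only.
request = build_filter_request("example-filter", words=["darn", "heck"])
print(request)
```

The resulting dictionary could then be passed as `transcribe_client.create_vocabulary_filter(**request)`. Keeping the argument check separate from the API call makes it easy to validate without AWS credentials.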

# Step 5: Run transcribe Job with Custom Vocabulary Filter 
In this step, we call the same Transcribe `StartTranscriptionJob` API but pass the vocabulary filter created in the previous step as an additional parameter under `Settings`. Transcribe then transcribes the audio to text and applies the filter using the method given in `VocabularyFilterMethod`.

In [None]:
#In this case we are running the transcribe job with option to "mask" the words. You can also run the job with "remove" or "tag". 
#In each case, give a new job_name and change the VocabularyFilterMethod to the desired option

job_name = f'transcription_with_audio_moderation_mask_{str(uuid.uuid1())[0:4]}'

transcribe_client.start_transcription_job(
    TranscriptionJobName = job_name,
    Media = {
        'MediaFileUri': file_uri
    },
    OutputBucketName = data_bucket,
    OutputKey = 'content-moderation-im/audio-moderation/my-output-files/',
    LanguageCode = 'en-US',
    Settings = {
        'VocabularyFilterName': vocab_name,
        'VocabularyFilterMethod': 'mask'
    }
)

while True:
    status = transcribe_client.get_transcription_job(TranscriptionJobName = job_name)
    if status['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    print("Not ready yet...")
    time.sleep(5)
print(status)

Once the Transcribe job completes, let's download and open the JSON file from the output prefix in S3 and check whether the words from the vocabulary filter have been masked or removed, according to the selected method.

In [None]:
filename = 'content-moderation-im/audio-moderation/my-output-files/'+job_name+'.json'
s3_clientobj = s3.get_object(Bucket=data_bucket, Key=filename)
s3_clientdata = s3_clientobj["Body"].read().decode("utf-8")
original = json.loads(s3_clientdata)
output_transcript = original["results"]["transcripts"]
print(output_transcript)

As you can see in the above text, Transcribe masked the profane words based on the vocabulary filter we uploaded earlier.
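
With the `mask` method, Transcribe replaces each filtered word with three asterisks (`***`). To check the result programmatically rather than by eye, the masked tokens can be counted. This is a rough sketch; the helper name and the sample sentences are invented for illustration:

```python
MASK = "***"


def count_masked(transcript):
    """Count tokens that consist of the mask, ignoring surrounding punctuation."""
    tokens = transcript.split()
    return sum(1 for token in tokens if token.strip(".,!?;:\"'") == MASK)


# Invented sample text shaped like a masked transcript.
sample_transcript = "What the *** is going on? This is ***."
print(count_masked(sample_transcript))
```

In the notebook, the function could be applied to `output_transcript[0]["transcript"]`. A count of zero after a `mask` job suggests that none of the words in the filter occurred in the audio.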

# Cleanup
We will delete the audio file in the S3 bucket and the vocabulary filter in Transcribe to clean up resources and prevent unnecessary costs.

In [None]:
transcribe_client.delete_vocabulary_filter(VocabularyFilterName=vocab_name)

In [None]:
s3.delete_object(Bucket=data_bucket, Key=s3_key)

# Conclusion
In this lab, we learned how to use a custom vocabulary filter with Amazon Transcribe to filter out offensive and profane words.