{ "cells": [ { "cell_type": "markdown", "id": "04837ae6", "metadata": {}, "source": [ "## Boost transcription accuracy with Amazon Translate custom vocabulary and localize transcripts with Amazon Translate custom terminology\n", "\n", "This is the accompanying notebook for the re:Invent 2021 workshop AIM317 - Uncover insights from your customer conversations. Please run this notebook after reviewing the **[prerequisites and instructions](https://studio.us-east-1.prod.workshops.aws/preview/076e45e5-760d-41cf-bd22-a86c46ee462c/builds/83c4ddb7-fbc6-4e72-b5da-967f8fe7cfcb/en-US/1-transcribe-translate-calls)** from the workshop. " ] }, { "cell_type": "markdown", "id": "f591749b", "metadata": {}, "source": [ "## Prerequisites for this notebook" ] }, { "cell_type": "code", "execution_count": null, "id": "49637499", "metadata": {}, "outputs": [], "source": [ "# First, let's install dependencies for the transcript word utility we will use in this notebook\n", "!pip install python-docx --quiet\n", "!pip install matplotlib --quiet\n", "!pip install scipy --quiet" ] }, { "cell_type": "markdown", "id": "c8906dc9", "metadata": {}, "source": [ "### Import libraries and initialize variables" ] }, { "cell_type": "code", "execution_count": null, "id": "7bedcea1", "metadata": {}, "outputs": [], "source": [ "import io\n", "import os\n", "import re\n", "import uuid\n", "import json\n", "import time\n", "import boto3\n", "import pprint\n", "import botocore\n", "import sagemaker\n", "import subprocess\n", "from sagemaker import get_execution_role\n", "from datetime import datetime, timezone" ] }, { "cell_type": "code", "execution_count": null, "id": "40d91817", "metadata": {}, "outputs": [], "source": [ "bucket = '' # Add your bucket name here\n", "\n", "region = boto3.session.Session().region_name\n", "\n", "# Amazon S3 (S3) client\n", "s3 = boto3.client('s3', region)\n", "s3_resource = boto3.resource('s3')\n", "try:\n", " s3.head_bucket(Bucket=bucket)\n", "except:\n", " print(\"The S3 bucket name {} you entered seems to be incorrect, please try again\".format(bucket))" ] }, { "cell_type": "code", "execution_count": null, "id": "935d324a", "metadata": {}, "outputs": [], "source": [ "INPUT_PATH_TRANSCRIBE = 'transcribe/input'\n", "OUTPUT_PATH_TRANSCRIBE = 'transcribe/output'\n", "INPUT_PATH_TRANSLATE = 'translate/input'\n", "OUTPUT_PATH_TRANSLATE = 'translate/output'" ] }, { "cell_type": "code", "execution_count": null, "id": "f32f26e5", "metadata": {}, "outputs": [], "source": [ "region = boto3.session.Session().region_name\n", "bucket_region = s3.head_bucket(Bucket=bucket)['ResponseMetadata']['HTTPHeaders']['x-amz-bucket-region']\n", "assert bucket_region == region, \"Your S3 bucket {} and this notebook need to be in the same region.\".format(bucket)\n", "# Amazon Transcribe client\n", "transcribe_client = boto3.client(\"transcribe\")\n", "# Amazon Translate client\n", "translate_client = boto3.client(\"translate\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d190c821", "metadata": {}, "outputs": [], "source": [ "# This is the execution role that will be used to call Amazon Transcribe and Amazon Translate\n", "role = get_execution_role()\n", "display(role)" ] }, { "cell_type": "markdown", "id": "36e8d4a4", "metadata": {}, "source": [ "## Amazon Transcribe Custom" ] }, { "cell_type": "markdown", "id": "f03eb05a", "metadata": {}, "source": [ "### Create custom vocabulary\n", "\n", "You can give Amazon Transcribe more information about how to process speech in your input file by creating a custom 
vocabulary in text file format. A custom vocabulary is a list of specific words that you want Amazon Transcribe to recognize in your audio input. These are generally domain-specific words and phrases, words that Amazon Transcribe isn't recognizing, or proper nouns." ] }, { "cell_type": "code", "execution_count": null, "id": "5a22d56c", "metadata": {}, "outputs": [], "source": [ "# First lets view our vocabulary files\n", "!pygmentize 'input/custom-vocabulary-EN.txt'" ] }, { "cell_type": "code", "execution_count": null, "id": "51c9f86b", "metadata": {}, "outputs": [], "source": [ "# First lets view our vocabulary files - Uncomment line below to view if you like\n", "#!pygmentize 'input/custom-vocabulary-ES.txt'" ] }, { "cell_type": "markdown", "id": "fdce93dc", "metadata": {}, "source": [ "#### Custom vocabularies can be in table or list formats\n", "\n", "Each vocabulary file can be in either table or list format; table format is strongly recommended because it gives you more options for and more control over the input and output of words within your custom vocabulary. As you saw above, we used the table format for this workshop. When you use the table format, it has 4 columns as explain below:\n", "\n", "1. **Phrase**\n", "The word or phrase that should be recognized. If the entry is a phrase, separate the words with a hyphen (-). For example, you type Los Angeles as Los-Angeles. The Phrase field is required\n", "\n", "1. **IPA**\n", "The pronunciation of your word or phrase using IPA characters. You can include characters in the International Phonetic Alphabet (IPA) in this field.\n", "\n", "1. **SoundsLike**\n", "The pronunciation of your word or phrase using the standard orthography of the language to mimic the way that the word sounds.\n", "\n", "1. **DisplayAs**\n", "Defines the how the word or phrase looks when it's output. For example, if the word or phrase is Los-Angeles, you can specify the display form as \"Los Angeles\" so that the hyphen is not present in the output." 
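, "\n", "For illustration only (this is a hypothetical row, not the workshop file you viewed above), a table-format vocabulary file is tab-separated, starts with a header row, and uses either the IPA or the SoundsLike column for a given entry, not both:\n", "\n", "```\n", "Phrase\tIPA\tSoundsLike\tDisplayAs\n", "Los-Angeles\t\tloss-ann-gel-es\tLos Angeles\n", "```\n"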
] }, { "cell_type": "code", "execution_count": null, "id": "dd14c202", "metadata": {}, "outputs": [], "source": [ "# Next we will upload our vocabulary files to our S3 bucket\n", "cust_vocab_en = 'custom-vocabulary-EN.txt'\n", "cust_vocab_es = 'custom-vocabulary-ES.txt'\n", "s3.upload_file('input/' + cust_vocab_en,bucket,INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_en)\n", "s3.upload_file('input/' + cust_vocab_es,bucket,INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_es)" ] }, { "cell_type": "code", "execution_count": null, "id": "2aba0fbe", "metadata": {}, "outputs": [], "source": [ "# Create the custom vocabulary in Transcribe\n", "# The name of your custom vocabulary must be unique!\n", "vocab_EN = 'custom-vocab-EN-' + str(uuid.uuid4())\n", "vocab_ES = 'custom-vocab-ES-' + str(uuid.uuid4())" ] }, { "cell_type": "code", "execution_count": null, "id": "70145b3e", "metadata": {}, "outputs": [], "source": [ "vocab_response_EN = transcribe_client.create_vocabulary(\n", " VocabularyName=vocab_EN,\n", " LanguageCode='en-US',\n", " VocabularyFileUri='s3://' + bucket + '/'+ INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_en\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "9cf568bd", "metadata": {}, "outputs": [], "source": [ "vocab_response_ES = transcribe_client.create_vocabulary(\n", " VocabularyName=vocab_ES,\n", " LanguageCode='es-US',\n", " VocabularyFileUri='s3://' + bucket + '/'+ INPUT_PATH_TRANSCRIBE + '/' + cust_vocab_es\n", ")" ] }, { "cell_type": "markdown", "id": "735d85ac", "metadata": {}, "source": [ "### Check Vocabulary status in Amazon Transcribe console\n", "\n", "[Go to Amazon Transcribe Console](https://console.aws.amazon.com/transcribe/home?region=us-east-1#vocabulary)\n", "\n", "This will take 3 to 5 minutes. Go to the **Perform Transcription** step below once the vocabulary has been created and is ready for use.\n", " " ] }, { "cell_type": "markdown", "id": "6ac52da7", "metadata": {}, "source": [ "### Perform Transcription" ] }, { "cell_type": "code", "execution_count": null, "id": "ad29b8c6", "metadata": {}, "outputs": [], "source": [ "# First let us list our audio files and then upload them to the S3 bucket\n", "audio_dir = 'input/audio-recordings'\n", "\n", "for subdir, dirs, files in os.walk(audio_dir):\n", " for file in files:\n", " s3.upload_file(os.path.join(subdir, file), bucket, 'transcribe/' + os.path.join(subdir, file))\n", " print(\"Uploaded to: \" + \"s3://\" + bucket + '/transcribe/' + os.path.join(subdir, file))" ] }, { "cell_type": "code", "execution_count": null, "id": "fa357c3a", "metadata": {}, "outputs": [], "source": [ "# Define the method that will perform transcription\n", "\n", "def transcribe(job_name, job_uri, lang_code, vocab_name):\n", " \"\"\"Transcribe audio files to text.\n", " Args:\n", " job_name (str): the name of the job that you specify;\n", " the output json will be job_name.json\n", " job_uri (str): input path (in s3) to the file being transcribed\n", " in_bucket (str): s3 bucket prefix where the input audio files are present\n", " out_bucket (str): s3 bucket name that you want the output json\n", " to be placed in\n", " vocab_name (str): name of custom vocabulary used;\n", " \"\"\"\n", " try:\n", " transcribe_client.start_transcription_job(\n", " TranscriptionJobName=job_name,\n", " LanguageCode=lang_code,\n", " Media={\"MediaFileUri\": job_uri},\n", " Settings={'VocabularyName': vocab_name, 'MaxSpeakerLabels': 2, 'ShowSpeakerLabels': True}\n", " )\n", " \n", " time.sleep(2)\n", " \n", " 
print(transcribe_client.get_transcription_job(TranscriptionJobName=job_name)['TranscriptionJob']['TranscriptionJobStatus'])\n", "\n", " except Exception as e:\n", " print(e)\n" ] }, { "cell_type": "markdown", "id": "abfd025d", "metadata": {}, "source": [ "**Note:** As you can see in the code below, we determine the language code to send to Amazon Transcribe. However, this is not required if you set [IdentifyLanguage](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/transcribe.html#TranscribeService.Client.start_transcription_job) to True. In our case we needed to select either the English or the Spanish custom vocabulary for each audio file, and hence we went with specific language codes. " ] }, { "cell_type": "code", "execution_count": null, "id": "75653a8e", "metadata": {}, "outputs": [], "source": [ "# Now we will loop through the recordings in our bucket to submit the transcription jobs\n", "now = datetime.now()\n", "time_now = now.strftime(\"%H.%M.%S\")\n", "\n", "paginator = s3.get_paginator('list_objects_v2')\n", "pages = paginator.paginate(Bucket=bucket, Prefix='transcribe/input/audio-recordings')\n", "job_name_list = []\n", "\n", "for page in pages:\n", " for obj in page['Contents']:\n", " audio_name = obj['Key'].split('/')[3].split('.')[0]\n", " job_name = audio_name + '-' + time_now\n", " job_name_list.append(job_name)\n", " job_uri = f\"s3://{bucket}/{obj['Key']}\"\n", " print('Submitting transcription for audio: ' + job_name)\n", " vocab = ''\n", " lang_code = ''\n", " if audio_name.split('-')[2] == 'EN':\n", " vocab = vocab_EN\n", " lang_code = 'en-US'\n", " elif audio_name.split('-')[2] == 'ES':\n", " vocab = vocab_ES\n", " lang_code = 'es-US'\n", " # Submit the transcription job; Transcribe will write the transcript JSON to its service-managed bucket\n", " transcribe(job_name, job_uri, lang_code, vocab)" ] }, { "cell_type": "markdown", "id": "275266e9", "metadata": {}, "source": [ "### Check Transcription job status in Amazon Transcribe console\n", "\n", "[Go to Amazon Transcribe Console](https://console.aws.amazon.com/transcribe/home?region=us-east-1#jobs)\n", "\n", "All the jobs will be complete in about 5 to 8 minutes in total. Go to the **Process Transcription output** step below once the transcription jobs show a status of Complete; otherwise you will get an error. No worries, just try again in a minute or so.\n", " " ] }, { "cell_type": "markdown", "id": "ffd8e4a4", "metadata": {}, "source": [ "### Process Transcription output" ] }, { "cell_type": "markdown", "id": "823fb45c", "metadata": {}, "source": [ "#### Clone Transcribe helper repo\n", "\n", "From a terminal window in your notebook instance, navigate to the current directory where this notebook resides, and execute the command `git clone https://github.com/aws-samples/amazon-transcribe-output-word-document` before executing the cell below. The steps you will have to follow are:\n", "\n", "1. From the Jupyter notebook home page, on the right select New --> Terminal\n", "1. In the terminal window, type `cd SageMaker`\n", "1. Now type `cd aim317-uncover-insights-customer-conversations`\n", "1. Now type `cd notebooks`\n", "1. Now type `cd 1-Transcribe-Translate-Calls`\n", "1. Finally type the command `git clone https://github.com/aws-samples/amazon-transcribe-output-word-document`\n" ] }, { "cell_type": "markdown", "id": "55d6cc7c", "metadata": {}, "source": [ "#### Create a Word document from the call transcript\n", "\n", "We will generate a Word document from the Amazon Transcribe response JSON so we can review the transcript. Once you execute the code in the next cell, go to your notebook folder and **you will see the Word document created with the Transcribe job name. Select this Word document, click Download, and open it to review the transcript**." ] }, { "cell_type": "code", "execution_count": null, "id": "c314aea5", "metadata": {}, "outputs": [], "source": [ "!python amazon-transcribe-output-word-document/python/ts-to-word.py --inputJob {job_name_list[0]}" ] }, { "cell_type": "markdown", "id": "82267b4f", "metadata": {}, "source": [ "### Get Call Segments\n", "\n", "We will get the call segments and speaker information to derive additional insights we can visualize in QuickSight. " ] }, { "cell_type": "code", "execution_count": null, "id": "75b324a0", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "def upload_segments(transcript):\n", " # Get the speaker segments\n", " cols = ['transcript_name', 'start_time', 'end_time', 'speaker_label']\n", " spk_df = pd.DataFrame(columns=cols)\n", " for seg in transcript['results']['speaker_labels']['segments']:\n", " for item in seg['items']:\n", " spk_df.loc[len(spk_df.index)] = [transcript['jobName'], item['start_time'], item['end_time'], item['speaker_label']]\n", " # Get the speaker content\n", " icols = ['transcript_name', 'start_time', 'end_time', 'confidence', 'content']\n", " item_df = pd.DataFrame(columns=icols)\n", " for itms in transcript['results']['items']:\n", " if itms.get('start_time') is not None:\n", " item_df.loc[len(item_df.index)] = [transcript['jobName'], itms['start_time'], itms['end_time'], itms['alternatives'][0]['confidence'], itms['alternatives'][0]['content']]\n", "\n", " # Merge the two on transcript name, start time and end time\n", " full_df = pd.merge(spk_df, item_df, how='left', left_on=['transcript_name', 'start_time', 'end_time'], right_on = ['transcript_name', 'start_time', 'end_time'])\n", " # We will use the Transcribe Job Name for the CSV file name\n", " csv_file = transcript['jobName'] + '.csv'\n", " full_df.to_csv(csv_file, index=False)\n", " s3.upload_file(csv_file, bucket, 'quicksight/data/transcripts/' + csv_file)\n", " # The print below is too verbose so commenting for now - feel free to uncomment if needed\n", " #print(\"CSV file with speaker segments created and uploaded for visualization input to: \" + \"s3://\" + bucket + \"/\" + \"quicksight/data/transcripts/\" + csv_file)" ] }, { "cell_type": "markdown", "id": "51c363b8", "metadata": {}, "source": [ "#### Upload transcript text files to S3 bucket\n", "\n", "We will now get the full transcript from each call and send it to our S3 bucket in preparation for our translation tasks.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f86e1c56", "metadata": {}, "outputs": [], "source": [ "# First we need an output directory\n", "output_dir = os.getcwd() + '/output'\n", "if not os.path.exists(output_dir):\n", " os.makedirs(output_dir)" ] }, { "cell_type": "code", "execution_count": null, "id": "73269b2d", "metadata": {}, "outputs": [], "source": [ "# Our transcript is behind a presigned URL in Transcribe's S3 bucket; let us download it and get the text we need\n", "import urllib3\n", "\n", "for job in job_name_list:\n", 
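" # The TranscriptFileUri returned below is a time-limited presigned HTTPS URL pointing to the\n", " # transcript JSON in the Transcribe service-managed bucket, so we fetch it over HTTP with urllib3.\n",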
" response = transcribe_client.get_transcription_job(\n", " TranscriptionJobName=job \n", " )\n", " file_name = response['TranscriptionJob']['Transcript']['TranscriptFileUri']\n", " http = urllib3.PoolManager()\n", " transcribed_data = http.request('GET', file_name)\n", " original = json.loads(transcribed_data.data.decode('utf-8'))\n", " # Extract the speaker segments, confidence scores for each call\n", " # Send it to the QuickSight folder in the S3 bucket\n", " # We will use this during visualization\n", " upload_segments(original)\n", " entire_transcript = ''\n", " entire_transcript = original[\"results\"][\"transcripts\"]\n", " outfile = 'output/'+job+'.txt'\n", " with open(outfile, 'w') as out:\n", " out.write(entire_transcript[0]['transcript'])\n", " s3.upload_file(outfile,bucket,OUTPUT_PATH_TRANSCRIBE+'/'+job+'.txt')\n", " print(\"Transcript uploaded to: \" + f's3://{bucket}/{OUTPUT_PATH_TRANSCRIBE}/{job}.txt')" ] }, { "cell_type": "markdown", "id": "19f1202e", "metadata": {}, "source": [ "## Amazon Translate with Custom Terminology\n", "\n", "[Amazon Translate](https://aws.amazon.com/translate/) is a fully managed, neural machine translation service that delivers high quality and affordable language translation in seventy-one languages. Using [custom terminology](https://docs.aws.amazon.com/translate/latest/dg/how-custom-terminology.html) with your translation requests enables you to make sure that your brand names, character names, model names, and other unique content is translated exactly the way you need it, regardless of its context and the Amazon Translate algorithm’s decision. It's easy to set up a terminology file and attach it to your Amazon Translate account. When you translate text, you simply choose to use the custom terminology as well, and any examples of your source word are translated as you want them." ] }, { "cell_type": "markdown", "id": "98cfd5aa", "metadata": {}, "source": [ "### Translate the Spanish transcripts\n", "\n", "We will first create a custom terminology file that consists of examples that show how you want words to be translated. In our case we are using a CSV file as the format, but it supports [TMX as well](https://docs.aws.amazon.com/translate/latest/dg/creating-custom-terminology.html). It includes a collection of words or terminologies in a source language, and for each example, it contains the desired translation output in one or more target languages. We created a sample custom terminology file for our use case which is available in the input folder of this notebook `translate-custom-terminology.txt` to create a translation of our Spanish transcripts. We will now review this file and proceed with setting up a Custom Translation job." 
] }, { "cell_type": "markdown", "id": "7826ef83", "metadata": {}, "source": [ "#### Review custom terminology file" ] }, { "cell_type": "code", "execution_count": null, "id": "6e143fdc", "metadata": {}, "outputs": [], "source": [ "# Lets first review our custom terminology file \n", "# We created a sample file for this workshop that we can use - uncomment below to check\n", "#!pygmentize 'input/translate-custom-terminology.txt'" ] }, { "cell_type": "code", "execution_count": null, "id": "0271a2b2", "metadata": {}, "outputs": [], "source": [ "# Change extension to CSV and upload to S3 bucket\n", "term_prefix = 'translate/custom-terminology/'\n", "pd_filename = 'translate-custom-terminology'\n", "s3.upload_file('input/' + pd_filename + '.txt', bucket, term_prefix + '/' + pd_filename + '.csv')" ] }, { "cell_type": "markdown", "id": "58ffe46e", "metadata": {}, "source": [ "#### Import custom terminology to Amazon Translate" ] }, { "cell_type": "code", "execution_count": null, "id": "d5a844e8", "metadata": {}, "outputs": [], "source": [ "# read the custom terminology csv file we uploaded\n", "temp = s3_resource.Object(bucket, term_prefix + '/' + pd_filename + '.csv')\n", "term_file = temp.get()['Body'].read().decode('utf-8')" ] }, { "cell_type": "code", "execution_count": null, "id": "b7e81491", "metadata": {}, "outputs": [], "source": [ "# import the custom terminology file to Translate\n", "term_name = 'aim317-custom-terminology'\n", "response = translate_client.import_terminology(\n", " Name=term_name,\n", " MergeStrategy='OVERWRITE',\n", " TerminologyData={\n", " 'File': term_file,\n", " 'Format': 'CSV'\n", " }\n", ")" ] }, { "cell_type": "markdown", "id": "7349b0fe", "metadata": {}, "source": [ "#### Get the Spanish transcripts" ] }, { "cell_type": "code", "execution_count": null, "id": "29a9b1fb", "metadata": {}, "outputs": [], "source": [ "# Review the list of transcripts to pick Spanish transcripts\n", "paginator = s3.get_paginator('list_objects_v2')\n", "pages = paginator.paginate(Bucket=bucket, Prefix=OUTPUT_PATH_TRANSCRIBE)\n", "\n", "s3_resource = boto3.resource('s3')\n", "# Now copy the Spanish transcripts to Translate Input folder\n", "for page in pages:\n", " for obj in page['Contents']:\n", " lang = ''\n", " ts_file = obj['Key'].split('/')[2]\n", " tscript = ts_file.split('-')\n", " if len(tscript) > 1:\n", " lang = tscript[2]\n", " if lang == 'ES':\n", " copy_source = {'Bucket': bucket,'Key': obj['Key']}\n", " s3_resource.meta.client.copy(copy_source, bucket, INPUT_PATH_TRANSLATE + '/' + ts_file)" ] }, { "cell_type": "markdown", "id": "d823c05d", "metadata": {}, "source": [ "#### Run translation synchronously\n", "\n", "**Note:** For the purposes of this workshop we are running this translate synchronously as we have only 2 call transcripts to be translated. 
For large scale translation requirements, you should use [start_text_translation_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html#Translate.Client.start_text_translation_job) and for batch custom translation processing requirements you should use the [Parallel Data File with Translate Active Custom Translation](https://docs.aws.amazon.com/translate/latest/dg/customizing-translations-parallel-data.html) " ] }, { "cell_type": "code", "execution_count": null, "id": "d221d433", "metadata": {}, "outputs": [], "source": [ "# Read the spanish transcripts from the Translate input folder in S3 bucket\n", "paginator = s3.get_paginator('list_objects_v2')\n", "pages = paginator.paginate(Bucket=bucket, Prefix=INPUT_PATH_TRANSLATE)\n", "for page in pages:\n", " for obj in page['Contents']:\n", " temp = s3_resource.Object(bucket, obj['Key'])\n", " trans_input = temp.get()['Body'].read().decode('utf-8')\n", " if len(trans_input) > 0:\n", " # Translate the Spanish transcripts\n", " trans_response = translate_client.translate_text(\n", " Text=trans_input,\n", " TerminologyNames=[term_name],\n", " SourceLanguageCode='es',\n", " TargetLanguageCode='en'\n", " )\n", " # Write the translated text to a temporary file\n", " with open('temp_translate.txt', 'w') as outfile:\n", " outfile.write(trans_response['TranslatedText'])\n", " # Upload the translated text to S3 bucket \n", " s3.upload_file('temp_translate.txt', bucket, OUTPUT_PATH_TRANSLATE + '/en-' + obj['Key'].split('/')[2])\n", " print(\"Translated text file uploaded to: \" + 's3://' + bucket + '/' + OUTPUT_PATH_TRANSLATE + '/en-' + obj['Key'].split('/')[2])\n", " " ] }, { "cell_type": "markdown", "id": "33118b19", "metadata": {}, "source": [ "### Prepare Comprehend inputs\n", "\n", "We will now collect the original English transcripts and the translated Spanish language transcripts and move them to the Comprehend input folder in our S3 bucket in preparation for next steps in the workshop." ] }, { "cell_type": "code", "execution_count": null, "id": "76c148bf", "metadata": {}, "outputs": [], "source": [ "# First copy the English transcripts\n", "paginator = s3.get_paginator('list_objects_v2')\n", "pages = paginator.paginate(Bucket=bucket, Prefix=OUTPUT_PATH_TRANSCRIBE)\n", "\n", "s3_resource = boto3.resource('s3')\n", "\n", "for page in pages:\n", " for obj in page['Contents']:\n", " ts_file1 = obj['Key'].split('/')[2]\n", " tscript = ts_file1.split('-')\n", " if len(tscript) > 1:\n", " lang = tscript[2]\n", " if lang == 'EN':\n", " copy_source = {'Bucket': bucket,'Key': obj['Key']}\n", " s3_resource.meta.client.copy(copy_source, bucket, 'comprehend/input/' + ts_file1)\n", "\n", "# Now copy the Spanish transcripts that were translated to English\n", "pages = paginator.paginate(Bucket=bucket, Prefix=OUTPUT_PATH_TRANSLATE)\n", "\n", "for page in pages:\n", " for obj in page['Contents']:\n", " ts_file2 = obj['Key'].split('/')[2]\n", " if 'txt' in ts_file2:\n", " copy_source = {'Bucket': bucket,'Key': obj['Key']}\n", " s3_resource.meta.client.copy(copy_source, bucket, 'comprehend/input/' + ts_file2) " ] }, { "cell_type": "markdown", "id": "9b4d9a2c", "metadata": {}, "source": [ "Let us review if all the text files are ready for Comprehend custom inference. We should have 7 files in total with two calls that were transcribed in Spanish and translated to English, and 5 English calls that we transcribed. 
" ] }, { "cell_type": "code", "execution_count": null, "id": "5d8fcbd6", "metadata": {}, "outputs": [], "source": [ "paginator = s3.get_paginator('list_objects_v2')\n", "pages = paginator.paginate(Bucket=bucket, Prefix='comprehend/input')\n", "for page in pages:\n", " for obj in page['Contents']:\n", " print(obj['Key'])" ] }, { "cell_type": "markdown", "id": "f5cfcf5f", "metadata": {}, "source": [ "## End of notebook, go back to your workshop instructions" ] } ], "metadata": { "interpreter": { "hash": "9ddb102edfbd95000dbbd260d8bbcf82701cc06b4dcf114fa04ba84aab75adcb" }, "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }