# Multilingual Transcription Pipeline
In this notebook, you will:
1. Deploy the trained model to a SageMaker endpoint
2. Create a transcription pipeline that feeds model predictions into Amazon Transcribe

### Import libraries and load AWS credentials

In [None]:
!pip install -U sagemaker

In [None]:
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.serializers import JSONSerializer
from sagemaker.predictor import Predictor
import json
import tarfile
import os
import pandas as pd
import boto3
import time

In [None]:
role = sagemaker.get_execution_role()
sess = sagemaker.session.Session()
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

bucket = sess.default_bucket()

Set `training_job_id` to the ID of your training job:

In [None]:
training_job_id = ''

### Deploy model

In [None]:
image_uri = f'{account_id}.dkr.ecr.{region}.amazonaws.com/spoken-language-detection'
model_path = f's3://{bucket}/models/{training_job_id}/output/model.tar.gz'
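The `model_path` above assumes the training job wrote its output under `models/` in the default bucket. If the job used a different output location, the artifact URI can be read from the job description instead. The following is a minimal sketch, assuming `training_job_id` is the full training job name and that the job has completed. `get_model_artifact_uri` is a hypothetical helper, and the client is passed in so that any object exposing `describe_training_job` (for example `boto3.client('sagemaker')`) can be used.

```python
def get_model_artifact_uri(sm_client, training_job_name):
    """Return the S3 URI of the model artifact recorded for a training job.

    sm_client is assumed to behave like boto3.client('sagemaker').
    """
    desc = sm_client.describe_training_job(TrainingJobName=training_job_name)
    # 'ModelArtifacts' is only present once the job has produced output
    return desc['ModelArtifacts']['S3ModelArtifacts']
```

In this notebook it could be called as `model_path = get_model_artifact_uri(boto3.client('sagemaker'), training_job_id)`.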

In [None]:
model = sagemaker.Model(
    image_uri=image_uri,
    model_data=model_path,
    role=role
)

model.deploy(1, 'ml.m4.xlarge')

In [None]:
predictor = Predictor(model.endpoint_name, serializer=JSONSerializer())

### Detect language and run transcription jobs

Download sample audio clips from [Audio Lingua](https://audio-lingua.eu).

In [None]:
!mkdir -p samples
!wget -O samples/french-sample.wav "https://audio-lingua.eu/spip.php?article7143"
!wget -O samples/english-sample.wav "https://audio-lingua.eu/spip.php?article6968"
!wget -O samples/russian-sample.wav "https://audio-lingua.eu/spip.php?article7109"
!wget -O samples/spanish-sample.wav "https://audio-lingua.eu/spip.php?article7103"
!wget -O samples/italian-sample.wav "https://audio-lingua.eu/spip.php?article7139"

List the files in the `samples` folder that will be transcribed. Each file is in a different language.

In [None]:
files = [os.path.join('samples', f) for f in os.listdir('samples')]
files
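`os.listdir` returns every entry in the folder, which in a Jupyter environment can include hidden directories such as `.ipynb_checkpoints`. A minimal sketch of a stricter listing that keeps only `.wav` files is shown here; `list_wav_files` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path

def list_wav_files(folder):
    """Return the sorted paths of the .wav files directly inside folder."""
    return sorted(str(p) for p in Path(folder).glob('*.wav') if p.is_file())
```

Calling `files = list_wav_files('samples')` would then replace the list comprehension above.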

Loop through the files: upload each one to S3, predict its language, and pass the predicted language and the audio file to Transcribe. Open Transcribe in the AWS console to see the transcription jobs. **Make sure that your role has access to Transcribe (attach the AmazonTranscribeFullAccess policy to your role).**
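Transcribe job names must be unique within an account and may contain only letters, digits, `.`, `_`, and `-`, so a job named after the file alone fails with a conflict error the second time the notebook is run. A minimal sketch of one way around this follows: a hypothetical helper, `make_job_name`, that sanitizes the file name and appends a suffix. Using the current Unix time as the default suffix is an assumption that is adequate for a demo, not a guarantee of uniqueness.

```python
import re
import time
from pathlib import Path

def make_job_name(audio_path, suffix=None):
    """Build a Transcribe-safe job name from a file name plus a suffix."""
    if suffix is None:
        # Assumption: whole seconds since the epoch are unique enough for a demo
        suffix = str(int(time.time()))
    stem = Path(audio_path).stem
    # Replace anything outside the allowed character set with '-'
    safe = re.sub(r'[^0-9a-zA-Z._-]', '-', stem)
    # Keep the full name within the 200-character limit
    return f'{safe[:200 - len(suffix) - 1]}-{suffix}'
```

For example, `make_job_name('samples/french-sample.wav')` could be passed as `TranscriptionJobName` in place of the bare file name.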

In [None]:
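# Map the model's two-letter predictions to Transcribe language codes
lang_code_dict = {
    'en': 'en-US',
    'es': 'es-ES',
    'it': 'it-IT',
    'fr': 'fr-FR',
    'ru': 'ru-RU',
    'de': 'de-DE'
}

transcribe = boto3.client('transcribe')

for f in files:
    # Upload the clip to the default bucket; upload_data returns its S3 URI
    s3_path = sess.upload_data(f, key_prefix='samples')

    # The response is parsed as a JSON list; take the prediction for the single input
    pred = predictor.predict([s3_path])
    pred = json.loads(pred)[0]

    print('Detected language : {}'.format(pred))
    # Use the file name without its extension as the job name
    job_name = os.path.splitext(os.path.basename(f))[0]

    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': s3_path},
        MediaFormat='wav',
        LanguageCode=lang_code_dict[pred]
    )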
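`start_transcription_job` returns immediately, so the jobs are still running when the loop finishes. As an alternative to checking the console, the job status can be polled from the notebook. This is a minimal sketch: `wait_for_transcription_job` is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and the client argument is assumed to behave like `boto3.client('transcribe')`.

```python
import time

def wait_for_transcription_job(client, job_name, poll_seconds=10, timeout_seconds=900):
    """Poll until the job is COMPLETED or FAILED and return its description.

    Raises TimeoutError if the job is still running after timeout_seconds.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_transcription_job(TranscriptionJobName=job_name)
        job = response['TranscriptionJob']
        status = job['TranscriptionJobStatus']
        if status in ('COMPLETED', 'FAILED'):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f'{job_name} still {status} after {timeout_seconds}s')
        time.sleep(poll_seconds)
```

For a completed job, `job['Transcript']['TranscriptFileUri']` points to the JSON transcript; for a failed job, `job['FailureReason']` describes the error.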

Be sure to delete the endpoint after evaluation to avoid ongoing charges.

In [None]:
predictor.delete_endpoint()