## Convert video to text with a speech-to-text model and a sentence embedding model

In this notebook, we extract information from video/audio files with the [Whisper model](https://github.com/openai/whisper). By leveraging its multilingual support, we can extract transcripts from video files in different languages, even when a single video file mixes several languages. We provide the following options for Whisper inference:
- Batch inference with a SageMaker Processing job: process large volumes of data and store the results in a vector database for a RAG solution.
- Real-time inference with a SageMaker endpoint: summarize or run QA on a short video/audio file (less than 6 MB).
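
The 6 MB limit on real-time inference suggests a simple routing rule between the two options. The following is a minimal sketch, not part of the original notebook: `choose_inference_mode`, `choose_inference_mode_for_file`, and `REALTIME_LIMIT_BYTES` are hypothetical names, and reading "6 MB" as 6 * 1024 * 1024 bytes is an assumption.

```python
import os

# Assumption: "6 MB" is interpreted as 6 * 1024 * 1024 bytes.
REALTIME_LIMIT_BYTES = 6 * 1024 * 1024


def choose_inference_mode(size_bytes: int) -> str:
    """Return "realtime" for files under the limit, otherwise "batch"."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "realtime" if size_bytes < REALTIME_LIMIT_BYTES else "batch"


def choose_inference_mode_for_file(path: str) -> str:
    """Look up the file size on disk and pick an inference mode."""
    return choose_inference_mode(os.path.getsize(path))
```

For a local file, `choose_inference_mode_for_file("test_raw_data/test.webm")` would indicate whether the file is small enough for the endpoint or should go through the Processing job instead.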

In [None]:
!pip install -U sagemaker -q

## Set up

In [None]:
from sagemaker.huggingface import HuggingFaceProcessor
from sagemaker import get_execution_role
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
import boto3
import json

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.session.Session()
bucket = sess.default_bucket()
prefix = "sagemaker/rag_video"
folder_name = "genai_workshop"
s3_input = f"s3://{bucket}/{prefix}/raw_data/{folder_name}" # Directory for video files
s3_output_clips = f"s3://{bucket}/{prefix}/clips" # Directory for video clips
s3_output_transcript = f"s3://{bucket}/{prefix}/transcript" # Directory for transcripts

In [None]:
%store s3_output_transcript

## Upload test data to S3 bucket

Download data from YouTube.

In [None]:
# Download data from YouTube
!pip install pytube

In [None]:
from pytube import YouTube

VIDEO_SAVE_DIRECTORY = "./videos"
AUDIO_SAVE_DIRECTORY = "./audio"

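def download(video_url):
    """Download the highest-resolution stream of a YouTube video to VIDEO_SAVE_DIRECTORY."""
    video = YouTube(video_url)
    video = video.streams.get_highest_resolution()

    try:
        video.download(VIDEO_SAVE_DIRECTORY)
    except Exception as e:
        # Report the failure and skip the success message
        print(f"Failed to download video: {e}")
        return

    print("video was downloaded successfully")


def download_audio(video_url):
    """Download the first audio-only stream of a YouTube video to AUDIO_SAVE_DIRECTORY."""
    video = YouTube(video_url)
    audio = video.streams.filter(only_audio=True).first()

    try:
        audio.download(AUDIO_SAVE_DIRECTORY)
    except Exception as e:
        # Report the failure and skip the success message
        print(f"Failed to download audio: {e}")
        return

    print("audio was downloaded successfully")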

In [None]:
# Other sample videos:
# JAWS-UG AI/ML (Japanese) #16 Generative AI: https://www.youtube.com/watch?v=PkZenNAXtYs
# New York Summit 2023 AIML: https://www.youtube.com/watch?v=1PkABWCJINM (36 minutes in total)

In [None]:
download("https://www.youtube.com/watch?v=dBzCGcwYCJo")

In [None]:
!aws s3 cp videos/genai_interview.mp4 {s3_input}/

## Batch inference with SageMaker Processing

In [None]:
hfp = HuggingFaceProcessor(
    role=role,  # reuse the role resolved in the setup cell
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.28.1',
    pytorch_version='2.0.0',
    base_job_name='frameworkprocessor-hf',
    py_version="py310"
)

In [None]:
hfp.run(
    code='preprocessing.py',
    source_dir="data_preparation",
    inputs=[
        ProcessingInput(source=s3_input, destination="/opt/ml/processing/input")
    ],
    outputs=[
        ProcessingOutput(source='/opt/ml/processing/output_clips', destination=s3_output_clips),
        ProcessingOutput(source='/opt/ml/processing/transcripts', destination=s3_output_transcript),
    ],
    arguments=[
        "--whisper-model", "whisper-large-v2",
        "--target-language", "en",
        "--sentence-embedding-model", "all-mpnet-base-v2",
        "--order", "5"
    ]
)
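
The `arguments` list is passed to `preprocessing.py` as command-line arguments. The script itself is not shown in this notebook, so the following is only a sketch of how such flags could be parsed with `argparse`. The `parse_args` helper, the default values, and treating `--order` as an integer are assumptions, not the actual script.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical parser mirroring the flags passed to hfp.run(); the real
    # preprocessing.py may define them differently.
    parser = argparse.ArgumentParser()
    parser.add_argument("--whisper-model", type=str, default="whisper-large-v2")
    parser.add_argument("--target-language", type=str, default="en")
    parser.add_argument("--sentence-embedding-model", type=str, default="all-mpnet-base-v2")
    parser.add_argument("--order", type=int, default=5)  # assumed to be an integer
    return parser.parse_args(argv)


# Parse an explicit list so the sketch also runs inside a notebook kernel.
args = parse_args(["--whisper-model", "whisper-large-v2", "--order", "5"])
print(args.whisper_model, args.target_language, args.order)
```

`argparse` converts dashes in flag names to underscores, so `--sentence-embedding-model` becomes `args.sentence_embedding_model`.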

## Deploy Whisper model to SageMaker for real-time inference

In [None]:
endpoint_name = "whisper-large-v2"
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'openai/whisper-large-v2',
    'HF_TASK': 'automatic-speech-recognition',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role
)

In [None]:
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,  # number of instances
    instance_type='ml.g5.xlarge'  # ec2 instance type
)

In [None]:
client = boto3.client('runtime.sagemaker')
file = "test_raw_data/test.webm"
with open(file, "rb") as f:
    data = f.read()

In [None]:
response = client.invoke_endpoint(EndpointName=endpoint_name, ContentType='audio/x-audio', Body=data)
output = json.loads(response['Body'].read())
print(f"Extracted text from the audio file:\n {output['text']}")
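
The response handling above assumes the body is a JSON object with a `text` field. As a minimal sketch (the helper name `extract_transcript` is hypothetical and not part of the notebook), the parsing step could be isolated so that an unexpected payload produces a clear error:

```python
import json


def extract_transcript(body: bytes) -> str:
    """Parse an endpoint response body and return the transcribed text."""
    payload = json.loads(body)
    # Assumption: the endpoint returns a JSON object with a "text" key,
    # as in the cell above.
    if not isinstance(payload, dict) or "text" not in payload:
        raise ValueError(f"Unexpected response payload: {payload!r}")
    return payload["text"].strip()
```

With the cell above, this would be called as `extract_transcript(response['Body'].read())`.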

You can follow the section `Example - Build a multi-functional chatbot with Amazon SageMaker` in the [README](./README.md) to build a multi-functional chatbot with the Whisper endpoint.
Please delete the endpoint once you no longer need it.

In [None]:
predictor.delete_endpoint()