# Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and SageMaker

## Background

Wav2Vec2 is a transformer-based architecture for automatic speech recognition (ASR), released in September 2020. A simplified architecture diagram is shown below; for more details, see the [original paper](https://arxiv.org/abs/2006.11477). The model consists of a multi-layer convolutional neural network (CNN) that acts as a feature extractor: it takes the raw audio signal as input and outputs audio representations (features). These features are fed into a transformer network to generate contextualized representations. This part of training can be self-supervised, which means the transformer can be trained on large amounts of unlabeled speech. The model is then fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for a specific ASR task. The base model we use in this post is [Wav2Vec2-Base-960h](https://huggingface.co/facebook/wav2vec2-base-960h), which is fine-tuned on 960 hours of Librispeech speech audio sampled at 16 kHz.
<img src="images/wav2vec2.png">
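
To get a feel for what the CNN feature extractor produces, the sketch below computes how many feature frames come out of a given number of input samples. The kernel sizes and strides are assumptions taken from the published wav2vec2-base configuration (they are not stated in this notebook), and the helper name `num_feature_frames` is made up for illustration.

```python
# Assumed values from the published wav2vec2-base configuration.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_feature_frames(num_samples):
    """Return the number of frames left after the stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded convolution with this kernel and stride.
        length = (length - kernel) // stride + 1
    return length


# One second of audio sampled at 16 kHz.
print(num_feature_frames(16000))
```

Under these assumptions, one second of 16 kHz audio is reduced to 49 frames, roughly one frame every 20 ms, and the transformer operates on that much shorter sequence.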

Connectionist Temporal Classification (CTC) is a character-based algorithm. During training, it aligns each character of the transcription with the speech automatically, so no time-frame alignment between the audio signal and the transcription is required. For example, if an audio clip says “Hello World”, we don’t need to know at which second the word “hello” occurs. This saves a lot of labeling effort for ASR use cases. If you are interested in how the algorithm works underneath, see [this article](https://distill.pub/2017/ctc/) for more information. 
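
The decoding side of CTC can be illustrated with a few lines of plain Python. This is only a sketch of greedy (best-path) decoding, not the training algorithm: the model emits one label per frame, repeated labels are merged, and blanks are dropped. The function name, the `_` blank symbol, and the frame sequence are made up for this example.

```python
def ctc_greedy_collapse(frame_labels, blank="_"):
    """Merge consecutive repeated labels, then drop blank labels."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Hypothetical per-frame predictions for the word "hello".
frames = list("hh_e_ll_l_oo")
print(ctc_greedy_collapse(frames))
```

The blank between the two runs of `l` is what allows the double letter in "hello" to survive the merge step.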


## Notebook Overview 

In this notebook, we use the [SUPERB (Speech processing Universal PERformance Benchmark) dataset](https://huggingface.co/datasets/superb), which is available from the Hugging Face Datasets library, to fine-tune the Wav2Vec2 model for an ASR task and deploy it as a SageMaker endpoint for real-time inference. 
<img src="images/solution_overview.png">

First, we show how to load and preprocess the SUPERB dataset in a SageMaker environment to obtain the tokenizer and feature extractor, which are required for fine-tuning the Wav2Vec2 model. Then we use SageMaker Script Mode for the training and inference steps. Script Mode lets you define and use custom training and inference scripts, while SageMaker provides supported Hugging Face framework Docker containers. For more information about training and serving Hugging Face models on SageMaker, see [Use Hugging Face with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/hugging-face.html). This functionality is available through the Hugging Face [AWS Deep Learning Containers (DLCs)](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/what-is-dlc.html). 

This notebook was tested in both SageMaker Studio and SageMaker Notebook environments with the following setup: 
- SageMaker Studio: **ml.m5.xlarge** instance with **Data Science** kernel.
- SageMaker Notebook: **ml.m5.xlarge** instance with **conda_python3** kernel. 


## Set up 
First, install the dependencies.

In [None]:
!pip install sagemaker --upgrade
!pip install "transformers>=4.4.2" 
!pip install s3fs --upgrade
!pip install datasets --upgrade 
!pip install librosa
!pip install torch # framework is required for transformer 

The **soundfile** library is used to read raw audio files and convert them into arrays. Before installing the **soundfile** Python library, the **libsndfile** package needs to be installed. 

In [None]:
!conda install -c conda-forge libsndfile -y
!pip install soundfile

Next, import the common Python libraries. Create an S3 bucket in the AWS console for this project, and replace **[BUCKET_NAME]** with the name of your bucket. 
Then get the execution role, which allows training and serving jobs to access your data. 

In [None]:
import json
import time
import boto3
import numpy as np
import random
import soundfile 
import sagemaker
import sagemaker.huggingface

BUCKET="[BUCKET_NAME]" # please use your bucket name
PREFIX = "huggingface-blog" 
ROLE = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=BUCKET)

print(f"sagemaker role arn: {ROLE}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Data Pre-processing
We use the SUPERB dataset for this notebook, which can be loaded directly from the Hugging Face [dataset library](https://huggingface.co/datasets/superb) with the `load_dataset` function. SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. The dataset also includes columns such as `speaker_id` and `chapter_id`. We remove these columns and keep only the audio files and transcriptions to fine-tune the Wav2Vec2 model for a speech recognition task, which transcribes speech to text. 

In [None]:
from datasets import load_dataset, DatasetDict
data = load_dataset("superb", 'asr', ignore_verifications=True) 
data = data.remove_columns(['speaker_id', 'chapter_id', 'id'])
# to reduce the data volume for this example, use only the test split of the original dataset for fine-tuning
data = data['test'] 

train_test = data.train_test_split(test_size=0.2)
dataset = DatasetDict({
 'train': train_test['train'],
 'test': train_test['test']})

# helper function to remove special characters and convert text to lower case
def remove_special_characters(batch):
    import re
    # raw string so the backslashes reach the regex engine unchanged
    chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).lower()
    return batch

dataset = dataset.map(remove_special_characters)
print(dataset)

In [None]:
dataset['train'][0]
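
To see what the cleaning step does without loading the dataset, the same character class can be applied to a single string. This is a small sketch; the helper name `clean_text` and the sample sentence are made up for illustration.

```python
import re

# Same character class as in the notebook's helper, written as a raw string.
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"]'


def clean_text(text):
    """Drop the listed punctuation marks and lower-case the rest."""
    return re.sub(CHARS_TO_IGNORE, "", text).lower()


sample = 'Hello, World! "How are you?"'  # made-up sentence
print(clean_text(sample))
```

Only the listed punctuation marks are removed; spaces and letters are kept, so word boundaries stay intact for the vocabulary step.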

### Build vocabulary file 
The Wav2Vec2 model uses the [CTC](https://en.wikipedia.org/wiki/Connectionist_temporal_classification) algorithm, which trains deep neural networks on sequence problems and outputs a single letter or a blank at each step. It therefore uses a character-based tokenizer, so we extract the distinct letters from the dataset and build the vocabulary file. 

In [None]:
def extract_characters(batch):
 texts = " ".join(batch["text"])
 vocab = list(set(texts))
 return {"vocab": [vocab], "texts": [texts]}

vocabs = dataset.map(extract_characters, batched=True, batch_size=-1, 
 keep_in_memory=True, remove_columns=dataset.column_names["train"])

vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))

vocab_dict = {v: k for k, v in enumerate(vocab_list)}

vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

vocab_dict["[UNK]"] = len(vocab_dict) # add "unknown" token 
vocab_dict["[PAD]"] = len(vocab_dict) # add a padding token that corresponds to CTC's "blank token"

with open('vocab.json', 'w') as vocab_file:
 json.dump(vocab_dict, vocab_file)
 
# vocab.json file will be used in training container, hence upload it to s3 bucket for later steps 
s3 = boto3.client('s3')
s3.upload_file('vocab.json', BUCKET, f'{PREFIX}/vocab.json')
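
The vocabulary logic is easier to follow on a toy input. The sketch below mirrors the steps above on two made-up transcripts; the function name `build_vocab` is hypothetical, and the characters are sorted only to make the ids reproducible (the notebook code does not sort).

```python
def build_vocab(transcripts):
    """Map each distinct character to an id, then add the special tokens."""
    chars = sorted(set(" ".join(transcripts)))  # sorted for reproducible ids
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Use a visible word delimiter instead of the space character.
    vocab["|"] = vocab.pop(" ")
    vocab["[UNK]"] = len(vocab)  # token for characters not in the vocabulary
    vocab["[PAD]"] = len(vocab)  # padding token, also used as the CTC blank
    return vocab


toy_vocab = build_vocab(["hello world", "hold"])
print(toy_vocab)
```

Because `[UNK]` and `[PAD]` are appended last, they receive the two highest ids.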

### Create tokenizer with vocabulary file and feature extractor 
The Wav2Vec2 model comes with a tokenizer and a feature extractor. We use the `vocab.json` file created in the previous step to create the Wav2Vec2CTCTokenizer. The Wav2Vec2FeatureExtractor makes sure that the dataset used for fine-tuning has the same audio sampling rate as the dataset used for pretraining. Finally, we create a Wav2Vec2Processor that wraps the feature extractor and the tokenizer into a single processor.


In [None]:
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# create Wav2Vec2 tokenizer
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

# create Wav2Vec2 feature extractor
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, 
 padding_value=0.0, do_normalize=True, return_attention_mask=False)
# create a processor pipeline 
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
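
The `do_normalize=True` flag asks the feature extractor to zero-mean, unit-variance normalize each waveform. A rough pure-Python sketch of that scaling is shown below; the helper name and the epsilon value are illustrative assumptions, not the library's code.

```python
import math


def zero_mean_unit_variance(samples, eps=1e-7):
    """Scale a waveform to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps guards against division by zero for a constant (silent) signal.
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]


normalized = zero_mean_unit_variance([0.1, 0.3, -0.2, 0.4])
print(normalized)
```

Normalizing this way makes recordings with different loudness levels look similar to the model.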

### Prepare train and test datasets

In [None]:
def extract_array_samplingrate(batch):
 batch["speech"] = batch['audio']['array'].tolist()
 batch["sampling_rate"] = batch['audio']['sampling_rate']
 batch["target_text"] = batch["text"]
 return batch

dataset = dataset.map(extract_array_samplingrate, remove_columns=dataset.column_names["train"])

In [None]:
dataset

In [None]:
# check one audio file from the training dataset
import IPython.display as ipd

rand_int = random.randint(0, len(dataset["train"]) - 1)  # randint includes both endpoints
print(dataset["train"][rand_int]["target_text"])
ipd.Audio(data=np.asarray(dataset["train"][rand_int]["speech"]), autoplay=True, rate=16000)

In [None]:
# process the dataset with the processor pipeline created above
def process_dataset(batch):
    batch["input_values"] = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0]).input_values

    # tokenize the transcripts into label ids
    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

data_processed = dataset.map(process_dataset, remove_columns=dataset.column_names["train"], batch_size=8, batched=True)

train_dataset = data_processed['train']
test_dataset = data_processed['test']
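
The `labels` column holds the integer ids of the characters in each transcript. As a simplified sketch of that mapping (not the tokenizer's actual implementation), the function below encodes a string with a toy vocabulary; `encode_transcript` and the vocabulary are made up for this example.

```python
def encode_transcript(text, vocab):
    """Turn a transcript into a list of character ids."""
    ids = []
    for ch in text:
        token = "|" if ch == " " else ch  # spaces become the word delimiter
        # Characters missing from the vocabulary map to the unknown token.
        ids.append(vocab.get(token, vocab["[UNK]"]))
    return ids


# Toy vocabulary, made up for this example.
toy_vocab = {"|": 0, "a": 1, "c": 2, "t": 3, "[UNK]": 4, "[PAD]": 5}
print(encode_transcript("a cat!", toy_vocab))
```

The exclamation mark is not in the toy vocabulary, so it is encoded as `[UNK]`; this is why the transcripts are cleaned before the vocabulary is built.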

Next we upload train and test data to S3. 

In [None]:
from datasets.filesystems import S3FileSystem
s3 = S3FileSystem()

# save train_dataset to s3
training_input_path = f's3://{BUCKET}/{PREFIX}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{BUCKET}/{PREFIX}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)

## Fine-tune the HuggingFace model (Wav2Vec2)

### Training script

Here we use the SageMaker Hugging Face DLC (Deep Learning Container) in script mode to construct the training and inference jobs. Script mode lets you write custom training and serving code while using Hugging Face framework containers that are maintained and supported by AWS. 

When we create a training job in script mode, the `entry_point` script, the hyperparameters, the dependencies (listed in requirements.txt), and the input data (train and test datasets) are copied into the container. The container then invokes the `entry_point` training script, which loads the train and test datasets, runs the training steps, and saves the model artifacts to `/opt/ml/model`. After training, the artifacts in this directory are uploaded to S3 for later model hosting.

The script is saved in the `scripts` directory; you can inspect it by running the next cell. 

In [None]:
!pygmentize scripts/train.py
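
The actual `train.py` is not reproduced here. As a minimal sketch of how a script-mode training script typically receives its inputs, the example below reads the hyperparameters as command-line arguments and the data and model locations from the environment variables SageMaker sets (`SM_MODEL_DIR`, `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`). The function name, the defaults, and the `my-bucket` value are assumptions for illustration.

```python
import argparse
import os


def parse_training_args(argv=None):
    """Read hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--model_name", type=str, default="facebook/wav2vec2-base")
    parser.add_argument("--vocab_url", type=str, default=None)
    # Data and model locations come from environment variables set by SageMaker.
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--training_dir", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test_dir", default=os.environ.get("SM_CHANNEL_TEST"))
    # Ignore any extra arguments the platform may pass.
    args, _ = parser.parse_known_args(argv)
    return args


# Placeholder values, only to exercise the parser outside a training job.
args = parse_training_args(["--epochs", "3", "--vocab_url", "s3://my-bucket/vocab.json"])
print(args.epochs, args.train_batch_size, args.vocab_url)
```

The channel names `train` and `test` match the keys passed to `fit` later in this notebook, which is where the `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` variables come from.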

### Create an Estimator and start a training job

Note that when you create a Hugging Face Estimator, you can configure hyperparameters and pass custom parameters to the training script, such as `vocab_url` in this example. You can also specify metric definitions in the Estimator; SageMaker then parses the training logs for these metrics and sends them to CloudWatch so you can monitor and track training performance. 

In [None]:
from sagemaker.huggingface import HuggingFace

# create a unique id to tag the training job, model name and endpoint name
id = int(time.time())

TRAINING_JOB_NAME = f"huggingface-wav2vec2-training-{id}"
print('Training job name: ', TRAINING_JOB_NAME)

vocab_url = f"s3://{BUCKET}/{PREFIX}/vocab.json"
hyperparameters = {'epochs':10, # you can increase the epoch number to improve model accuracy
 'train_batch_size': 8,
 'model_name': "facebook/wav2vec2-base",
 'vocab_url': vocab_url
 }

# define metrics definitions
metric_definitions=[
 {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
 {'Name': 'eval_wer', 'Regex': "'eval_wer': ([0-9]+(.|e\-)[0-9]+),?"},
 {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
 {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
 {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}]
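
A metric regex can be checked locally before starting a job. The sketch below applies the `eval_wer` pattern to a made-up log line written in the style of the evaluation output; the numbers are invented, and the value of interest is taken from the first capture group.

```python
import re

# Same pattern as in metric_definitions, written as a raw string.
wer_pattern = r"'eval_wer': ([0-9]+(.|e\-)[0-9]+),?"

# Made-up log line, only for testing the pattern.
log_line = "{'eval_loss': 0.4512, 'eval_wer': 0.3215, 'epoch': 10.0}"

match = re.search(wer_pattern, log_line)
print(match.group(1))
```

Here the first group captures `0.3215`. If a pattern returns no match on a representative log line, the metric will not show up in CloudWatch either.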

We use the [HuggingFace estimator class](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html) to train our model. When creating the estimator, the following parameters need to be specified. 

* **entry_point**: the name of the training script. It loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model. 
* **source_dir**: the location of the training scripts. 
* **transformers_version**: the Hugging Face transformers library version we want to use.
* **pytorch_version**: the PyTorch version that is compatible with the transformers library. 

**Instance Selection**: For this use case and dataset, we use one ml.p3.2xlarge instance, and the training job finishes within two hours. You can select a more powerful instance to reduce the training time, but it will cost more. 

In [None]:
OUTPUT_PATH= f's3://{BUCKET}/{PREFIX}/{TRAINING_JOB_NAME}/output/'

huggingface_estimator = HuggingFace(entry_point='train.py',
 source_dir='./scripts',
 output_path= OUTPUT_PATH, 
 instance_type='ml.p3.2xlarge',
 instance_count=1,
 transformers_version='4.6.1',
 pytorch_version='1.7.1',
 py_version='py36',
 role=ROLE,
 hyperparameters = hyperparameters,
 metric_definitions = metric_definitions,
 )

#Starts the training job using the fit function, training takes approximately 2 hours to complete.
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path},
 job_name=TRAINING_JOB_NAME)

From the training logs you can see that, after 10 epochs of training, the evaluation metric WER (word error rate) reaches around 0.32 on this subset of the SUPERB dataset. You can increase the number of epochs or use the full dataset to improve the model further. 
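
For reference, the `wer` metric is the word error rate: the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a plain-Python version for illustration, assuming a non-empty reference; it is not the implementation used by the training script, and the sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the edit distance between the reference words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word and one missing word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A value of 0.32 therefore means that roughly one word in three needs to be changed to match the reference.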

## Deploy the model as an endpoint on SageMaker and run inference

### Inference script

We use the [SageMaker Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) to host our fine-tuned model. It provides default functions for preprocessing, prediction, and postprocessing for certain tasks. However, the default capabilities cannot run inference on our model properly, so we define the following functions in the `inference.py` script to override the defaults with custom logic.

* `model_fn(model_dir)`: overrides the default method for loading the model. The returned model is used in `predict()` for predictions. It receives the argument `model_dir`, the path to your unzipped model.tar.gz.
* `input_fn(input_data, content_type)`: overrides the default method for preprocessing. The returned data is used in the `predict()` method for predictions. The inputs are `input_data`, the raw body of your request, and `content_type`, the content type from the request header.
* `predict_fn(processed_data, model)`: overrides the default method for predictions. The returned predictions are used in the `postprocess()` method. The input is `processed_data`, the result of the `preprocess()` method.
* `output_fn(prediction, accept)`: overrides the default method for postprocessing. The return value is the response to your request (e.g. JSON). The inputs are `prediction`, the result of the `predict()` method, and `accept`, the accept type from the HTTP request, e.g. application/json.

**Note**: As of January 2022, the Inference Toolkit supports inference for architectures whose names end with: 'TapasForQuestionAnswering', 'ForQuestionAnswering', 'ForTokenClassification', 'ForSequenceClassification', 'ForMultipleChoice', 'ForMaskedLM', 'ForCausalLM', 'ForConditionalGeneration', 'MTModel', 'EncoderDecoderModel', 'GPT2LMHeadModel', 'T5WithLMHeadModel'. 

The script is saved in the `scripts` directory; you can inspect it by running the next cell. 

In [None]:
!pygmentize scripts/inference.py
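
The real `inference.py` is not reproduced here. As a rough sketch of the request and response handling only, the example below shows how `input_fn` and `output_fn` could parse and serialize JSON. The request keys `speech_array` and `sampling_rate` match the payload sent later in this notebook; the `transcription` response key is a made-up name. `model_fn` and `predict_fn` are left out because they depend on the model libraries.

```python
import json


def input_fn(input_data, content_type):
    """Parse the JSON request body into a waveform and its sampling rate."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    request = json.loads(input_data)
    return request["speech_array"], request["sampling_rate"]


def output_fn(prediction, accept):
    """Serialize the transcription as JSON (the response key is hypothetical)."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps({"transcription": prediction})


# Tiny fake request, only to exercise the two functions.
body = json.dumps({"speech_array": [0.0, 0.1, -0.1], "sampling_rate": 16000})
speech, rate = input_fn(body, "application/json")
print(len(speech), rate)
print(output_fn("hello world", "application/json"))
```

Rejecting unexpected content types early gives the caller a clear error instead of a failure deeper in the prediction code.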

### Create a HuggingFaceModel from the estimator 

We use the [HuggingFaceModel class](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-model) to create a model object, which can be deployed to a SageMaker endpoint. When creating the model, the following parameters need to be specified. 

* **entry_point**: the name of the inference script. The methods defined in the inference script are used by the endpoint. 
* **source_dir**: the location of the inference scripts. 
* **transformers_version**: the Hugging Face transformers library version we want to use. It should be consistent with the training step. 
* **pytorch_version**: the PyTorch version that is compatible with the transformers library. It should be consistent with the training step.
* **model_data**: the Amazon S3 location of a SageMaker model data `.tar.gz` file.


In [None]:
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
 entry_point = 'inference.py',
 source_dir='./scripts',
 name = f'huggingface-wav2vec2-model-{id}',
 transformers_version='4.6.1', 
 pytorch_version='1.7.1', 
 py_version='py36',
 model_data=huggingface_estimator.model_data,
 role=ROLE,
 )

### Deploy the model on an endpoint 

Next, we create a predictor by using the `model.deploy` function. You can change the instance count and instance type based on your performance requirements. 

In [None]:
predictor = huggingface_model.deploy(
 initial_instance_count=1,
 instance_type="ml.g4dn.xlarge", 
 endpoint_name = f'huggingface-wav2vec2-endpoint-{id}'
)

### Run inference on audio files 

After the endpoint is deployed, you can run the prediction tests below to check the model performance. 

In [None]:
# run inference on an audio file downloaded from an S3 bucket, or on a local audio file
import soundfile

# `s3` was rebound to an S3FileSystem above, so use a boto3 client for the download
# boto3.client('s3').download_file(BUCKET, 'huggingface-blog/sample_audio/xxxxxx.wav', 'downloaded.wav')
# file_name ='downloaded.wav'

# download a sample audio file using the link below
!wget https://datashare.ed.ac.uk/bitstream/handle/10283/343/MKH800_19_0001.wav
 
file_name ='MKH800_19_0001.wav'

speech_array, sampling_rate = soundfile.read(file_name)

# play back at the rate read from the file instead of a hard-coded value
ipd.Audio(data=np.asarray(speech_array), autoplay=True, rate=sampling_rate)

In [None]:
%%time
json_request_data = {"speech_array": speech_array.tolist(),
 "sampling_rate": sampling_rate}

prediction = predictor.predict(json_request_data)
print(prediction)

**Please note**: because we are using a real-time inference endpoint, the maximum payload size is 6 MB. If you see an error message like "Received client error (413) from primary and could not load the entire response body", use the code below to check your payload size. 

In [None]:
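# the request is sent as JSON, so measure the serialized body rather than the in-memory array
len(json.dumps(json_request_data).encode("utf-8"))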
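
Since the request is sent as JSON, the serialized size is what counts against the limit, and it is usually much larger than the raw array because every sample is written out as text. A small helper can compare the two before calling the endpoint. This is a sketch: the function name and the synthetic clip are made up, and reading the 6 MB limit as 6 * 1024 * 1024 bytes is an assumption.

```python
import json

# The 6 MB limit quoted above, interpreted as 6 * 1024 * 1024 bytes (assumption).
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def json_payload_size(speech, sampling_rate):
    """Return the size in bytes of the JSON-encoded request."""
    body = json.dumps({"speech_array": list(speech), "sampling_rate": sampling_rate})
    return len(body.encode("utf-8"))


# Synthetic one-second clip at 16 kHz, only to exercise the helper.
fake_speech = [0.001 * (i % 100) for i in range(16000)]
size = json_payload_size(fake_speech, 16000)
print(size, size <= MAX_PAYLOAD_BYTES)
```

If the check fails for a real clip, the audio has to be shortened or split before it is sent to the real-time endpoint.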

## Cleanup

Finally, please remember to delete the Amazon SageMaker endpoint to avoid charges:

In [None]:
predictor.delete_endpoint()