# Training the Spoken Language Classifier
In this notebook, you will:
1. Create a custom container that will be used for training and deploying
2. Train the spoken language classifier by initiating a training job

### Import libraries and load AWS credentials

In [None]:
!pip install -U sagemaker

In [14]:
import sagemaker
from sagemaker.estimator import Estimator
import boto3
import os

In [None]:
role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

### Create estimator using custom container

Build the custom container that will be used for training and deploying. **Make sure that your role has access to ECR (attach the AmazonEC2ContainerRegistryFullAccess policy to your role).**

In [None]:
!sh build_and_push.sh

Define the container image URI, the S3 location of the VoxForge dataset, and the S3 path where trained models are saved.

In [None]:
image_uri = f'{account_id}.dkr.ecr.{region}.amazonaws.com/spoken-language-detection'

s3_voxforge_prefix = os.path.join('s3://' + bucket, 'voxforge')
s3_output_path = os.path.join('s3://' + bucket, 'models')
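S3 URIs always use forward slashes, whereas `os.path.join` uses the separator of the local operating system. That works on a Linux notebook instance but would yield backslashes on Windows. As a minimal sketch, a small helper (the name `s3_join` and the bucket name are hypothetical, not part of this project) can build the same URIs independently of the platform:

```python
def s3_join(bucket, *parts):
    """Build an s3:// URI from a bucket name and key parts (hypothetical helper)."""
    # Strip stray slashes so joined parts never produce '//' inside the key.
    key = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# Placeholder bucket name used only for illustration.
example_prefix = s3_join("my-example-bucket", "voxforge")
example_manifest = s3_join("my-example-bucket", "voxforge/", "train_manifest.csv")
print(example_prefix)
print(example_manifest)
```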

Create the estimator and define its hyperparameters (the default values provided below are the recommended settings):
* sample_rate - sample rate (in Hz) at which audio files are loaded
* n_samples - number of samples each audio file is truncated to (n_samples / sample_rate = duration of audio observed by the model; with the values below, 80000 / 16000 = 5 seconds)
* languages - list of languages to train model on
* feature_type - type of features to extract from audio files (melspectrogram vs. MFCC)
* normalize - whether or not to normalize the features between 0 and 1
* standardize - whether or not to standardize the features to have zero mean and unit variance
* standardize_mean - if standardize = True, subtract this value from the features
* standardize_std - if standardize = True, divide the features by this value
* random_pitch_shift - whether or not to randomly pitch shift the audio
* pitch_shift_range - if random_pitch_shift = True, defines the range in semitones to randomly shift the audio
* random_crop - if an audio file is longer than n_samples, whether to take a random crop of n_samples or the first n_samples of the audio
* frequency_masking - whether or not to randomly set a band of frequencies to zero (data augmentation)
* time_masking - whether or not to randomly set a band of time steps to zero (data augmentation)
* source_weighted_normalization - whether or not to randomly sample the training dataset such that each speaker is seen by the model equally, as well as each language. If False, the training dataset will only be randomly sampled such that each language is seen equally
* source_normalized_evaluation - whether or not to evaluate the model based on metrics on a per speaker basis. This prevents the evaluation from being biased towards speakers who may have many more audio files than others
* epochs - number of epochs to train
* batch_size - size of each training batch
* lr - initial learning rate
* hidden_dim - hidden dimension before final dense layer
* logging_iters - interval of training iterations to display training stats
* early_stopping_epochs - training will stop if model does not improve after this many epochs
* s3_prefix - S3 path of the VoxForge dataset
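To make source_weighted_normalization more concrete, here is a minimal sketch of one way such sampling weights could be computed. It is an assumption for illustration, not the container's actual implementation: the function name and the toy manifest are hypothetical. Every language receives the same total weight, and within a language every speaker receives an equal share, regardless of how many files that speaker recorded.

```python
from collections import Counter


def source_weighted_sample_weights(samples):
    """Return one sampling weight per (language, speaker) entry.

    Hypothetical scheme: equal total weight per language, split equally
    among that language's speakers, then among each speaker's files.
    """
    # Number of files for each (language, speaker) pair.
    files_per_speaker = Counter(samples)
    # Number of distinct speakers for each language.
    speakers_per_language = Counter(lang for lang, _ in files_per_speaker)
    n_languages = len(speakers_per_language)

    return [
        1.0 / (n_languages * speakers_per_language[lang] * files_per_speaker[(lang, spk)])
        for lang, spk in samples
    ]


# Toy manifest: one (language, speaker) pair per audio file.
samples = [
    ("en", "alice"), ("en", "alice"), ("en", "alice"), ("en", "bob"),
    ("de", "carl"), ("de", "dora"),
]
weights = source_weighted_sample_weights(samples)
for sample, weight in zip(samples, weights):
    print(sample, round(weight, 4))
```

Weights like these could be passed to a weighted random sampler so that a speaker with many recordings does not dominate training.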

In [None]:
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path=s3_output_path,
    max_run=5*24*60*60,  # maximum training time in seconds (5 days)
    hyperparameters={
        'sample_rate': 16000,
        'n_samples': 80000,
        'languages': ['en', 'es', 'it', 'fr', 'de', 'ru'],
        'feature_type': 'mel',
        'normalize': True,
        'standardize': True,
        'standardize_mean': 0.4630,
        'standardize_std': 0.2031,
        'random_pitch_shift': True,
        'pitch_shift_range': (-5, 5),
        'random_crop': True,
        'frequency_masking': True,
        'time_masking': True,
        'source_weighted_normalization': True,
        'source_normalized_evaluation': True,
        'epochs': 50,
        'batch_size': 32,
        'lr': 0.0001,
        'hidden_dim': 512,
        'logging_iters': 100,
        'early_stopping_epochs': 5,
        's3_prefix': s3_voxforge_prefix
    }
)
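SageMaker passes hyperparameters to a custom container as strings, stored in `/opt/ml/input/config/hyperparameters.json`. How this project's training script decodes them is not shown in this notebook. The sketch below assumes the values arrive as Python-literal strings (for example `'True'` or `'(-5, 5)'`) and uses a hypothetical `parse_hyperparameters` helper to turn them back into typed values:

```python
import ast


def parse_hyperparameters(raw):
    """Convert string-valued hyperparameters back into Python values."""
    parsed = {}
    for name, value in raw.items():
        try:
            # Handles numbers, booleans, lists, and tuples written as literals.
            parsed[name] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Plain strings such as 'mel' or an S3 URI are kept as-is.
            parsed[name] = value
    return parsed


# Example of the string form a container might receive (assumed encoding).
raw = {
    "sample_rate": "16000",
    "normalize": "True",
    "pitch_shift_range": "(-5, 5)",
    "feature_type": "mel",
    "languages": "['en', 'es']",
    "lr": "0.0001",
}
hp = parse_hyperparameters(raw)
print(hp)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for values that come from outside the script.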

### Train model
Start the training job. If random_pitch_shift is set to True, training can take a few days because the current librosa implementation of pitch_shift is very slow. However, pitch shifting leads to the best model performance, so setting it to False is not recommended.

Once the training job has started, this notebook can be shut down.

In [None]:
estimator.fit({
    'train': os.path.join(s3_voxforge_prefix, 'train_manifest.csv'),
    'validation': os.path.join(s3_voxforge_prefix, 'val_manifest.csv')
})
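Each key passed to `fit` becomes an input channel, and SageMaker makes a channel's data available inside the training container under `/opt/ml/input/data/<channel name>`. As a rough sketch, assuming the default File input mode and that the manifests keep their file names (how this project's training script actually reads them is not shown here), the paths could be resolved like this:

```python
from pathlib import PurePosixPath

# Root directory where SageMaker places input channels in the container.
INPUT_DATA_DIR = PurePosixPath("/opt/ml/input/data")


def channel_dir(channel_name):
    """Return the container directory for a given input channel."""
    return INPUT_DATA_DIR / channel_name


# Assumed locations of the two manifests passed to estimator.fit().
train_manifest = channel_dir("train") / "train_manifest.csv"
val_manifest = channel_dir("validation") / "val_manifest.csv"
print(train_manifest)
print(val_manifest)
```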