# Data Preparation
In this notebook, you will:
1. Download the [VoxForge](http://www.voxforge.org/home/downloads) dataset locally
2. Extract metadata about the dataset
3. Create the train, validation, and test splits
4. Upload everything to S3

### Install dependencies

Install torchaudio, a PyTorch library with tools for working with audio data.

In [None]:
!pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html

Torchaudio uses libsndfile as its backend, so we need to make sure it is installed (it is not included by default on SageMaker notebook instances).

In [None]:
%%bash

cd ../
wget http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz
tar -xzf libsndfile-1.0.28.tar.gz
cd libsndfile-1.0.28
./configure --prefix=/usr --disable-static --docdir=/usr/share/doc/libsndfile-1.0.28
sudo make install
cd ../
rm libsndfile-1.0.28.tar.gz

Import the libraries and make sure that torchaudio imports correctly.

In [None]:
import os
import pandas as pd
import tarfile
from collections import Counter
from sklearn.model_selection import train_test_split

import torch
import torchaudio

### Download dataset

Define a location to download the VoxForge dataset to.

In [None]:
voxforge_dir = 'voxforge'

Download the dataset. This will take some time. Downloading and building the dataset locally requires 100 GB of disk space (only 60 GB is used permanently), so you may need to attach a larger EBS volume if you are using a notebook instance.

In [None]:
%%bash -s "$voxforge_dir"

# -p avoids an error if the directory already exists (e.g. when re-running the cell)
mkdir -p "$1"
cd "$1"

# download a text file that contains the URLs to all of the audio data
wget https://storage.googleapis.com/tfds-data/downloads/voxforge/voxforge_urls.txt

# download each URL
wget -i voxforge_urls.txt -x
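
The `-x` flag tells `wget` to recreate the URL's host and path as local directories, which is why the extraction step below looks under `www.repository.voxforge1.org/downloads`. The sketch below approximates that mapping in Python. The helper `local_path_for` and the archive name in the example URL are made up for illustration and are not part of the notebook.

```python
import posixpath
from urllib.parse import urlparse

def local_path_for(url, root="voxforge"):
    """Approximate where `wget -x` stores a URL: <root>/<host>/<path>."""
    parsed = urlparse(url)
    return posixpath.join(root, parsed.netloc, parsed.path.lstrip("/"))

# Hypothetical archive name, used only to show the directory layout.
example_url = (
    "http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/"
    "16kHz_16bit/speaker-20100101-abc.tgz"
)
print(local_path_for(example_url))
```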

### Extract data and collect metadata

The dataset is downloaded as .tgz files so we need to extract them. This will result in a mixture of .wav and .flac audio files.

In [None]:
downloads_dir = os.path.join(voxforge_dir, 'www.repository.voxforge1.org/downloads')

ct = 0
files = {}  # maps each language to the list of its extracted audio files
for lang in os.listdir(downloads_dir):
    files[lang] = []
    tar_path = os.path.join(downloads_dir, lang, 'Trunk/Audio/Main/16kHz_16bit')
    tar_files = [x for x in os.listdir(tar_path) if '.tgz' in x]

    for tar_name in tar_files:
        # extract the archive next to the .tgz file
        with tarfile.open(os.path.join(tar_path, tar_name)) as tar:
            tar.extractall(tar_path)

        # each archive holds its audio in either a 'wav' or a 'flac' subdirectory
        audio_dir = os.path.join(tar_path, tar_name.split('.tgz')[0], 'wav')
        if not os.path.exists(audio_dir):
            audio_dir = os.path.join(tar_path, tar_name.split('.tgz')[0], 'flac')

        # store paths relative to voxforge_dir
        extracted_files = [
            os.path.relpath(os.path.join(audio_dir, f), voxforge_dir) for f in os.listdir(audio_dir)]
        files[lang] += extracted_files

        ct += len(extracted_files)
        print('Extracted audio files: {}'.format(ct), flush=True, end='\r')

The files are buried in a deep directory tree, so to simplify later steps, let's collect some metadata on each file, such as its duration and file path. Some of the files are also broken (all zeros or NaN values), so we should flag these before doing any preprocessing.

In [None]:
ct = 0
metadata = []
for lang in files.keys():
    for f in files[lang]:
        # x is the waveform tensor, sr is the sample rate in Hz
        x, sr = torchaudio.load(os.path.join(voxforge_dir, f))
        t = (x.shape[-1] / sr)  # duration in seconds
        is_nan = torch.isnan(x).any().item()
        is_zero = (x.sum() == 0.0).item()
        metadata.append({
            'fname' : f,
            'class' : lang,
            'time' : t,
            'is_nan' : is_nan,
            'is_zero' : is_zero
        })
        ct += 1
        print('Files checked : {}'.format(ct), flush=True, end='\r')
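
To make the duration and all-zero checks concrete, here is a minimal sketch on a synthetic file. It swaps torchaudio for Python's standard-library `wave` module so that it runs without extra dependencies, and it only handles 16-bit PCM WAV files. The helper `wav_metadata` and the silent test file are assumptions made for illustration. Integer PCM samples cannot be NaN, so the NaN check from the cell above has no counterpart here.

```python
import os
import struct
import tempfile
import wave

def wav_metadata(path):
    """Return the duration in seconds and an all-zero flag for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        n_frames = wav.getnframes()
        sample_rate = wav.getframerate()
        n_channels = wav.getnchannels()
        raw = wav.readframes(n_frames)
    # decode little-endian signed 16-bit samples
    samples = struct.unpack("<{}h".format(n_frames * n_channels), raw)
    return {
        "time": n_frames / sample_rate,
        "is_zero": all(s == 0 for s in samples),
    }

# Write one second of silence at 16 kHz to a temporary file (synthetic data).
tmp_dir = tempfile.mkdtemp()
silent_path = os.path.join(tmp_dir, "silence.wav")
with wave.open(silent_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<16000h", *([0] * 16000)))

info = wav_metadata(silent_path)
print(info)
```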

Save this metadata to a single CSV file.

In [None]:
metadata = pd.DataFrame(metadata, columns=['fname', 'class', 'time', 'is_zero', 'is_nan'])
metadata.to_csv(os.path.join(voxforge_dir, 'voxforge_metadata.csv'), index=False)

### Train, val, and test split

We will now split the metadata into train, validation, and test splits and also filter out any broken audio files or audio files that are too short.

In [None]:
test_split = 0.2
val_split = 0.1
min_seconds = 2.0

In [None]:
metadata = pd.read_csv(os.path.join(voxforge_dir, 'voxforge_metadata.csv'))
metadata = metadata[(metadata.time > min_seconds) & (~metadata.is_zero) & (~metadata.is_nan)]\
 .reset_index(drop=True)

The audio data was recorded by a number of speakers, each of whom may have recorded multiple files. Some speakers recorded only one audio file, whereas others recorded thousands. This imbalance in files per speaker may lead the model to learn biases towards certain speakers. Let's identify the speakers now by creating a "source" column.

In [None]:
metadata['source'] = metadata['fname']\
 .apply(lambda x : '/'.join(x.split('/')[:-3]) + '/' + x.split('/')[-3].split('-')[0])
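
As a worked example of what this lambda produces, the sketch below applies the same string operations to a single path. The file name is made up, and the `<speaker>-<date>-<id>` pattern for archive directory names is an assumption inferred from the `split('-')[0]` step above.

```python
def speaker_source(fname):
    """Same logic as the lambda above, written as a named function."""
    parts = fname.split('/')
    # keep everything above the archive directory, then append the speaker prefix
    return '/'.join(parts[:-3]) + '/' + parts[-3].split('-')[0]

# Hypothetical file path, used only for illustration.
example = (
    "www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/"
    "16kHz_16bit/anon-20100101-abc/wav/de1-01.wav"
)
print(speaker_source(example))
# www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/anon
```

Keeping the language directory in the source string means that two speakers with the same name in different languages are treated as different sources.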

When splitting the dataset into our train, validation, and test splits, we have to make sure that the same speakers do not occur across the dataset splits, otherwise our evaluation may be biased towards these speakers. Thus, we will perform a train-validation-test split based on the speaker source.

In [None]:
def train_val_test_split_by_source(df, val_split=0.1, test_split=0.2):
    """ Splits the dataset by speaker source such that the same speaker doesn't occur across data splits """

    train_df, val_df, test_df = [], [], []

    # loop through each language and create splits such that each split has equal proportion of languages
    for lang in df['class'].unique():
        temp = df[df['class'] == lang]

        # get list of unique speakers
        sources = temp['source'].unique()

        # create train, val, test splits on speakers
        train_sources, test_sources = train_test_split(
            sources, test_size=test_split, random_state=2
        )
        train_sources, val_sources = train_test_split(
            train_sources, test_size=val_split/(1 - test_split), random_state=2
        )

        train_df.append(temp[temp['source'].isin(train_sources)].reset_index(drop=True))
        val_df.append(temp[temp['source'].isin(val_sources)].reset_index(drop=True))
        test_df.append(temp[temp['source'].isin(test_sources)].reset_index(drop=True))

    # combine the per-language frames and shuffle the rows of each split
    train_df = pd.concat(train_df, ignore_index=True)[['fname', 'source', 'class', 'time']]\
        .sample(frac=1, random_state=0).reset_index(drop=True)
    val_df = pd.concat(val_df, ignore_index=True)[['fname', 'source', 'class', 'time']]\
        .sample(frac=1, random_state=0).reset_index(drop=True)
    test_df = pd.concat(test_df, ignore_index=True)[['fname', 'source', 'class', 'time']]\
        .sample(frac=1, random_state=0).reset_index(drop=True)
    return train_df, val_df, test_df

def describe(df):
    """ Function to list basic statistics about the data """
    print(f'# rows : {len(df)}')

    largest_source_df = df[['class', 'source']]\
        .groupby('class')\
        .agg({'source' : lambda x : max(Counter(x).values())})\
        .reset_index(drop=False)\
        .rename(columns={'source' : 'largest_source'})

    n_files_df = pd.DataFrame(list(df['class'].value_counts().items()), columns=['class', 'n_files'])

    n_sources_df = df.groupby('class')\
        .agg({'source' : lambda x : len(set(x))})\
        .reset_index(drop=False)\
        .rename(columns={'source' : 'n_sources'})

    stats = pd.merge(
        pd.merge(n_files_df, n_sources_df, how='inner', on='class'),
        largest_source_df, how='inner', on='class')

    stats = stats.sort_values('class')

    print(stats.to_string(index=False))
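
The second call to `train_test_split` in `train_val_test_split_by_source` uses `val_split/(1 - test_split)` rather than `val_split` because it only sees the speakers left after the test speakers were removed. A short arithmetic sketch with the values used in this notebook shows why the rescaling is needed. The variable names `remaining` and `val_fraction_of_remaining` are introduced here only for illustration.

```python
test_split = 0.2
val_split = 0.1

# fraction of speakers left after the test speakers are set aside
remaining = 1 - test_split

# share of the remaining speakers that must go to validation
val_fraction_of_remaining = val_split / remaining

# validation and train shares expressed relative to all speakers
overall_val = val_fraction_of_remaining * remaining
overall_train = remaining - overall_val

print(round(val_fraction_of_remaining, 4))
print(round(overall_val, 4), round(overall_train, 4))
```

Note that these fractions apply to speakers, not files. Because speakers contribute very different numbers of recordings, the file-level proportions of the splits can differ from the 70/10/20 target.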

In [None]:
describe(metadata)

In [None]:
train_df, val_df, test_df = train_val_test_split_by_source(metadata, val_split=val_split, test_split=test_split)

In [None]:
describe(train_df)
describe(val_df)
describe(test_df)
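
Since the whole point of splitting by source is to keep speakers from leaking across splits, it can be worth checking that property directly. The helper below, `assert_disjoint_sources`, is a hypothetical addition and not part of the original notebook. It accepts any iterables of source names and is demonstrated on toy data.

```python
from itertools import combinations

def assert_disjoint_sources(**splits):
    """Raise AssertionError if any speaker source appears in more than one split."""
    source_sets = {name: set(sources) for name, sources in splits.items()}
    # compare every pair of splits
    for (name_a, set_a), (name_b, set_b) in combinations(source_sets.items(), 2):
        shared = set_a & set_b
        assert not shared, f"{name_a} and {name_b} share sources: {sorted(shared)}"

# Toy speaker lists (made up for illustration).
assert_disjoint_sources(
    train=["de/anna", "de/bert"],
    val=["de/carl"],
    test=["de/dora"],
)
print("no speaker overlap")
```

With the splits created above, the same check would be called as `assert_disjoint_sources(train=train_df['source'], val=val_df['source'], test=test_df['source'])`.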

Save the metadata for the train, validation, and test splits to CSV. These manifests only contain the metadata but will be used to load the audio files during training.

In [None]:
train_df.to_csv(os.path.join(voxforge_dir, 'train_manifest.csv'), index=False)
val_df.to_csv(os.path.join(voxforge_dir, 'val_manifest.csv'), index=False)
test_df.to_csv(os.path.join(voxforge_dir, 'test_manifest.csv'), index=False)

### Upload data to S3

Upload all data (metadata + audio files) to the default S3 bucket.

In [None]:
import sagemaker

In [None]:
sess = sagemaker.Session() 
bucket_name = sess.default_bucket() 
print(f"Bucket name : {bucket_name}")

In [None]:
sess.upload_data(voxforge_dir, key_prefix=voxforge_dir) 