{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preparation\n", "In this notebook, you will:\n", "1. Download the [VoxForge](http://www.voxforge.org/home/downloads) dataset locally\n", "2. Extract metadata about the dataset\n", "3. Create the train, validation, and test splits\n", "4. Upload everything to s3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Install dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Install torchaudio, which is a pytorch library with tools for working with audio data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Torchaudio uses libsndfile as the backend so we will need to ensure that this is installed (this does not come by default on sagemaker notebook instances)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "\n", "cd ../\n", "wget http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz\n", "tar -xzf libsndfile-1.0.28.tar.gz\n", "cd libsndfile-1.0.28\n", "./configure --prefix=/usr --disable-static --docdir=/usr/share/doc/libsndfile-1.0.28\n", "sudo make install\n", "cd ../\n", "rm libsndfile-1.0.28.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import libraries and make sure that torchaudio imports correctly" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd\n", "import tarfile\n", "from collections import Counter\n", "from sklearn.model_selection import train_test_split\n", "\n", "import torch\n", "import torchaudio" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a location to download VoxForge dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "voxforge_dir = 'voxforge'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download dataset. This will take some time. Downloading and building the dataset locally requires 100GB (only 60GB is used permanently) so attaching an EBS drive may be required if using a notebook instance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$voxforge_dir\"\n", "\n", "mkdir $1\n", "cd $1\n", "\n", "# download a text file that contains the URLs to all of the audio data\n", "wget https://storage.googleapis.com/tfds-data/downloads/voxforge/voxforge_urls.txt\n", "\n", "# download each URL\n", "wget -i voxforge_urls.txt -x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract data and collect metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset is downloaded as .tgz files so we need to extract them. This will result in a mixture of .wav and .flac audio files." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "downloads_dir = os.path.join(voxforge_dir, 'www.repository.voxforge1.org/downloads')\n", "\n", "ct = 0\n", "files = {}\n", "for lang in os.listdir(downloads_dir):\n", " files[lang] = []\n", " tar_path = os.path.join(os.path.join(downloads_dir, lang), 'Trunk/Audio/Main/16kHz_16bit')\n", " tar_files = [x for x in os.listdir(tar_path) if '.tgz' in x]\n", "\n", " for tar_name in tar_files:\n", " tar = tarfile.open(os.path.join(tar_path, tar_name))\n", " tar.extractall(tar_path)\n", " tar.close()\n", " audio_dir = os.path.join(os.path.join(tar_path, tar_name.split('.tgz')[0]), 'wav')\n", " if not os.path.exists(audio_dir):\n", " audio_dir = os.path.join(os.path.join(tar_path, tar_name.split('.tgz')[0]), 'flac')\n", "\n", " extracted_files = [\n", " os.path.relpath(os.path.join(audio_dir, f), voxforge_dir) for f in os.listdir(audio_dir)]\n", " files[lang] += extracted_files\n", " \n", " ct += len(extracted_files)\n", " print('Extracted audio files: {}'.format(ct), flush=True, end='\\r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The files are embedded in a complex directory tree and so to simplify, let's collect some metadata on the files, such as duration and file path location. Also, some of the files are broken (all zeros or NaN values) so we should identify these files before doing any preprocessing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ct = 0\n", "metadata = []\n", "for lang in files.keys():\n", " for f in files[lang]:\n", " x, sr = torchaudio.load(os.path.join(voxforge_dir, f))\n", " t = (x.shape[-1] / sr)\n", " is_nan = torch.isnan(x).any().item()\n", " is_zero = (x.sum() == 0.0).item()\n", " metadata.append({\n", " 'fname' : f,\n", " 'class' : lang,\n", " 'time' : t,\n", " 'is_nan' : is_nan,\n", " 'is_zero' : is_zero\n", " })\n", " ct += 1\n", " print('Files checked : {}'.format(ct), flush=True, end='\\r')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save this metadata into a single CSV" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metadata = pd.DataFrame(metadata, columns=['fname', 'class', 'time', 'is_zero', 'is_nan'])\n", "metadata.to_csv(os.path.join(voxforge_dir, 'voxforge_metadata.csv'), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train, val, and test split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now split the metadata into train, validation, and test splits and also filter out any broken audio files or audio files that are too short." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_split = 0.2\n", "val_split = 0.1\n", "min_seconds = 2.0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metadata = pd.read_csv(os.path.join(voxforge_dir, 'voxforge_metadata.csv'))\n", "metadata = metadata[(metadata.time > min_seconds) & (~metadata.is_zero) & (~metadata.is_nan)]\\\n", " .reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The audio data was recorded by a number of speakers where each speaker may have recorded multiple files. Some speakers only recorded one audio file whereas others have recorded thousands. This is imbalance in files per speaker may lead the model to learning biases towards certain speakers. 
, { "cell_type": "markdown", "metadata": {}, "source": [ "The audio data was recorded by a number of speakers, and each speaker may have recorded multiple files. Some speakers recorded only a single audio file whereas others recorded thousands. This imbalance in files per speaker may lead the model to learn biases towards certain speakers. Let's identify these speakers now by creating a \"source\" column." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# speaker IDs are encoded in the archive directory names (e.g. '<speaker>-<date>-<id>'),\n", "# so key each file on its parent path plus the part of that name before the first '-'\n", "metadata['source'] = metadata['fname']\\\n", "    .apply(lambda x : '/'.join(x.split('/')[:-3]) + '/' + x.split('/')[-3].split('-')[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When splitting the dataset into train, validation, and test splits, we have to make sure that the same speaker does not occur in more than one split; otherwise our evaluation may be biased towards these speakers. Thus, we will perform the train-validation-test split on the speaker sources, and verify below that the resulting splits are disjoint." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def train_val_test_split_by_source(df, val_split=0.1, test_split=0.2):\n", "    \"\"\" Splits the dataset by speaker source such that the same speaker doesn't occur across data splits \"\"\"\n", "\n", "    train_df, val_df, test_df = [], [], []\n", "\n", "    # loop through each language and create splits such that each split has an equal proportion of languages\n", "    for lang in df['class'].unique():\n", "        temp = df[df['class'] == lang]\n", "\n", "        # get list of unique speakers\n", "        sources = temp['source'].unique()\n", "\n", "        # create train, val, test splits on speakers\n", "        train_sources, test_sources = train_test_split(\n", "            sources, test_size=test_split, random_state=2\n", "        )\n", "        train_sources, val_sources = train_test_split(\n", "            train_sources, test_size=val_split/(1 - test_split), random_state=2\n", "        )\n", "\n", "        train_df.append(temp[temp['source'].isin(train_sources)].reset_index(drop=True))\n", "        val_df.append(temp[temp['source'].isin(val_sources)].reset_index(drop=True))\n", "        test_df.append(temp[temp['source'].isin(test_sources)].reset_index(drop=True))\n", "\n", "    # concatenate the per-language splits and shuffle\n", "    train_df = pd.concat(train_df, ignore_index=True)[['fname', 'source', 'class', 'time']]\\\n", "        .sample(frac=1, random_state=0).reset_index(drop=True)\n", "    val_df = pd.concat(val_df, ignore_index=True)[['fname', 'source', 'class', 'time']]\\\n", "        .sample(frac=1, random_state=0).reset_index(drop=True)\n", "    test_df = pd.concat(test_df, ignore_index=True)[['fname', 'source', 'class', 'time']]\\\n", "        .sample(frac=1, random_state=0).reset_index(drop=True)\n", "    return train_df, val_df, test_df\n", "\n", "def describe(df):\n", "    \"\"\" Print basic statistics about a data split \"\"\"\n", "    print(f'# rows : {len(df)}')\n", "\n", "    # number of files contributed by the single largest speaker per language\n", "    largest_source_df = df[['class', 'source']]\\\n", "        .groupby('class')\\\n", "        .agg({'source' : lambda x : max(Counter(x).values())})\\\n", "        .reset_index(drop=False)\\\n", "        .rename(columns={'source' : 'largest_source'})\n", "\n", "    # number of files per language\n", "    n_files_df = pd.DataFrame(list(df['class'].value_counts().items()), columns=['class', 'n_files'])\n", "\n", "    # number of unique speakers per language\n", "    n_sources_df = df.groupby('class')\\\n", "        .agg({'source' : lambda x : len(set(x))})\\\n", "        .reset_index(drop=False)\\\n", "        .rename(columns={'source' : 'n_sources'})\n", "\n", "    stats = pd.merge(\n", "        pd.merge(n_files_df, n_sources_df, how='inner', on='class'),\n", "        largest_source_df, how='inner', on='class')\n", "\n", "    stats = stats.sort_values('class')\n", "\n", "    print(stats.to_string(index=False))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "describe(metadata)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df, val_df, test_df = train_val_test_split_by_source(\n", "    metadata, val_split=val_split, test_split=test_split)" ] }
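, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the whole point of splitting by source is that no speaker appears in more than one split, it is worth asserting that directly. This is a minimal sanity check using the \"source\" column created above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# verify that no speaker source appears in more than one split\n", "train_sources = set(train_df['source'])\n", "val_sources = set(val_df['source'])\n", "test_sources = set(test_df['source'])\n", "\n", "assert not (train_sources & val_sources)\n", "assert not (train_sources & test_sources)\n", "assert not (val_sources & test_sources)\n", "print('No speaker overlap across the train, validation, and test splits')" ] }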
, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "describe(train_df)\n", "describe(val_df)\n", "describe(test_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the metadata for the train, validation, and test splits to CSV. These manifests contain only the metadata, but they will be used to locate and load the audio files during training." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.to_csv(os.path.join(voxforge_dir, 'train_manifest.csv'), index=False)\n", "val_df.to_csv(os.path.join(voxforge_dir, 'val_manifest.csv'), index=False)\n", "test_df.to_csv(os.path.join(voxforge_dir, 'test_manifest.csv'), index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Upload data to S3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Upload all of the data (metadata + audio files) to the default SageMaker S3 bucket." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sagemaker" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sess = sagemaker.Session()\n", "bucket_name = sess.default_bucket()\n", "print(f\"Bucket name : {bucket_name}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sess.upload_data(voxforge_dir, key_prefix=voxforge_dir)" ] }
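, { "cell_type": "markdown", "metadata": {}, "source": [ "`upload_data` writes to the default bucket under the given key prefix, so the dataset should now live at the S3 URI printed below. Keep this URI handy, since the training notebook will read the manifests and audio files from it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# S3 location of the uploaded dataset, used by the training notebook\n", "print(f's3://{bucket_name}/{voxforge_dir}')" ] } ], "metadata": { "kernelspec": { "display_name": "conda_pytorch_p36", "language": "python", "name": "conda_pytorch_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" } }, "nbformat": 4, "nbformat_minor": 4 }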