{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Training a custom AR-CNN model \n", "In this Jupyter notebook, we guide you through several steps of the data science life cycle. We explain how to acquire the data that you use for this project, \n", "provide some exploratory data analysis (**EDA**), and show how we augment the data during training. \n", "\n", "We also walk through the project's model architecture. Finally, we explain how to use the trained model to perform inference, the results of which you can submit to the *Spin the model* Chartbusters challenge.\n", "\n", "## Getting started\n", "Music generation using machine learning (ML) has been an active area of research for the ML community.\n", "\n", "### Prerequisites\n", "If you aren’t familiar with generative AI or the autoregressive convolutional neural network (AR-CNN) model, we recommend reading the following before using this notebook: \n", "\n", "1. [Learn the basics of generative AI](https://d32g4xocucupjo.cloudfront.net/#welcome)\n", "2. [Introduction to autoregressive convolutional neural networks](https://console.aws.amazon.com/deepcomposer/home?region=us-east-1#learningCapsules/autoregressive)\n", "3. [A deep dive into training an AR-CNN model](https://console.aws.amazon.com/deepcomposer/home?region=us-east-1#learningCapsules/deeperDiveIntoARCNN)\n", "\n", "\n", "\n", "## Using generative AI to create music \n", "There have been two primary approaches to generating music using ML techniques. In the first approach, the problem of music generation is treated as an image generation problem. In the second approach, music generation is treated like a time-series problem. To solve these kinds of problems, musicians and data scientists have traditionally used convolutional neural network (CNN) modeling techniques. For example: \n", "\n", "- The [Google Bach Doodle](https://www.google.com/doodles/celebrating-johann-sebastian-bach) used an algorithm called [CocoNET](https://magenta.tensorflow.org/coconet) to generate music using generative AI.\n", "\n", "- The [MuseGAN](https://openai.com/blog/musegan/) model is based on a generative adversarial network (GAN) approach.\n", "\n", "### Using autoregressive approaches for image generation\n", "In a traditional autoregressive approach, you condition an upcoming value on the values that came before it. This kind of approach helps create realistic images. For example:\n", "\n", "- The **PixelCNN** is a type of autoregressive generative model. It predicts an image pixel based off of all previously generated image pixels.\n", "\n", "- The **Orderless NADE** approach is similar to the PixelCNN approach except that generation of the pixels is *ordering invariant*, meaning that the generation of the next pixel doesn't necessarily have to be linear.\n", "\n", "### The AWS DeepComposer approach to generating music \n", "Autoregressive-based approaches are prone to accumulate errors during training. To help mitigate this problem, we train our AR-CNN model so that it can detect and then fix mistakes, including those made by the model itself.\n", "\n", "We do this by treating music generation as a series of *edit events*, which can be either the addition or removal of a note. An *edit sequence* is a series of edit events. 
Every edit sequence corresponds directly to a piano roll.\n", "\n", "By training our model to view the problem as edit events rather than as an entire image or just the addition of notes, we found that it can offset the accumulation of errors and generate higher-quality music.\n", "\n", "Now that you understand the basic theory behind our approach, let’s dive into the code. In the next section, we show examples of the piano roll format that we use for training the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing dependencies\n", "First, let's install and import all of the Python packages that we will use in this tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "# The MIT-Zero License\n", "\n", "# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.\n", "\n", "# Permission is hereby granted, free of charge, to any person obtaining a copy\n", "# of this software and associated documentation files (the \"Software\"), to deal\n", "# in the Software without restriction, including without limitation the rights\n", "# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n", "# copies of the Software, and to permit persons to whom the Software is\n", "# furnished to do so.\n", "\n", "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n", "# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n", "# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n", "# THE SOFTWARE.\n", "\n", "\n", "# Create the environment and install required packages\n", "!pip install -r requirements.txt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "# Imports\n", "import os\n", "import glob\n", "import json\n", "import numpy as np\n", "import keras\n", "from enum import Enum\n", "from keras.models import Model\n", "from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate, BatchNormalization, Dropout\n", "from keras.optimizers import Adam, RMSprop\n", "from keras import backend as K\n", "from random import randrange\n", "import random\n", "import math\n", "import pypianoroll\n", "from utils.midi_utils import play_midi, plot_pianoroll, get_music_metrics, process_pianoroll, process_midi\n", "from constants import Constants\n", "from augmentation import AddAndRemoveAPercentageOfNotes\n", "from data_generator import PianoRollGenerator\n", "from utils.generate_training_plots import GenerateTrainingPlots\n", "from inference import Inference\n", "from model import OptimizerType\n", "from model import ArCnnModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the data \n", "In this tutorial, we use the [`JSB-Chorales-dataset`](http://www-etud.iro.umontreal.ca/~boulanni/icml2012). The link contains pickled files that you can convert to MIDI. If you don't want to convert the files yourself, we recommend using the `jsb-chorales-midis` repository. Below, you will find the steps for uploading these files. 
\n", "\n", "- A chorale is a hymn usually with one voice singing a simple melody and three lower voices providing harmony. \n", "\n", "In this dataset, the voices are represented by four individual piano tracks.\n", "\n", "### Uploading files from the `jsb-chorales-midis` repository\n", "\n", "1. Use this link, [http://www-etud.iro.umontreal.ca/~boulanni/JSB%20Chorales.zip](http://www-etud.iro.umontreal.ca/~boulanni/JSB%20Chorales.zip) to download a .zip file containing the .mid files from the dataset.\n", "\n", "2. Upload the data into the data directory using the Jupyter console.\n", "\n", "3. Unzip the contents of that directory by uncommenting and then running the cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!unzip data/JSB\\ Chorales.zip -d data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Change the string in the the `data_dir` variable to the correct file path. \n", " \n", "5. Run the code cell below\n", "\n", "If your dataset has been successfully uploaded, you should be able to play a track after you have ran the next code cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "#Import the MIDI files from the data_dir and save them with the midi_files variable \n", "data_dir = 'JSB Chorales/**/*.mid'\n", "midi_files = glob.glob(data_dir)\n", "\n", "#Finds our random MIDI file from the midi_files variable and then plays it\n", "#Note: To listen to multiple samples from the Bach dataset, you can run this cell over and over again. \n", "random_midi = randrange(len(midi_files))\n", "play_midi(midi_files[random_midi])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing the data into the *piano roll* format\n", "In this tutorial, we represent music from the JSB-Chorales-dataset using a [piano roll format](https://en.wikipedia.org/wiki/Piano_roll). A *piano roll* is an image-based representation of music that shows music as a two-dimensional matrix, where *time* is on the horizontal axis and *pitch* is on the vertical axis.\n", "\n", "The presence of a pixel in a cell in the grid indicates that a note is played at that particular time and pitch interval.\n", "\n", "### Reviewing sample piano rolls \n", "Let's look at a few piano rolls from our dataset. Each track comprises 128 discrete timesteps with a variable number of pitches.\n", "\n", "\"Dataset\n", "\n", "### Merging piano rolls into a merged piano roll\n", "To train the AR-CNN, we need to merge each of these piano roll tracks into a single *merged piano roll*.\n", "\n", "\"Merged\n", "\n", "Using an image-based representation of music is common in many machine learning (ML) applications that involve music. We explain how we apply the piano roll images later on in the notebook. \n", "\n", "\n", "### Why do we use 128 timesteps?\n", "In this tutorial, we use 8-[bar](https://en.wikipedia.org/wiki/Bar_(music)) samples from the dataset. We subdivide those 8 bars into 128 timesteps. That's because each of the 8 bars contains 4 beats. We further divide each beat into 4 timesteps. 
\n", "\n", "This yields 128 timesteps:\n", "\n", "$$ \\frac{4\\;timesteps}{1\\;beat} * \\frac{4\\;beats}{1\\;bar} * \\frac{8\\;bars}{1} = 128\\;timesteps $$\n", "\n", "We found that this level of resolution is sufficient to capture the musical details in our dataset.\n", "\n", "### Creating samples of uniform size (shape) for model training \n", "\n", "For model training, the *input piano rolls* must be the same size. As you saw when we used the `play_midi` function, each sample isn't the same length. We use two functions to create *target piano rolls* that are the same size: `process_midi` and `process_pianoroll`. These functions are wrapped in a larger function, `generate_samples`, which also takes in constants that are related to subdividing the .mid files.\n", "\n", "#### In the code cells below:\n", "- `generate_samples` is a function used to ingest the midi files and break the files down into a uniform shape\n", "- `plot_pianoroll` uses a built in function `plot_track` from the [`pypianoroll`](https://salu133445.github.io/pypianoroll/visualization.html) library to plot a piano roll track from the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate MIDI file samples\n", "def generate_samples(midi_files, bars, beats_per_bar, beat_resolution, bars_shifted_per_sample):\n", " \"\"\"\n", " dataset_files: All files in the dataset\n", " return: piano roll samples sized to X bars\n", " \"\"\"\n", " timesteps_per_nbars = bars * beats_per_bar * beat_resolution\n", " time_steps_shifted_per_sample = bars_shifted_per_sample * beats_per_bar * beat_resolution\n", " samples = []\n", " for midi_file in midi_files:\n", " pianoroll = process_midi(midi_file, beat_resolution) # Parse the MIDI file and get the piano roll\n", " samples.extend(process_pianoroll(pianoroll, time_steps_shifted_per_sample, timesteps_per_nbars))\n", " return samples" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Saving the generated samples into a dataset variable \n", "dataset_samples = generate_samples(midi_files, Constants.bars, Constants.beats_per_bar,Constants.beat_resolution, Constants.bars_shifted_per_sample)\n", "# Shuffle the dataset\n", "random.shuffle(dataset_samples);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize a random piano roll from the dataset \n", "random_pianoroll = dataset_samples[randrange(len(dataset_samples))]\n", "plot_pianoroll(pianoroll = random_pianoroll,\n", " beat_resolution = 4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Augmenting data during training\n", "In computer vision, data augmentation is traditionally used to increase the number of training samples and make the training data more robust. In the AR-CNN model, data augmentation is a critical component of training. We use data augmentation to generate piano rolls that teach the model to add and remove notes to generate compositions.\n", "\n", "To augment the data during training, the model uses *input-target piano roll* pairs.\n", "\n", "The data generator, a built in function of our model, creates the modified *input piano rolls* by adding and removing notes from the original *target piano roll*. With this approach, the model learns where *edit events* are needed during training in order to recreate the *target piano roll*. An *edit event* is simply an opportunity for the model to add or remove a note. 
The input piano rolls used during training represent an input melody that you might provide (one that might have missing or off-key notes).\n", "\n", "### Adding or removing notes during training\n", "The data generator creates the input-target piano roll pairs by randomly adding and removing notes from the target piano rolls, creating new input piano rolls. During each [epoch](https://docs.aws.amazon.com/deepcomposer/latest/devguide/deepcomposer-basic-concepts.html#term-epoch), or training iteration, the data generator adds or removes different notes from the target piano rolls to create new randomly generated input piano rolls. You can control how many input piano rolls are generated per target piano roll by changing the value of the `samples_per_ground_truth_data_item` variable in the `constants.py` file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Removing random notes from a target piano roll to create input piano rolls\n", "To create the input piano rolls, the data generator randomly removes notes from the target piano roll. During training, the model tries to learn that it needs to *add* these notes back into the input piano rolls. \n", "\n", "The model removes a percentage of notes by sampling from a uniform distribution between *a lower and an upper bound*: \n", "\n", "- The default lower bound, `sampling_lower_bound_remove`, is set to 0%. We choose a lower bound of 0% so the model can detect an input piano roll that is identical to a target piano roll.\n", " \n", "- The default upper bound, `sampling_upper_bound_remove`, is set to 100%. Choosing an upper bound of 100% allows the model to learn how to generate music from scratch when it encounters input piano rolls from which all of the notes have been removed.\n", "\n", ">For more information about this process, see the [AR-CNN deep dive learning capsule](https://console.aws.amazon.com/deepcomposer/home?region=us-east-1#learningCapsules/deeperDiveIntoARCNN).\n", "\n", "![SegmentLocal](images/removenotes.gif \"segment\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding random notes to the target piano roll to create input piano rolls \n", "\n", "To create the input piano roll, the data generator also randomly adds notes to the target piano roll. The model learns that it needs to remove these notes from the input piano roll to recreate the target piano roll.\n", "\n", "The model adds a percentage of notes by sampling from a uniform distribution between *a lower and an upper bound*: \n", "\n", "- The default lower bound, `sampling_lower_bound_add`, is set to 0%. This represents an input piano roll that is no different from the target piano roll.\n", "- The default upper bound, `sampling_upper_bound_add`, is set to 1.5%, based on experimentation. This might seem small, but because the percentage is based on the total number of empty pixels (which is usually far greater than the number of notes), the upper bound ends up being sufficiently large. \n", " \n", ">To learn more about this process, see [A deep dive into training an AR-CNN model](https://console.aws.amazon.com/deepcomposer/home?region=us-east-1#learningCapsules/deeperDiveIntoARCNN).\n", "\n", "![SegmentLocal](images/addnotes.gif \"segment\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sampling from a uniform distribution\n", "When randomly adding or removing notes, the model samples the percentage of notes to change from a uniform distribution, so each pass over a target piano roll can produce a different input piano roll. 
By sampling from a uniform distribution, the model learns how to fill in or remove different percentages of notes. This helps the model learn how to recreate the target piano roll from the input piano roll, no matter what the current state of the input piano roll is. This is useful during the iterative inference process, which we describe in more detail in the Inference section.\n", "\n", "If you want to change the percentages of notes added or removed, edit and run the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sampling_lower_bound_remove = 0 \n", "sampling_upper_bound_remove = 100\n", "sampling_lower_bound_add = 1\n", "sampling_upper_bound_add = 1.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating the loss function\n", "\n", "During data augmentation, the model both adds and removes notes from the *target piano roll*. We want the model to correctly pick the next *edit event*. An edit event is simply an opportunity for the model to add or remove a note from the *input piano roll*. These edit events occur when the model should update the distribution in the *input piano roll* to sound more like the distribution of the *target piano roll*.\n", "\n", ">**NOTE**: The model can pick any of the notes that were added or removed during data augmentation.\n", "\n", "The notes that were added or removed during data augmentation represent the *symmetric difference* between the input piano roll and the target piano roll. For example, imagine you have two bowls of candy. In bowl one, you have three blue candies and one red candy. In bowl two, you have only three blue candies. In this example, the symmetric difference is the single red candy. In our model, notes that have been added or removed are \"red candies.\" \n", "\n", "The difference between the model's output distribution and a uniform distribution over the symmetric difference can be calculated as the *Kullback–Leibler (KL) divergence*. Our loss function is therefore the KL divergence between a uniform distribution over all of the pixel (note) probabilities in the symmetric difference and the model's output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Customized loss function\n", "class Loss():\n", "    @staticmethod \n", "    def built_in_softmax_kl_loss(target, output):\n", "        '''\n", "        Custom Loss Function\n", "        :param target: ground truth values\n", "        :param output: predicted values\n", "        :return: kullback_leibler_divergence loss\n", "        '''\n", "        target = K.flatten(target)\n", "        output = K.flatten(output)\n", "        target = target / K.sum(target)\n", "        output = K.softmax(output)\n", "        return keras.losses.kullback_leibler_divergence(target, output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model architecture\n", "\n", "Our model architecture is adapted from the *U-Net architecture* pictured below. This architecture is a popular CNN that is used for computer vision. 
It consists of three major components:\n", "\n", "- An *encoder network* that takes an input piano roll as input and encodes it in a lower-dimensional *latent space*\n", "- A *decoder network* that decodes that smaller *latent space* back out to a piano roll with the same shape as the input\n", "- A *single-track piano roll input*, which is a single-melody track of size (128, 128, 1) => (TimeStep, NumPitches, NumTracks) that is provided as the input to the model\n", "\n", ">**NOTE**: If you're using the Bach dataset, this input is a merged piano roll track, as previously discussed. However, if you're training the model using samples that are all only a single track, the input really is a single-track piano roll input.\n", "\n", "*(Figure: the U-Net-style model architecture.)*\n", "\n", "### Creating the model architecture\n", "\n", "The code for the U-Net architecture that we have developed for this AR-CNN is in `model.py`. For a higher-level view of the layers used in this neural network, see the output of the `model = MusicModel.build_model()` code cell below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "\n", "To train the model, we split the dataset into training and validation sets. We hold out 10% of the data for the validation set. You can change this parameter by changing the `training_validation_split` variable in `constants.py`. \n", "\n", "As stated previously, the *input-target piano roll* pairs are generated during training. During each training epoch, different notes are added to and removed from the target piano roll to create different input piano rolls." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_size = len(dataset_samples)\n", "dataset_split = math.floor(dataset_size * Constants.training_validation_split) \n", "\n", "training_samples = dataset_samples[0:dataset_split]\n", "print(\"training samples length: {}\".format(len(training_samples)))\n", "# The remaining samples form the validation set\n", "validation_samples = dataset_samples[dataset_split:dataset_size]\n", "print(\"validation samples length: {}\".format(len(validation_samples)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specifying training hyperparameters \n", "\n", "The AR-CNN model's hyperparameters are defined in the following cell. If you're unfamiliar with the structure of CNNs, see the convolutional neural networks topic in [Dive into Deep Learning](https://d2l.ai/chapter_convolutional-neural-networks/index.html). \n", "\n", "You might consider changing the dropout values for the encoder and decoder layers: `dropout_rate_encoder` and `dropout_rate_decoder`, respectively. When training a neural network, dropout is an important tool for addressing the [bias-variance trade-off](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/statistics.html#the-bias-variance-trade-off).\n", "\n", ">**NOTE** \n", ">If you want to test that your model is training on your custom dataset, you can decrease `epochs` to **1** in the cell below."
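, "\n", "\n", "If it helps to see how hyperparameters such as `num_filters`, `growth_factor`, the dropout lists, and the batch-norm flags (all set in the next cell) map onto layers, here is a toy, single-level U-Net sketch built with the Keras layers imported at the top of this notebook. It is illustrative only; the real architecture is defined in `model.py` and is built by `ArCnnModel`:\n", "\n", "```python\n", "from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate, BatchNormalization, Dropout\n", "from keras.models import Model\n", "\n", "inputs = Input(shape=(128, 128, 1))\n", "\n", "# Encoder level: convolve, optionally batch-normalize and apply dropout, then downsample.\n", "enc = Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)\n", "enc = BatchNormalization()(enc)\n", "enc = Dropout(0.5)(enc)\n", "down = MaxPooling2D((2, 2))(enc)\n", "\n", "# Bottleneck with growth_factor (here 2) times more filters.\n", "bottleneck = Conv2D(32 * 2, (3, 3), activation='relu', padding='same')(down)\n", "\n", "# Decoder level: upsample, concatenate the skip connection, and convolve back down.\n", "up = UpSampling2D((2, 2))(bottleneck)\n", "up = concatenate([up, enc])\n", "dec = Conv2D(32, (3, 3), activation='relu', padding='same')(up)\n", "\n", "outputs = Conv2D(1, (1, 1), padding='same')(dec)\n", "toy_unet = Model(inputs, outputs)\n", "toy_unet.summary()\n", "```\n", "\n", "As the comments in the next cell describe, the real model stacks `num_layers` of these encoder and decoder levels, growing the filter count by `growth_factor` at each convolution.\n"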
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Piano Roll Input Dimensions\n", "input_dim = (Constants.bars * Constants.beats_per_bar * Constants.beat_resolution, \n", " Constants.number_of_pitches, \n", " Constants.number_of_channels)\n", "# Number of Filters In The Convolution\n", "num_filters = 32\n", "# Growth Rate Of Number Of Filters At Each Convolution\n", "growth_factor = 2\n", "# Number Of Encoder And Decoder Layers\n", "num_layers = 5\n", "# A List Of Dropout Values At Each Encoder Layer\n", "dropout_rate_encoder = [0, 0.5, 0.5, 0.5, 0.5]\n", "# A List Of Dropout Values At Each Decoder Layer\n", "dropout_rate_decoder = [0.5, 0.5, 0.5, 0.5, 0]\n", "# A List Of Flags To Ensure If batch_normalization Should be performed At Each Encoder\n", "batch_norm_encoder = [True, True, True, True, False]\n", "# A List Of Flags To Ensure If batch_normalization Should be performed At Each Decoder\n", "batch_norm_decoder = [True, True, True, True, False]\n", "# Path to Pretrained Model If You Want To Initialize Weights Of The Network With The Pretrained Model\n", "pre_trained = False\n", "# Learning Rate Of The Model\n", "learning_rate = 0.001\n", "# Optimizer To Use While Training The Model\n", "optimizer_enum = OptimizerType.ADAM\n", "# Batch Size\n", "batch_size = 32\n", "# Number Of Epochs\n", "epochs = 500" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The Number of Batch Iterations Before A Training Epoch Is Considered Finished\n", "steps_per_epoch = int(\n", " len(training_samples) * Constants.samples_per_ground_truth_data_item / int(batch_size))\n", "\n", "print(\"The Total Number Of Steps Per Epoch Are: \"+ str(steps_per_epoch))\n", "\n", "# Total Number Of Time Steps\n", "n_timesteps = Constants.bars * Constants.beat_resolution * Constants.beats_per_bar" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating the data generators that perform data augmentation\n", "\n", "To create the *input piano rolls* during training, we need data generators for both the training and validation samples. For our purposes, we use a custom data generator to perform data augmentation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Training Data Generator\n", "training_data_generator = PianoRollGenerator(sample_list=training_samples,\n", " sampling_lower_bound_remove = sampling_lower_bound_remove,\n", " sampling_upper_bound_remove = sampling_upper_bound_remove,\n", " sampling_lower_bound_add = sampling_lower_bound_add,\n", " sampling_upper_bound_add = sampling_upper_bound_add,\n", " batch_size = batch_size,\n", " bars = Constants.bars,\n", " samples_per_data_item = Constants.samples_per_ground_truth_data_item,\n", " beat_resolution = Constants.beat_resolution,\n", " beats_per_bar = Constants.beats_per_bar,\n", " number_of_pitches = Constants.number_of_pitches,\n", " number_of_channels = Constants.number_of_channels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Validation Data Generator\n", "validation_data_generator = PianoRollGenerator(sample_list = validation_samples,\n", " sampling_lower_bound_remove = sampling_lower_bound_remove,\n", " sampling_upper_bound_remove = sampling_upper_bound_remove,\n", " sampling_lower_bound_add = sampling_lower_bound_add,\n", " sampling_upper_bound_add = sampling_upper_bound_add,\n", " batch_size = batch_size, \n", " bars = Constants.bars,\n", " samples_per_data_item = Constants.samples_per_ground_truth_data_item,\n", " beat_resolution = Constants.beat_resolution,\n", " beats_per_bar = Constants.beats_per_bar, \n", " number_of_pitches = Constants.number_of_pitches,\n", " number_of_channels = Constants.number_of_channels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating callbacks for the model\n", "\n", "Callbacks are used to get a view of the internal states and statistics of the model during training. This can include saving the model as in a checkpoint file after each successful epoch, monitoring the change in loss from the training and validation sets, generating plots, and adjusting the learning rates over time.\n", "\n", "In our model we use two callbacks to:\n", " \n", "1. Create *training vs validation* loss plots during training.\n", " These graphs plot the loss after each epoch of the training and validation sets is complete. \n", " - The loss values can vary widely based on the parameters that you have chosen and the dataset that you use. \n", "\n", "2. Save model checkpoints based on the *best validation loss*. We save the best model checkpoint so that it can be used for inference." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Callback For Loss Plots \n", "plot_losses = GenerateTrainingPlots()\n", "## Checkpoint Path\n", "checkpoint_filepath = 'checkpoints/-best-model-epoch:{epoch:04d}.hdf5'\n", "\n", "# Callback For Saving Model Checkpoints \n", "model_checkpoint_callback = keras.callbacks.ModelCheckpoint(\n", " filepath=checkpoint_filepath,\n", " save_weights_only=False,\n", " monitor='val_loss',\n", " mode='min',\n", " save_best_only=True)\n", "\n", "# Create A List Of Callbacks\n", "callbacks_list = [plot_losses, model_checkpoint_callback]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create A Model Instance\n", "MusicModel = ArCnnModel(input_dim = input_dim,\n", " num_filters = num_filters,\n", " growth_factor = growth_factor,\n", " num_layers = num_layers,\n", " dropout_rate_encoder = dropout_rate_encoder,\n", " dropout_rate_decoder = dropout_rate_decoder,\n", " batch_norm_encoder = batch_norm_encoder,\n", " batch_norm_decoder = batch_norm_decoder,\n", " pre_trained = pre_trained,\n", " learning_rate = learning_rate,\n", " optimizer_enum = optimizer_enum)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = MusicModel.build_model()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting training \n", "\n", "In the following cell, you start training your model.\n", "\n", ">**NOTE**: Training times can vary greatly based on the parameters that you have chosen and the notebook instance type that you chose when launching this notebook. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Start Training\n", "history = model.fit_generator(training_data_generator,\n", " validation_data = validation_data_generator,\n", " steps_per_epoch = steps_per_epoch,\n", " epochs = epochs,\n", " callbacks = callbacks_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performing inference \n", "\n", "Congratulations! You have now trained your very own AR-CNN model to generate music. Now you can see how well your model will perform with an input melody. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How to change the *inference parameters* when you perform inference \n", "\n", "The model performs inference by sampling from its predicted probability distribution across the entire piano roll. \n", "\n", "Inference is an iterative process. After adding or removing a note from the input, the model feeds this new input back into itself. The model has been trained to both remove and add notes, so it can improve the input melody and correct mistakes that it may have made in earlier iterations.\n", "\n", "You also can change the *inference parameters* to observe differences in the quality of the music generated: \n", "\n", "- Sampling iterations (`samplingIterations`): The number of iterations performed during inference. A higher number of sampling iterations gives the model more time to improve the input melody.\n", "\n", "- Maximum notes to remove (`maxPercentageOfInitialNotesRemoved`): The maximum percentage of notes that can be removed during inference. Setting this value to 0% prevents the model from removing notes from your input melody.\n", "\n", "- Maximum notes to add (`maxNotesAdded`): The maximum percentage of notes that can be added during inference. 
Setting this value to 0% means that no notes will be added to your input melody.\n", "\n", ">**NOTE:** If you restrict your model's ability to add and remove notes, you risk creating poor compositions. \n", "\n", "- Creativity (`temperature`): To create the output probability distribution, the final layer uses a softmax activation. You can change the temperature for the softmax to produce different levels of creativity in the outputs generated by the model.\n", "\n", "\n", "#### To change the inference parameters:\n", "\n", "1. Open the `inference_parameters.json` file. \n", "2. Update the variables.\n", "3. Save and close the `inference_parameters.json` file. \n", "4. Run the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load The Inference-Related Parameters\n", "with open('inference_parameters.json') as json_file:\n", "    inference_params = json.load(json_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading a saved checkpoint file\n", "\n", "To use your trained model, you will need to update the `checkpoint_var` variable in the cell below. To see the checkpoint files that you have created, uncomment and then run the following cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# !ls -ltr checkpoints/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next code cell, replace the string in the `checkpoint_var` variable with your checkpoint's filename, for example `checkpoints/foo-bar.hdf5`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create An Inference Object\n", "inference_obj = Inference()\n", "# Load The Checkpoint\n", "checkpoint_var = 'checkpoints/-best-model-epoch:0001.hdf5'\n", "inference_obj.load_model(checkpoint_var) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To choose a new input melody\n", "\n", "1. In a different tab, switch back to the Jupyter console.\n", "2. Open the `sample_inputs` directory. It contains six sample melodies. \n", "3. Note the name of the file that you want to use, for example, 'new_world.midi'. \n", "4. In the following cell, replace **'sample_inputs/ode_to_joy.midi'** with the name of your file, and then run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate The Composition\n", "input_melody = 'sample_inputs/ode_to_joy.midi' \n", "inference_obj.generate_composition(input_melody, inference_params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### To listen to your composition\n", "\n", "1. Run the following cell. It lists your compositions. \n", "2. Note the filename of the composition that you want to listen to, for example, \"output_2.mid\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls -ltr outputs/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. In the following cell, replace **'outputs/output_0.mid'** with the name of your file, and then run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "output_melody = 'outputs/output_0.mid'\n", "play_midi(output_melody)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">**NOTE**: Compositions are automatically saved. To download a composition that you have created, open the `outputs` directory. Choose your composition, and then choose **Download**. 
![downloading-your-compositions](images/download-composition.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating your results\n", "\n", "Now that you've generated a composition, let's find out how you did by running the code cells below. They provide some model metrics and a visualization of the piano rolls that were created. \n", "\n", "We'll analyze the composition using the following metrics: \n", "\n", "- *Empty bar rate:* The ratio of empty bars to the total number of bars. \n", "- *Pitch histogram distance:* A measure of the distribution and position of pitches.\n", "- *In scale ratio:* The ratio of the number of notes in the C major scale to the total number of notes. \n", "\n", "\n", "### Visualizing the results\n", "After computing the metrics, let's also visualize the *input piano roll* and compare it with the generated output piano roll to see which notes have been changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Input MIDI Metrics\n", "print(\"The input midi metrics are:\")\n", "get_music_metrics(input_melody, beat_resolution=4)\n", "\n", "print(\"\\n\")\n", "# Generated Output MIDI Metrics\n", "print(\"The generated output midi metrics are:\")\n", "get_music_metrics(output_melody, beat_resolution=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert The Input And Generated MIDI Files To Piano Roll Matrices\n", "input_pianoroll = process_midi(input_melody, beat_resolution=4)\n", "output_pianoroll = process_midi(output_melody, beat_resolution=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot Input Piano Roll\n", "plot_pianoroll(input_pianoroll, beat_resolution=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot Output Piano Roll\n", "plot_pianoroll(output_pianoroll, beat_resolution=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submitting to the *Spin the model* Chartbusters challenge\n", "\n", "To submit your composition(s) and model to the *Spin the model* Chartbusters challenge, you will first need to create a public repository on [GitHub](https://github.com/). Then download your notebook, checkpoint files, and compositions from SageMaker, and upload them to your public repository. Use the link from your public repository to make your submission to the Chartbusters challenge! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning up \n", "\n", "After completing this notebook, make sure that you stop your Amazon SageMaker notebook instance so that you don't incur unexpected costs. \n", "\n", "#### To stop an Amazon SageMaker notebook instance \n", "\n", "1. Open the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/dashboard).\n", "\n", "2. In the navigation pane, choose **Notebook instances**.\n", "\n", "3. Choose the notebook instance that you want to stop. \n", "\n", "4. From the **Actions** menu, choose **Stop**.\n", "\n", ">**NOTE**: When your notebook instance stops, its status changes from **In service** to **Stopped**. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# More info\n", "\n", "For more open-source implementations of generative models for music, see the following:\n", "\n", "- [MuseNet](https://openai.com/blog/musenet/): Uses GPT2, a large-scale Transformer model, to predict the next token in a sequence\n", "- [Jukebox](https://openai.com/blog/jukebox/): Uses various neural nets to generate music, including rudimentary singing, as raw audio in a variety of genres and artist styles\n", "- [Music Transformer](https://github.com/tensorflow/magenta/tree/master/magenta/models/score2perf): Uses transformers to generate music\n", "\n", "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "1. [MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment.](https://arxiv.org/abs/1709.06298)\n", "2. [MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation.](https://arxiv.org/abs/1703.10847)\n", "3. [A Hierarchical Recurrent Neural Network for Symbolic Melody Generation.](https://pubmed.ncbi.nlm.nih.gov/31796422/)\n", "4. [Counterpoint by Convolution](https://arxiv.org/abs/1903.07227)\n", "5. [MusicTransformer:Generating Music With Long-Term Structure](https://arxiv.org/abs/1809.04281)\n", "6. [Conditional Image Generation with PixelCNN Decoders](https://arxiv.org/abs/1606.05328)\n", "7. [Neural Autoregressive Distribution Estimation](https://arxiv.org/abs/1605.02226)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "conda_tensorflow_p36", "language": "python", "name": "conda_tensorflow_p36" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }