We compiled our own dataset for this project. The data consists of MIDI files collected online from sources whose licenses allow free use for research purposes. The collection now exceeds 100,000 songs and has been pushed to GitHub here, mostly organized by genre. We use the Python library mido to read the MIDI files.
A MIDI file comprises an array of messages. Each note message consists of the channel number (0 to 15); the note value; the velocity, representing how hard a note was hit; whether the message indicates that the note is being turned on or off; and a time delta, representing the amount of time (in units of ticks) since the previous message on the same track. MIDI files also include other metadata such as the tempo and time signature.
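To make the tick-based time deltas concrete, here is a minimal sketch of how deltas could be turned into absolute times in seconds. It assumes a single constant tempo and a resolution of 480 ticks per beat; the message list and the function names are hypothetical and are not part of mido.

```python
# Hypothetical (delta_ticks, description) pairs standing in for parsed MIDI messages.
messages = [(0, "note_on C4"), (480, "note_off C4"), (240, "note_on E4")]


def ticks_to_seconds(ticks, ticks_per_beat, tempo_us_per_beat):
    """Convert a tick count to seconds; tempo is given in microseconds per beat."""
    return ticks * tempo_us_per_beat / (ticks_per_beat * 1_000_000)


def absolute_times(messages, ticks_per_beat=480, tempo_us_per_beat=500_000):
    """Accumulate per-message deltas into absolute times in seconds."""
    elapsed_ticks = 0
    times = []
    for delta_ticks, _ in messages:
        elapsed_ticks += delta_ticks
        times.append(ticks_to_seconds(elapsed_ticks, ticks_per_beat, tempo_us_per_beat))
    return times


print(absolute_times(messages))
```

A tempo of 500,000 microseconds per beat corresponds to 120 beats per minute, so 480 ticks last half a second in this example.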
These files were pre-processed into feature vectors from which the network could learn the aspects of music we detail in table 1. We determined that a minimum of three variables was necessary to learn these aspects: the note, the velocity (how hard the note is struck), and the time between the current and previous note.
The figure below shows a representation of a typical MIDI file and corresponding feature vectors:
We can extract these features from MIDI files by running:

    import numpy as np
    from mido import MidiFile


    def getFeatureVector(midiFileName):
        file = MidiFile(midiFileName)
        features = []
        timeSincePreviousNote = 0.0
        # Iterating over a MidiFile yields messages in playback order,
        # with message.time holding the delta in seconds
        for message in file:
            # Accumulate every delta, including non-note messages, so no time is lost
            timeSincePreviousNote += message.time
            # Only the start of a note produces a feature vector
            if message.type != 'note_on':
                continue
            # Don't use the percussion channel (MIDI channel 10, index 9 in mido)
            if message.channel == 9:
                continue
            features.append([message.note, timeSincePreviousNote, message.velocity])
            timeSincePreviousNote = 0.0
        return np.array(features)
The values in the feature vectors were further pre-processed using the StandardScaler in scikit-learn, which stores the mean and variance of each feature so that the scaling can be inverted afterwards on the predictions:

    from sklearn.preprocessing import StandardScaler

    # notes: array of feature vectors, one row per note
    scaler = StandardScaler()
    notes = scaler.fit_transform(notes)
We used Keras to implement the following network architecture:
| Layer (type) | Output shape | Number of parameters |
|---|---|---|
| lstm_1 (LSTM) | (None, 48) | 9984 |
| dense_1 (Dense) | (None, 24) | 1176 |
| dense_2 (Dense) | (None, 3) | 75 |
We can do this in practice by running:
    import keras
    from keras.layers import LSTM, Dense

    model = keras.Sequential()
    model.add(LSTM(48, input_shape=(3, 3)))  # input shape: (lookback, number of features)
    model.add(Dense(24))
    model.add(Dense(3, activation='linear'))
    model.compile(loss=loss_func, optimizer='sgd')  # see below for definition of loss_func
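The parameter counts reported for the three layers can be checked by hand. The sketch below uses the standard counting formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer; the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(units, input_dim):
    # One weight per input-output pair plus one bias per output.
    return input_dim * units + units


counts = [
    lstm_params(units=48, input_dim=3),    # lstm_1: 3 features per time step
    dense_params(units=24, input_dim=48),  # dense_1
    dense_params(units=3, input_dim=24),   # dense_2
]
print(counts, sum(counts))
```

Note that the look-back length does not affect the count, because the same LSTM weights are reused at every time step.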
A cell state, $C_t$, is passed from cell to cell and provides a mechanism for the network to learn long-term relationships in the data. Some information, however, is not needed in the long term, so the forget gate decides what information is thrown away from the previous cell state. Weights, $W_f$, are learned through the back-propagation algorithm, with a stochastic gradient descent optimizer, to determine which information should be forgotten from the cell state.
Input gates decide what new information we’re going to store in the cell state. Two temporary variables are used to calculate the new cell state, $C_t$. The variable $i_t$ represents which values in the cell state we are going to update. $\tilde{C}_t$ is a vector of candidate values for the new cell state.
We can then calculate the new cell state, $C_t$, by deleting the unimportant information and adding the new information.
In order to calculate the output of the cell, we first take the sigmoid value of the current input and previous cell’s output multiplied by learned weights, $W_o$. Then we multiply this vector by the $\tanh$ of the cell state to obtain the output, $h_t$.
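For reference, the gate computations described above are usually written as follows. The notation is an assumption borrowed from the standard LSTM formulation (as in Olah's post), not necessarily the exact form used for this project: $x_t$ is the current input, $h_{t-1}$ the previous output, $C_t$ the cell state, $\sigma$ the sigmoid function, $\odot$ element-wise multiplication, and $W$, $b$ the learned weights and biases of each gate.

$$
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$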
Our neural network is built with one LSTM layer, followed by two dense (fully connected) layers, as detailed in the table above. The network has a look-back of 3, so it uses the three previous feature vectors to predict the next. This architecture was chosen after multiple tests of its ability to predict correct notes after being fed a sequence of feature vectors. We adopted this relatively simple network because it gave the best balance between prediction accuracy and consumption of computational resources.
Read more about LSTMs here, from Google Brain’s Christopher Olah.
Most typical loss functions used in machine learning reward predictions close to the truth and penalize predictions far from it. This is a meaningful approach for velocity and time deltas, so we used the mean squared error loss for those two features. Musical notes, however, usually lie on certain scales and sound better when they stay on those scales. Usually, this means that striking the true note, or the note 4 or 7 semitones above it, will sound good, while others will not. This pattern also repeats every 12 notes (since there are 12 notes in an octave, -5 and -8 would be good predictions, rather than -4 and -7). Figure 3 shows the loss function, which was created by fitting a high-order polynomial, using scipy, to a set of points that satisfy these conditions. The three losses were combined by adding them together after weighting the velocity and time-delta losses at 20% of the amplitude of the note loss, which was deemed more important.
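The interval rule and the weighted sum can be sketched in plain Python. This is only an illustration under stated assumptions: the penalty values are hypothetical, a lookup table stands in for the fitted polynomial, and the 20% weighting is read here as a factor of 0.2 applied to each squared-error term.

```python
# Hypothetical penalties per interval (semitones above the true note, modulo 12).
INTERVAL_PENALTY = {0: 0.0, 4: 0.2, 7: 0.2}
OFF_SCALE_PENALTY = 1.0


def note_penalty(predicted_note, true_note):
    """Low penalty on the true note or 4/7 semitones above it, in any octave."""
    interval = (round(predicted_note) - round(true_note)) % 12
    return INTERVAL_PENALTY.get(interval, OFF_SCALE_PENALTY)


def combined_loss(pred, true, side_weight=0.2):
    """pred and true are (note, time_delta, velocity) triples."""
    time_error = (pred[1] - true[1]) ** 2
    velocity_error = (pred[2] - true[2]) ** 2
    return note_penalty(pred[0], true[0]) + side_weight * (time_error + velocity_error)


# -5 and -8 wrap around to 7 and 4, so they are treated as good predictions.
print([note_penalty(60 + d, 60) for d in (-8, -7, -5, -4, 0, 4, 7)])
```

Because Python's `%` always returns a non-negative result for a positive modulus, the negative offsets fold onto the same 12-note pattern without special handling.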
This can be implemented by manually choosing appropriate loss values as a function of the difference between the predicted note and the true note, like the plot below:
To run the model using a stochastic gradient descent optimizer, we need to make our loss function continuous and differentiable. We do this by fitting the points in the plot above to a high-order polynomial:
    import numpy as np

    # x: note differences, y: the hand-chosen loss values
    # (scipy.polyfit and scipy.poly1d were aliases of these NumPy functions and are deprecated)
    fit = np.poly1d(np.polyfit(x, y, deg=40))

We used a 40th-order polynomial, which suits our purposes because it has a large capacity for being “wiggly”. Note, though, that you should never use a polynomial of such high order if you are trying to make predictions directly with it!
This turns the loss function into this:
In Keras, we can simply run:
model.fit(X, Y, epochs=1, batch_size=3)
to train the model on X, representing feature vectors of shape (3, 3), and Y, representing the note, velocity, and time delta actually played next in the MIDI files (i.e. shape (3,)).
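One way to build such X and Y pairs from a sequence of feature vectors is a sliding window. The sketch below uses plain lists and made-up feature values; `make_windows` is a hypothetical helper, and in practice the results would be converted to NumPy arrays before being passed to `model.fit`.

```python
def make_windows(features, lookback=3):
    """Split a sequence of feature vectors into (X, Y) training pairs."""
    X, Y = [], []
    for start in range(len(features) - lookback):
        X.append(features[start:start + lookback])  # `lookback` consecutive vectors
        Y.append(features[start + lookback])        # the vector that follows them
    return X, Y


# Hypothetical [note, time_delta, velocity] vectors.
features = [
    [60, 0.0, 90],
    [62, 0.5, 80],
    [64, 0.5, 85],
    [65, 0.25, 70],
    [67, 0.25, 75],
]
X, Y = make_windows(features)
print(len(X), len(X[0]), len(X[0][0]), Y[0])
```

Each entry of X has shape (3, 3) and each entry of Y has shape (3,), matching the shapes described above.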
We can then generate music by running:
Y_hat = model.predict(X2)
where X2 contains new feature vectors. This call has to go in a loop, with X2 updated after every prediction, since X2 should include the previously predicted notes.
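A minimal sketch of that loop is shown below. To keep it runnable without a trained network, a toy function stands in for the model; `generate`, `toy_predict`, and the seed values are all hypothetical.

```python
def generate(predict_next, seed, n_steps):
    """Autoregressively extend `seed`, a list of `lookback` feature vectors."""
    window = [list(vector) for vector in seed]
    generated = []
    for _ in range(n_steps):
        next_vector = predict_next(window)   # stands in for the model's prediction
        generated.append(next_vector)
        window = window[1:] + [next_vector]  # drop the oldest vector, append the newest
    return generated


def toy_predict(window):
    """Toy stand-in for the trained model: repeat the last note one semitone higher."""
    note, delta, velocity = window[-1]
    return [note + 1, delta, velocity]


seed = [[60, 0.5, 80], [62, 0.5, 80], [64, 0.5, 80]]
print(generate(toy_predict, seed, 3))
```

With the Keras model, `predict_next` could wrap a call such as `model.predict(np.array([window]))[0]`, assuming the window has already been scaled in the same way as the training data.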
Now we need to invert the effect of the standard scaler, for example with the scaler's `inverse_transform` method.
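To show what this inversion amounts to, here is a dependency-free sketch of standardization and its inverse. It mirrors the idea behind scikit-learn's StandardScaler (per-column mean and population standard deviation) but is not its implementation; the function names and sample values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(column) for column in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(column) or 1.0 for column in columns]
    return means, stds


def transform(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def inverse_transform(rows, means, stds):
    return [[v * s + m for v, m, s in zip(row, means, stds)] for row in rows]


notes = [[60, 0.0, 90], [62, 0.5, 80], [64, 1.0, 70]]
means, stds = fit_scaler(notes)
scaled = transform(notes, means, stds)
restored = inverse_transform(scaled, means, stds)
print(restored)
```

The same `means` and `stds` fitted on the training data must be reused for the predictions, which is why the fitted scaler object is kept around.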
Finally, we need to turn this Y_hat into a MIDI file:
    import os

    import mido
    from mido import Message, MetaMessage, MidiFile, MidiTrack


    def featuresToMidi(features, fileName, directory="featureToMidi"):
        mid = MidiFile(type=0)
        track = MidiTrack()
        mid.tracks.append(track)

        # Store the tempo in the file so players use the same value as the conversion below
        tempo = mido.bpm2tempo(128)
        track.append(MetaMessage('set_tempo', tempo=tempo, time=0))

        # Each feature vector is [note, time since previous note (seconds), velocity]
        for note, deltaTime, velocity in features:
            # Convert the time delta from seconds to ticks at the file's own resolution
            ticks = mido.second2tick(abs(float(deltaTime)), mid.ticks_per_beat, tempo)
            track.append(Message(
                'note_on',
                # MIDI notes and velocities must be integers in the range 0 to 127
                note=min(127, max(0, int(round(float(note))))),
                velocity=min(127, max(0, int(round(float(velocity))))),
                channel=4,
                time=int(round(ticks)),
            ))

        # Create the output directory if needed, then write the file
        os.makedirs(directory, exist_ok=True)
        mid.save(os.path.join(directory, fileName))
You can listen to some of the results in the GitHub repo, but here’s an excerpt of the generated music.