Making Music: When Simple Probabilities Outperform Deep Learning

By Haebichan Jung

When a simple contestant competes against a deep army.
The Problem: how I discovered problems with using deep learning techniques to generate pop music.
The Solution: how I built an original music making machine that could rival deep learning but with simpler solutions.
Evaluation: how I created an evaluation metric that could show mathematically that my music “sounds more like pop” than music made by deep learning.
Generalization: how I discovered a way to generalize my own model, to be applied to situations other than generating music.

Please also check out the YouTube version of this project:

I made a simple probabilistic model that generates pop music. And with an objective metric, I can say that the model generates music that sounds more like pop than some of the music made by deep learning techniques. How did I do this? I did this partly by focusing on what I thought was at the heart of pop music: the statistical relationship between the harmony and the melody.

Melody is the vocals, the tunes. Harmony is the chords, the chord progression. On the piano, the melody is played by the right hand, and the harmony by the left.

Before delving into their relationship, let me first define the problem. I began this project with the simple desire to generate pop music using deep learning, or ‘A.I.’ as laymen call it. This quickly led me to LSTMs (Long Short-Term Memory units), a particular version of a Recurrent Neural Network (RNN) that is very popular for generating texts and making music.

But as I read more into the subject, I began to question the logic of applying RNNs and their variants to generate pop music. The logic seemed to be based on several assumptions about the internal structure of (pop) music that I did not fully agree with.

One specific assumption is the independent relationship between the harmony and the melody (description of the two is above).

Take for instance the 2017 publication from the University of Toronto: Song from Pi: A Musically Plausible Network for Pop Music Generation (Hang Chu, et al). In this article, the authors explicitly “assume…the chords are independent given the melody” (3, italics mine). Based on this specification, the authors build a complex and multi-layered RNN model. The melody has its own layer for generating notes (the key and the press layer), which is autonomous from the chord layer. On top of the independence, this particular model conditions the harmony on the melody for generation. This just means that the harmony is dependent on the melody for note generation.

Hang Chu, et al.’s stacked RNN model. Each layer is responsible for addressing a different aspect of a song.

This kind of modeling feels odd to me, as it does not seem to approximate how humans would approach composing popular music. Speaking personally as a classically trained pianist, I would never consider writing down melody notes without first considering the harmony notes. This is because the harmony notes both define and limit what my melody notes can be. Axis of Awesome, in their once viral YouTube video, demonstrated this idea long ago.

Video demonstrating how different pop melodies are all dependent on the same four chords.

Their video displays a defining attribute of western pop music: that the harmony, or those four chords, strongly determines what the melody will be. In data science language, we can say that a conditional probability governs the statistical relationship between the harmony and the melody. This is the case because the melody notes are naturally dependent on what the harmony notes are. One could thus argue that the harmony notes both limit and enable which melody notes can be chosen in a particular song.

I love building my own solutions to solve complex problems. As such, I decided to construct my own model that might capture the rich underlying structure of musical data in my own way. I began doing so by focusing on the predetermined probabilistic fate governing the relationship between different kinds of musical notes. An example is what I mentioned above — the “vertical” relationship between the harmony and the melody.

For the data, I utilized 20 different western pop songs in midi-format (the complete list of the songs can be found here:

Using the music21 python library, I processed these midi files largely (but not completely) based on the Markov Process. This allowed me to extract the statistical relationship between different types of notes in my input data. Specifically, I calculated the transition probabilities of my musical notes. This basically means that as notes transition from one to the next, we can get the probability of that transition happening. (More in-depth explanation down below)

midi: a digitized version of a song.
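To make the idea of transition probabilities concrete, here is a minimal sketch in plain Python. It is not the code from this project: the function name `transition_probabilities` and the short melody are hypothetical, and it assumes the notes have already been extracted from a midi file into a list of note names.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequence):
    """Estimate first-order transition probabilities from one sequence of states."""
    counts = defaultdict(Counter)
    # Count how often each state is immediately followed by each other state.
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    # Normalize every row of counts so that its probabilities sum to 1.
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }

# Hypothetical melody written as note names (not taken from the dataset).
melody = ["C", "A", "C", "A", "G", "C", "E"]
melody_to_melody = transition_probabilities(melody)

for note, row in melody_to_melody.items():
    print(note, row)
```

In this toy melody, C is followed by A twice and by E once, so the row for C gives A a probability of 2/3 and E a probability of 1/3. The final note E never transitions anywhere, so it gets no row of its own.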

First, I extracted the “vertical” transition probabilities between the harmony notes and the melody notes. I also calculated all the “horizontal” transition probabilities among the melody notes according to the dataset. I completed this task for the harmony notes as well. The chart below demonstrates an example of three different transition matrices between different types of notes in the musical data.

Transition Probabilities, examples. Top: between Harmony and Melody notes — Middle: between Melody notes — Bottom: between Harmony notes
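The “vertical” matrix can be estimated in much the same way as the horizontal ones, except that it counts which melody notes sound over which harmony note. The sketch below is an illustration under assumptions: the function name `vertical_probabilities` and the (harmony, melody) pairs are made up, and it presumes the song has already been aligned so that every melody note is paired with the harmony note playing at that moment.

```python
from collections import Counter, defaultdict

def vertical_probabilities(pairs):
    """Estimate P(melody note | harmony note) from aligned (harmony, melody) pairs."""
    counts = defaultdict(Counter)
    # Count how often each melody note is heard over each harmony note.
    for harmony, melody in pairs:
        counts[harmony][melody] += 1
    # Turn each harmony note's counts into a probability distribution.
    return {
        harmony: {m: n / sum(row.values()) for m, n in row.items()}
        for harmony, row in counts.items()
    }

# Hypothetical aligned pairs (harmony note, melody note), not from the dataset.
pairs = [("F", "C"), ("F", "A"), ("F", "C"), ("F", "F"), ("C", "E"), ("C", "G")]
harmony_to_melody = vertical_probabilities(pairs)

for harmony, row in harmony_to_melody.items():
    print(harmony, row)
```

Here Harmony Note F is heard with Melody Note C in two of its four pairs, so C gets a probability of 0.5 under F, while A and F get 0.25 each.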

Using these three probability matrices, my model will follow these simple directions.

1. Select a random Harmony Note available from the data.
2. Select a Melody Note based on that Harmony Note using the first probability matrix seen above.
3. Select a Melody Note based on that Melody Note using the second probability matrix seen above.
4. Repeat Step 3 until a certain cut off line.
Steps 1–4.
5. Select a new Harmony Note based on the previous Harmony Note using the third probability matrix.
6. Repeat Steps 1–4 until a certain cut off line.
Steps 5–6.

Here is a specific example of these 6 simple steps.

  1. The machine randomly chooses Harmony Note F.
  2. Harmony Note F has 4 Melody Notes to choose from. Using the first transition matrix, it might choose Melody Note C given that Melody Note C has a relatively high likelihood (24.5% chance of being selected).
  3. Melody Note C will turn to the second transition matrix to select the next melody note. It might choose Melody Note A due to its high probability (88%).
  4. Step 3 will continue generating new melody notes until a preset cut off line.
  5. Harmony Note F will turn to the third transition matrix to select the next harmony note. It might choose Harmony Note F or Harmony Note C based on their relatively high likelihoods.
  6. Steps 1–4 will repeat until a certain preset cut off line.
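The six steps can be sketched in a few lines of Python. This is a simplified illustration, not the project's actual code: the three matrices hold made-up toy probabilities, the names `sample` and `generate` are hypothetical, and the two “cut off lines” are assumed to be a fixed number of melody notes per harmony note and a fixed number of harmony notes. It also assumes one reading of Step 6, in which each repetition continues from the harmony note chosen in Step 5 rather than drawing a fresh random one.

```python
import random

# Toy transition matrices with made-up probabilities (not the real data).
harmony_to_melody = {"F": {"C": 0.5, "A": 0.3, "F": 0.2}, "C": {"E": 0.6, "G": 0.4}}
melody_to_melody = {
    "C": {"A": 0.8, "E": 0.2},
    "A": {"C": 0.5, "G": 0.5},
    "F": {"A": 1.0},
    "G": {"C": 1.0},
    "E": {"G": 0.7, "C": 0.3},
}
harmony_to_harmony = {"F": {"F": 0.5, "C": 0.5}, "C": {"F": 1.0}}

def sample(row, rng):
    """Draw one state from a {state: probability} row."""
    states = list(row)
    return rng.choices(states, weights=[row[s] for s in states], k=1)[0]

def generate(h2m, m2m, h2h, notes_per_harmony=4, num_harmonies=4, seed=None):
    rng = random.Random(seed)
    harmony = rng.choice(list(h2m))                 # Step 1: random harmony note
    song = []
    for _ in range(num_harmonies):                  # Step 6: outer cut off line
        melody = [sample(h2m[harmony], rng)]        # Step 2: melody given harmony
        while len(melody) < notes_per_harmony:      # Steps 3-4: melody given melody
            melody.append(sample(m2m[melody[-1]], rng))
        song.append((harmony, melody))
        harmony = sample(h2h[harmony], rng)         # Step 5: next harmony note
    return song

for harmony, melody in generate(harmony_to_melody, melody_to_melody,
                                harmony_to_harmony, seed=0):
    print(harmony, melody)
```

Each printed row is one harmony note followed by the melody notes generated on top of it. Passing a `seed` only makes the output repeatable; leaving it out gives a different song on every run.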

Here are examples of pop music generated through this architecture (from

Now comes the difficult part: how to evaluate different models. After all, my article claims that simple probabilities can outperform neural networks. But how do we evaluate my model against a neural network model? How can we objectively declare that my music is “more like pop” than A.I.-made music?

To answer this question, we must first ask what exactly defines pop music in the first place. I gave the first definition already: the statistical relationship between the harmony and the melody. But there is another defining element of pop music. This is how pop music has a clearly defined beginning, middle, and an end (intro, verse, bridge, chorus, outro, etc.) that repeat multiple times within a song.

For instance, “Let it go, let it go, can’t hold it back anymore…” is in the middle segment of the music, rather than the beginning and the end. And this section repeats three times in the song.

With this in mind, we can use what is called a self-similarity matrix. In very simple terms, the self-similarity matrix mathematically visualizes that beginning, middle, and end of a song. Below is a self-similarity matrix of the song, Falling Slowly from the movie Once.

Each tiny block represents all the notes played within four beats of the song. Each big block along the 45-degree diagonal represents a segment of the song.

The first blue cluster represents the beginning portion of a song, while the next yellow cluster represents another segment of that song. The first and the third clusters are shaded the same due to their (self)similarity. The second and the fourth clusters are shaded the same due to their own (self)similarity.
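For readers who want to see how such a matrix can be computed, here is a minimal sketch. It makes several assumptions that are not stated in this article: each four-beat block is summarized as a count of its note names, two blocks are compared with cosine similarity, and the toy “song” is invented. The names `bar_vector`, `cosine`, and `self_similarity` are hypothetical.

```python
import math
from collections import Counter

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def bar_vector(notes):
    """Summarize one four-beat block as a count of each note name."""
    counts = Counter(notes)
    return [counts[p] for p in PITCH_CLASSES]

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def self_similarity(bars):
    """Compare every block of a song against every other block."""
    vectors = [bar_vector(bar) for bar in bars]
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Hypothetical song with two alternating segments (A, B, A, B).
segment_a = ["C", "E", "G", "E"]
segment_b = ["F", "A", "C", "A"]
bars = [segment_a, segment_b, segment_a, segment_b]
ssm = self_similarity(bars)

for row in ssm:
    print(" ".join(f"{value:.2f}" for value in row))
```

In the printed grid, blocks 1 and 3 score close to 1 against each other, as do blocks 2 and 4, while neighboring blocks score much lower. That alternating pattern is the same kind of repeated structure that shows up as same-colored clusters in the figure.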

I made twenty of these self-similarity matrices of the twenty pop songs I used as input data. I then made my machine copy (the average of) their structures as faithfully as it can (for more details, please ask in the comments!).

The results are telling. Before the self-similarity matrix, my machine produced sounds that had no internal repetitive structure. But after copying the structures of the input data, you can then see those boundaries in my generated music shown below.

Before and after utilizing the self-similarity matrix.

Compared to this, the self-similarity matrix of the pop music produced by the neural network at the University of Toronto looks like this:

And this is how you can compare and evaluate different models — based on the boundaries of their self-similarity matrices!

The final problem that I wanted to solve was generalization. By generalization, I ask: how can we generalize my data-driven music model so that it can be applied to situations other than making pop music? In other words, is there another human-made invention which shares the architecture of my pop music maker?

After much deliberation, I discovered that there is one other human cultural creation that does have this structure as internal to its data — and it’s pop lyrics!

Take for example I’ll Be by Edwin McCain. A snippet of it goes like this:

I’ll be your cryin’ shoulder
I’ll be love’s suicide
I’ll be better when I’m older
I’ll be the greatest fan of your life

Let us break down the lyrics, using the same context of generation in machine learning. We might treat ‘I’ll be’ as the first input to the language model. This bigram will be used to generate ‘your’, which generates ‘crying’, which leads to ‘shoulder’.

Then comes the very important question: is the first word of the next sentence (another ‘I’ll be’) dependent on the last word, ‘shoulder’? In other words, is there any relationship between the last word of the first sentence and the first word of the next sentence?

To me, the answer is NO. Once the sentence terminates with ‘shoulder’, the next word is generated based on the previous ‘I’ll be’, not on ‘shoulder’. This is because the first words of each sentence are deliberately repeated, signifying that a similar conditional relationship exists between the first words of each sentence. These first words become the trigger point for the sequence of words that follows.
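The same three-matrix idea can be sketched for lyrics, with the opening words of each line playing the role of the harmony and the remaining words playing the role of the melody. This is only an illustration under assumptions: the toy lines are invented, the opener is assumed to be the first two words of a line, and the names `build_chains`, `draw`, and `generate_lines` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_chains(lines, opener_len=2):
    """Count opener->opener, opener->first word, and word->word transitions."""
    opener_to_opener = defaultdict(Counter)  # plays the role of harmony -> harmony
    opener_to_word = defaultdict(Counter)    # plays the role of harmony -> melody
    word_to_word = defaultdict(Counter)      # plays the role of melody -> melody
    previous_opener = None
    for line in lines:
        words = line.split()
        opener, rest = " ".join(words[:opener_len]), words[opener_len:]
        if previous_opener is not None:
            opener_to_opener[previous_opener][opener] += 1
        previous_opener = opener
        if rest:
            opener_to_word[opener][rest[0]] += 1
        for current, following in zip(rest, rest[1:]):
            word_to_word[current][following] += 1
    return opener_to_opener, opener_to_word, word_to_word

def draw(counter, rng):
    """Sample one key from a Counter, weighted by its counts."""
    items = list(counter)
    return rng.choices(items, weights=[counter[i] for i in items], k=1)[0]

def generate_lines(chains, first_opener, num_lines=4, max_words=4, seed=None):
    opener_to_opener, opener_to_word, word_to_word = chains
    rng = random.Random(seed)
    opener, result = first_opener, []
    for _ in range(num_lines):
        words = [draw(opener_to_word[opener], rng)]
        while len(words) < max_words and words[-1] in word_to_word:
            words.append(draw(word_to_word[words[-1]], rng))
        result.append(opener + " " + " ".join(words))
        # The next opener depends on the previous opener, not on the last word.
        if opener in opener_to_opener:
            opener = draw(opener_to_opener[opener], rng)
    return result

# Invented toy lines, used only to exercise the sketch.
corpus = [
    "i will carry your heavy bag",
    "i will sing your favorite song",
    "you can call me any time",
    "i will wait another day",
]
chains = build_chains(corpus)
lyrics = generate_lines(chains, "i will", seed=1)
print("\n".join(lyrics))
```

The key design choice is in the last step of the loop: the last word of a line is never consulted when choosing how the next line begins, which mirrors the claim that each new opener depends on the previous opener.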

I find this to be a fascinating discovery. Both pop music and pop lyrics seem to have this architecture as internal to their data! Isn’t that super fascinating?

You can visit my website to generate both pop music and pop lyrics.

The github for this project is here:

If you have any questions, feel free to reach me @