Deep DX – Part 5: Getting Ready to Generate

In previous parts of this ongoing series on how to use a generative neural network to create new patches for the Yamaha DX7 synthesizer, we have looked at how to qualify data for training and how to represent patches in a way that is suitable both for finding patterns and for visualizing them. Now the time has come to look at how patches are actually generated!

Patterns of Patchmaking

Visualizing the encoded parameters of a bank of 32 sounds illustrates how the parameters are grouped in areas, corresponding to the operators

Is there a pattern to good-sounding DX7 sounds? Over the years, many good practices around creating useful or interesting sounds out of FM synthesis have emerged, and with the release of instruments such as Elektron’s Digitone and the Kodamo EssenceFM (to name a few) interest in FM as a sound source has had a resurgence in recent years. I firmly believe there are even more patterns to successful “patchmaking” to be found, and what better tools than deep learning networks to aid us with this?

The most popular model for this type of job, as of writing this, is a GAN, or Generative Adversarial Network, as you get two functions from training it: being able to recognize useful DX7 patches, and (hopefully) being able to generate them as well.

Sorting the Good from the Bad

One part of the GAN is often referred to as the discriminator: a binary classifier that, when presented with something posing as a DX7 patch in our internal format, responds by stating whether it is the real deal or a fake. Classifiers are known mainly for their ability to respond to input data by describing it in terms of a fixed set of features, often using probabilities. If you have a surveillance camera that can distinguish between vehicles, animals and humans, and set a different response for each (“turn on floodlights”, “alert me”, and so on), it contains a classifier, and classifiers became famous early on through their ability to describe the contents of a photograph (“a gorilla in a recliner, holding a glass of beer”).

In our case, we don’t need more than a single output: a probability, rounded to one or zero (true or false), indicating whether the patch data we present seems to be genuine or rubbish.
To train a classifier, you present it with a steady flow of real patch data mixed with made-up patches, labeling each one as real or fake. By trying to minimize its error rate, the classifier (which we call the discriminator here) gradually learns to recognize patches.
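
As a rough sketch of what such a discriminator could look like, here is a minimal PyTorch example. The layer sizes and the flat PATCH_SIZE constant are illustrative assumptions, not the actual model used in this series:

```python
import torch
import torch.nn as nn

# Assumed flat patch size: 1 global byte + 6 operator blocks of 16 values each.
# The real encoding used in this series differs; this is just an illustration.
PATCH_SIZE = 1 + 6 * 16

class Discriminator(nn.Module):
    """Binary classifier: given an encoded patch, output the probability that it is real."""
    def __init__(self, patch_size: int = PATCH_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_size, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 64),
            nn.LeakyReLU(0.2),
            nn.Linear(64, 1),
            nn.Sigmoid(),  # a single output in [0, 1]: "real" vs. "fake"
        )

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        return self.net(patch)
```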

Combining Forces

If all you needed was a way to tell real DX sounds from fake or bad ones, that would describe our entire setup. (It’s not quite that easy, as we first need a way to cull proper patch data from the large data set containing numerous erroneous ones, but we already talked about that at length earlier, so I will spare you the details.)

Discriminator waiting for the Generator to finish before judging the resulting artwork

But we also want to train another network to actually generate patches that make sense to load into a DX7, in other words patches good enough to pass as real. The traditional way of doing this is therefore to feed the discriminator fake patches produced by the generator, mixed with actual real patches from the training data set. While the discriminator is rewarded for discerning real patches from generated ones, the generator is rewarded whenever it manages to fool the discriminator into passing off one of its own as “real”!

This way, you can train both networks in parallel.

To speed things up, however, it’s common practice to either pre-train the discriminator to a certain degree, or to train it at twice the rate of the generator. (Remember that the generator’s performance can only improve if the discriminator knows roughly what it’s doing.)
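
A minimal sketch of that schedule could look like this, assuming a Generator network and optimizers exist alongside the Discriminator sketched earlier (all names here are illustrative, not taken from the actual project):

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_step(generator, discriminator, g_opt, d_opt, real_batch, seed_dim=64):
    """One combined training step: the discriminator is updated twice per generator update."""
    batch_size = real_batch.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Train the discriminator (twice) on real patches and freshly generated fakes.
    for _ in range(2):
        d_opt.zero_grad()
        fake_batch = generator(torch.randn(batch_size, seed_dim)).detach()
        d_loss = (bce(discriminator(real_batch), real_labels)
                  + bce(discriminator(fake_batch), fake_labels))
        d_loss.backward()
        d_opt.step()

    # Train the generator: it is rewarded when the discriminator labels its output as real.
    g_opt.zero_grad()
    fake_batch = generator(torch.randn(batch_size, seed_dim))
    g_loss = bce(discriminator(fake_batch), real_labels)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```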

The generator typically starts from a “seed”, which takes the form of a vector of random values. Exploring these seeds, and in particular the space between different seed values, is an interesting topic in itself that we will dive into later.
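
To make the seed idea concrete, here is a hedged sketch of a generator that maps such a random vector to a patch-sized output; the SEED_DIM and PATCH_SIZE values are arbitrary assumptions, not figures from the project:

```python
import torch
import torch.nn as nn

SEED_DIM = 64            # assumed size of the random seed vector
PATCH_SIZE = 1 + 6 * 16  # same illustrative patch size as before, not the real encoding

class Generator(nn.Module):
    """Maps a random seed vector to a patch-sized output, scaled to [0, 1]."""
    def __init__(self, seed_dim: int = SEED_DIM, patch_size: int = PATCH_SIZE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(seed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, patch_size),
            nn.Sigmoid(),
        )

    def forward(self, seed: torch.Tensor) -> torch.Tensor:
        return self.net(seed)

# Two different seeds give two different patches; points on the line between them
# are the "space between seeds" that we will explore later in the series.
seed_a = torch.randn(1, SEED_DIM)
seed_b = torch.randn(1, SEED_DIM)
halfway = 0.5 * seed_a + 0.5 * seed_b
```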

What to Learn and What to Ignore

When I started this experiment, I realized that it could be approached as an iterative process, where I could start by trying to generate simple patches and then add features and aspects of FM synthesis as I improved the algorithm. To get a working proof of concept, I needed something akin to an MVP – “Minimum Viable Product” in product development terms – the minimum set of patch data needed to create useful FM sounds!

For the first version, I settled on the following parameters to include (sketched as a simple data structure after the list):

  • Frequency (coarse and fine)
  • Output level
  • Amplitude Envelope
  • Velocity Sensitivity (see below)
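
As a rough illustration, this subset could be held per operator in a structure like the one below. The field names and defaults are my own; only the value ranges are the standard DX7 parameter ranges, and the project's actual encoding differs:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OperatorMVP:
    """Illustrative per-operator subset for the proof of concept."""
    freq_coarse: int = 1                  # 0..31
    freq_fine: int = 0                    # 0..99
    output_level: int = 0                 # 0..99
    amp_envelope: List[int] = field(default_factory=lambda: [0] * 8)  # 4 rates + 4 levels, 0..99
    velocity_sensitivity: int = 0         # 0..7

@dataclass
class PatchMVP:
    """Six operators make up one (minimal) patch."""
    operators: List[OperatorMVP] = field(
        default_factory=lambda: [OperatorMVP() for _ in range(6)]
    )
```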

While Velocity Sensitivity might seem like a frivolous addition at first glance (especially to all you DX21 and DX9 owners out there), I would argue that velocity expressiveness (as well as aftertouch) was one of the key success factors in introducing the DX7 to musicians, especially since FM synthesis is extremely well suited for timbre modulation that “makes sense”, such as striking the keys harder adding more overtones, dissonance, fizzle and so on.

Feedback on Your Pitch

So, what about the other parameters? Feedback was not part of the proof of concept, but was added later when the results started to look (or rather: sound) promising. Feedback is arguably a fairly important parameter for FM sound design, so adding it broadened the sound palette (although for many it is probably better known as the “harshness control”). This, however, came with a problem of its own when it came to the actual data representation – more on that later.

Key scaling, detune and operator mode did not make the cut either. Working with fixed-frequency operators is also a well-known trick for FM sound design, but it felt like a different kettle of fish altogether. How many sounds even use that technique, to be honest? (Quite a few, as it turns out – more on that later.)

Two parameters ended up being treated very differently in the end. The first of them, pitch envelope, was not used at all. The reasoning behind excluding pitch envelope is that the way it has been implemented in the DX7, it is a global patch parameter, and as such only really useful for effect sounds. Having it available per operator (or even individually adjustable per operator) would have made a world of difference, but alas, that would not be introduced until the SY77.

This leaves us with one final parameter, which requires particular consideration:

The Algorithm.

Avoiding Discontinuity

If we are being honest for a minute, the algorithm is more of a fundament for the entire sound than just another parameter. The algorithm number encodes the entire structure of the patch, how the operators are connected to each other, and to a certain extent it ultimately dictates the range of possible sounds achievable by that particular patch.

If you have ever played with the algorithm parameter, you know what this means – it creates abrupt changes in the sound, even in the entire character of the sound.

The SuperMAX expansion for the original DX7 offered an opportunity to mathematically “morph” between patches, exploring interpolated sounds between two patch settings. Later, the Yamaha Montage (and MODX) synthesizers inherited a modern version where they used some type of machine learning algorithm to enable morphing or interpolation between patches. It is a really interesting way to find new sounds, but as soon as you try to smoothly change between two patches with different algorithms, it fails. Somewhere along the line, it has to switch from one algorithm to the other, which invariably induces an abrupt change to the otherwise smooth interpolation – a discontinuity that ruins the experience.
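
To see why, consider a naive morph function (purely illustrative, not SuperMAX or Montage code): every continuous parameter interpolates smoothly, but the algorithm can only jump from one value to the other:

```python
def morph(patch_a: dict, patch_b: dict, t: float) -> dict:
    """Naive linear morph between two patches, with t running from 0.0 to 1.0."""
    morphed = {}
    for key in patch_a:
        if key == "algorithm":
            # The algorithm is a discrete choice between 32 routings: it cannot be blended,
            # only switched - and that switch is exactly the discontinuity described above.
            morphed[key] = patch_a[key] if t < 0.5 else patch_b[key]
        else:
            # Continuous parameters (levels, rates, frequencies) interpolate smoothly.
            morphed[key] = (1 - t) * patch_a[key] + t * patch_b[key]
    return morphed
```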

For this reason, I opted to exclude algorithm from the model entirely, and to treat it differently.

The Complete Representation

Treating the algorithm as the fundament for generating patches simplified matters. As you might remember, each operator is represented by a block where every relevant parameter is stored as a byte, creating six blocks of similar structure. Since the data representation was focused on creating a structure that would allow the model to find patterns in and between these blocks, any parameter that is global rather than included in each operator block poses a problem. How would you represent it to reflect its global status?

With the algorithm, we dodged the bullet by removing it entirely, but as we saw earlier, feedback was included quite early on, and it is also a “global” parameter in the sense that it has one setting for the entire patch. Early tests repeated it on each operator block, using the same value over and over, but the model failed to pick up on the connection and promptly generated different feedback values for each position, causing unreliable results.

At the time of writing, in the current (third) version of the data model, feedback is stored as the first byte of data, separated from the operator blocks. It is apparent that the model does not fully learn how to use it, and in some cases the sound is improved by manually adjusting it slightly, so this is on the backlog for improvement in future iterations! For the proof of concept, the most commonly used algorithm in the training data was used, and the model was subsequently trained only on patches built with that algorithm.
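
As a simple illustration of that layout (the operator block size here is a placeholder; only the idea of feedback as the first byte followed by six operator blocks comes from the text):

```python
import numpy as np

NUM_OPERATORS = 6
OP_BLOCK_SIZE = 16  # assumed number of encoded values per operator block, not the real figure

def encode_patch(feedback: int, operator_blocks: list) -> np.ndarray:
    """Feedback goes into the first byte, followed by the six operator blocks."""
    assert len(operator_blocks) == NUM_OPERATORS
    flat = np.concatenate(
        [np.array([feedback], dtype=np.uint8)]
        + [np.asarray(block, dtype=np.uint8) for block in operator_blocks]
    )
    return flat  # shape: (1 + NUM_OPERATORS * OP_BLOCK_SIZE,)
```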

So, which is the most commonly used algorithm over 40 years of patches for the Yamaha DX7?

In the next article, we will answer that question, look at the various GAN architectures used, and start to see some results!
