If I were to ask you to think about the most beautiful scenery you have ever seen, you would probably build in your mind an image of your experience. If you took a picture of that moment, you would probably agree that it represents your memory well. The reason is that, when it comes to visual experience, images are the most natural representation for humans.

Figure 1 : Le Grand Combin, (CC BY 2.0, Lionel Roux)

Let us repeat this thought experiment with a piece of music. We can ask ourselves:

# What is the most natural representation for sound?

Here the answer is not trivial. Sound is a pressure wave that propagates through the air. Hence, it is often represented as the time series that a microphone would capture (See Figure 2).

Figure 2: Time-series of the signal Glockenspiel. We show 3 different scales of the signal.

Nevertheless, while this representation is standard for signal processing tasks, it does not do a good job of representing what we hear as humans. Here are a few reasons why.

- **Phase-dependent**: If the sign of the signal is flipped, it makes no audible difference. In general, the phase of a mono signal carries no useful information, yet it makes the representation more complex.
- **Multiscale**: The representation spans multiple scales, from tenths of a millisecond to minutes, and these scales are linked in non-trivial ways. In practice, this complexity seems unnecessary, as we only perceive events starting in the range of tenths of a second.
- **Unnatural**: It is impossible to tell how a sound wave will sound by looking at it, even for an expert.
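The first point is easy to check numerically. The sketch below (an illustration added for this post, not code from the original work) flips the sign of a signal, which shifts its phase by π everywhere, and shows that the magnitude spectrum is unchanged:

```python
import numpy as np

# Flipping the sign of a signal changes only its phase;
# the magnitude spectrum (what we mostly perceive) is identical.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)        # any mono signal
mag_x = np.abs(np.fft.rfft(x))       # magnitude spectrum of x
mag_neg = np.abs(np.fft.rfft(-x))    # magnitude spectrum of -x

print(np.allclose(mag_x, mag_neg))   # True: identical magnitudes
```
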

### What is the phase of a signal?

The sound wave of a pure tone is an oscillation that can be parameterized by a sine function: $x(t) = \sin(2\pi f t + \varphi)$. Here $f$ is the frequency of the signal and $\varphi$ its phase. As humans, we cannot directly perceive the phase.
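As a small illustration of the box above (the sampling rate, frequency, and phase values are arbitrary choices for the example, not taken from the article), here is such a pure tone in NumPy; shifting the phase moves the waveform but not the spectral peak:

```python
import numpy as np

# A pure tone x(t) = sin(2*pi*f*t + phi), sampled for one second.
fs = 8000                       # sampling rate in Hz (arbitrary for the demo)
t = np.arange(fs) / fs          # one second of time samples
f, phi = 440.0, 0.3             # an A4 tone with an arbitrary phase offset
x = np.sin(2 * np.pi * f * t + phi)

# With a 1-second window the FFT bins are spaced 1 Hz apart,
# so the magnitude spectrum peaks exactly at bin 440, whatever phi is.
peak_bin = np.argmax(np.abs(np.fft.rfft(x)))
print(peak_bin)                 # 440
```
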

When it comes to generating sound with a computer, this last point matters less, as computers handle information differently from humans. Nevertheless, the complexity added by the phase and the multiscale nature of the signal makes building a generative model much more challenging.

Fortunately, there exist other representations for sound. For example, the Fourier transform decomposes the signal into frequencies, which are "more" interpretable as they represent the different pitches of a sound. The Fourier transform represents the phase independently of the magnitude, making it simple to discard. Unfortunately, the time component of the sound is lost, making this representation even harder to use for generative modeling than the time series. To solve this last issue, one can use the Short-Time Fourier Transform (STFT), which performs a Fourier transform on each slice of the windowed signal. This transform outputs a Time-Frequency (TF) representation of the signal, as depicted in Figure 3.
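As an illustration (added for this post, not part of the original work), an STFT can be computed with `scipy.signal.stft`, which windows the signal and Fourier-transforms each slice:

```python
import numpy as np
from scipy.signal import stft

# Two seconds of a steady 440 Hz tone at an 8 kHz sampling rate.
fs = 8000
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)

# STFT with 256-sample Hann windows and 50% overlap (scipy defaults).
f, times, Z = stft(x, fs=fs, nperseg=256)

print(Z.shape)          # 129 frequency bins x the number of time frames
magnitude = np.abs(Z)   # the spectrogram of Figure 3 is the log of this
```
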

Figure 3: Time-Frequency transform (magnitude) of the signal Glockenspiel. The vertical lines are the precise times at which a bar of the glockenspiel is hit. The long horizontal line represents the base frequency of the resonating bar. The small horizontal lines are the harmonics.

The TF representation solves most of the issues that were discussed previously as it is:

- **More natural**: It is easier to interpret, as it acts as a kind of musical score of the signal. See for example Figure 3.
- **Less multiscale**: The smallest scales are aligned with what humans perceive.
- **Phase-free** (for discrimination): The magnitude of the TF transform is invariant with respect to the phase of the signal.

Because of these properties, TF representations (or variants of them) are standard in many audio problems such as discrimination, source separation, or inpainting.

# Why are TF representations generally not used for audio generation?

With this in mind, it can be surprising to notice that, when it comes to sound, most generative models still use the time representation (WaveNet [5], WaveGAN [3]). The problem mainly comes from the phase. While it can essentially be dropped for a discrimination task, for a generation task it has to be recovered or produced, as it is needed to eventually reconstruct the time representation. This is a very subtle operation, as `phase recovery` remains an unsolved problem despite decades of research.

Figure 4: Raw phase of the signal. The phase structure is not easily understandable. One reason why this representation is hard to grasp is that the phase is wrapped between $-\pi$ and $\pi$.
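A quick NumPy illustration of this wrapping (added for this post): `np.angle` returns the principal phase in $(-\pi, \pi]$, so true phases outside that interval come back shifted by multiples of $2\pi$, which shreds the visual structure of a phase image:

```python
import numpy as np

# Complex numbers with true phases 0, 3, 4 and 7 radians.
z = np.exp(1j * np.array([0.0, 3.0, 4.0, 7.0]))

# np.angle wraps to the principal value in (-pi, pi]:
# 4 -> 4 - 2*pi ~ -2.283,  7 -> 7 - 2*pi ~ 0.717
print(np.angle(z))
```
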

In practice, there are mainly three issues when it comes to handling the phase.

- An improperly reconstructed phase leads to audible reconstruction artifacts.
- The reconstruction is not simple and often involves time-consuming recursive algorithms.
- In the case where the TF amplitude is not properly estimated, it is in general not possible to find a good phase. In that case, we say that the magnitude is not **consistent**.
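The "time-consuming recursive algorithms" mentioned above are typified by Griffin-Lim iteration. Below is a minimal, illustrative sketch built on `scipy.signal` (added for this post; it is not the method used later in this article):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, nperseg=256):
    """Minimal Griffin-Lim sketch: alternate between imposing the target
    magnitude and keeping the phase of the current signal estimate."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)      # back to time domain
        _, _, S = stft(x, nperseg=nperseg)              # re-analyse the estimate
        phase = np.exp(1j * np.angle(S))                # keep only its phase
    _, x = istft(mag * phase, nperseg=nperseg)
    return x

# Usage: recover a tone from its magnitude spectrogram alone.
t = np.arange(8192) / 8000.0
x = np.sin(2 * np.pi * 440.0 * t)
_, _, Z = stft(x, nperseg=256)
x_rec = griffin_lim(np.abs(Z))
```

Note the cost: every iteration performs a full STFT/inverse-STFT pair, which is exactly why such recursive schemes are slow.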

**What is consistency?**

The TF representation is usually redundant, meaning that it is larger than the time representation: for a signal of length L, the STFT contains more than L coefficients. A magnitude is consistent when some signal actually has that magnitude as its STFT; an arbitrary (e.g. generated) magnitude need not be, and then no phase can reconstruct it exactly.
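This redundancy is easy to observe numerically. In the sketch below (illustrative only), a 50%-overlap STFT stores about as many complex coefficients as the signal has samples, i.e. roughly twice as many real numbers:

```python
import numpy as np
from scipy.signal import stft

L = 16384
x = np.random.default_rng(0).standard_normal(L)

# 256-sample windows with the default 50% overlap (hop of 128 samples).
_, _, Z = stft(x, nperseg=256)

# Each complex coefficient holds two real numbers, so the representation
# is roughly twice the size of the original signal.
print(Z.size, "complex coefficients for", L, "real samples")
```
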

# What did we do to solve the phase problem with generative models?

**Math**. While phase recovery in the discrete case is still an unsolved problem, in the continuous case there exists a particular TF transform that has an analytic relation linking its phase and its magnitude (see the math box).
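Schematically, this analytic relation ties the gradients of the phase to those of the log-magnitude for the Gaussian-windowed STFT. The exact constants (including their signs) depend on the Gaussian window width and on the STFT convention, so they are deliberately omitted in this sketch:

```latex
% Schematic phase--magnitude relations for the Gaussian-windowed STFT,
% with M(t,f) the magnitude and \varphi(t,f) the phase; constants omitted.
\frac{\partial \varphi}{\partial t}(t,f) \;\propto\; \frac{\partial \log M}{\partial f}(t,f),
\qquad
\frac{\partial \varphi}{\partial f}(t,f) \;\propto\; \frac{\partial \log M}{\partial t}(t,f)
\;+\; \text{a convention-dependent linear term.}
```

Integrating these estimated phase gradients over the TF plane is what allows the phase to be recovered from the magnitude alone, without the iterative schemes described earlier.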

Figure 5: The consistency score over the course of training. The grey curve corresponds to a failing network.

# TiFGAN: Time-Frequency Generative Adversarial Network

We present TiFGAN, a generative adversarial network (GAN) that produces 1-second audio snippets using a TF representation. Our pipeline, summarized in Figure 6, consists of 3 steps. First, we design an appropriate TF transform and compute the representation of the dataset. Second, we train a GAN on these representations. Finally, we recover the phase using PGHI [2] and reconstruct the audio signal.

To evaluate our network, we performed a psychoacoustic experiment where the subjects were asked to choose the most realistic audio signal from two candidates. This allows us to compare our network with other methods and with real audio signals. Numerical results can be found in Table 1, taken directly from [1] (hence eq. (11) refers to [1]). We observe that TiFGAN is preferred over WaveGAN 75% of the time. Furthermore, 14% of the time, listeners preferred the generated audio to the real signals. Overall, we found that, in terms of quality, TiFGAN outperforms the previous state-of-the-art GAN [3].

Judge the result for yourself… Here we generated 1-second audio snippets. The goal is to make them indistinguishable from the original ones.

**Nathanaël Perraudin, Sr. Data Scientist, Swiss Data Science Center**

### To go further, you can

- Check the code
- Read the ICML paper
- Look at the webpage

### References

[1] Marafioti, A., Perraudin, N., Holighaus, N., & Majdak, P. (2019). Adversarial generation of time-frequency features with application in audio synthesis. In *International Conference on Machine Learning* (pp. 4352-4362).

[2] Průša, Z., Balazs, P., & Søndergaard, P. L. (2017). A noniterative method for reconstruction of phase from STFT magnitude. *IEEE/ACM Transactions on Audio, Speech, and Language Processing*, 25(5), 1154-1164.

[3] Donahue, C., McAuley, J., & Puckette, M. (2019). Adversarial audio synthesis. *International Conference on Learning Representations*.

[4] Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). GANSynth: Adversarial neural audio synthesis. *International Conference on Learning Representations*.

[5] Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. *arXiv preprint arXiv:1609.03499*.