If I asked you to think about the most beautiful scenery you have ever seen, you would probably build an image of that experience in your mind. If you had taken a picture at that moment, you would probably agree that it represents your memory well. The reason is that, when it comes to visual experience, images are the most natural representation for humans.

Figure 1: Le Grand Combin (CC BY 2.0, Lionel Roux)

Let us repeat this thought experiment with a piece of music. We can ask ourselves:

What is the most natural representation for sound?

Here the answer is not trivial. Sound is a pressure wave that propagates through the air. Hence, it is often represented as the time series that a microphone would capture (See Figure 2).

Figure 2: Time-series of the signal Glockenspiel. We show 3 different scales of the signal.

Nevertheless, while this representation is standard for signal processing tasks, it does not do a good job of representing what we hear as humans. Here are a few reasons why.

  • Phase-dependent: Changing the sign of the signal makes no audible difference. In general, the phase carries no useful perceptual information for mono signals, yet it makes the representation more complex.
  • Multiscale: The representation spans multiple scales, from tenths of a millisecond to minutes, and these scales are linked in non-trivial ways. In practice, this complexity seems unnecessary, as we only start perceiving events in the range of tenths of a second.
  • Unnatural: It is impossible to tell how a waveform will sound just by looking at it, even for an expert.

What is the phase of a signal?


The sound wave of a pure tone is an oscillation that can be parameterized by a sine function: s(t) = \sin(2 \pi f t + \phi). Here f is the frequency of the signal and \phi its phase. As humans, we cannot directly perceive the phase \phi. For example, the two blue signals sound identical. Furthermore, the L2 norm cannot be used to measure perceptual similarity between two signals. Even though we perceive the red and the blue signals as very different sounds, their relative difference is \sqrt{2}, a significantly lower value than the relative difference between the two blue signals, which is 2.
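These numbers are easy to reproduce. In the sketch below we take a 440 Hz tone for the blue signal and a tone one octave higher for the red one (the exact signals shown in the figure may differ); the second blue signal is the first one with its sign flipped, i.e. a phase shift of \pi:

```python
import numpy as np

fs = 16000                                 # sampling rate (Hz)
t = np.arange(fs) / fs                     # 1 second of samples
blue_1 = np.sin(2 * np.pi * 440 * t)       # a pure tone
blue_2 = -blue_1                           # phase shift of pi: sounds identical
red = np.sin(2 * np.pi * 880 * t)          # one octave higher: clearly different sound

def rel_diff(a, b):
    """Relative L2 difference ||a - b|| / ||a||."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)

print(rel_diff(blue_1, blue_2))   # 2.0             (yet inaudible)
print(rel_diff(blue_1, red))      # ~1.41 = sqrt(2) (yet clearly audible)
```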

When it comes to generating sound with a computer, this last point does not matter, as computers handle information differently than humans. Nevertheless, the complexity added by the phase and by the multiscale nature of the signal makes it much more challenging to build a generative model.

Fortunately, other representations for sound exist. For example, the Fourier transform decomposes the signal into frequencies, which are “more” interpretable as they correspond to the different pitches of a sound. The Fourier transform also represents the phase independently of the magnitude, making the phase simple to discard. Unfortunately, the time information of the sound is lost, making this representation even less suited to generative modeling than the time series.
To solve this last issue, one can use the Short-Time Fourier Transform (STFT), which performs a Fourier transform on each slice of the windowed signal. This transform outputs a Time-Frequency (TF) representation of the signal, as depicted in Figure 3.
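For illustration, here is a minimal sketch of how such a TF representation can be computed with scipy (the toy signal, window type, window length and hop size below are arbitrary choices for this example, not the parameters used for Figure 3):

```python
import numpy as np
from scipy import signal

fs = 16000                                   # sampling rate (Hz)
t = np.arange(4 * fs) / fs                   # 4 seconds of samples
# A toy stand-in for the glockenspiel: a decaying high-pitched tone.
x = np.exp(-3 * t) * np.sin(2 * np.pi * 1568 * t)

# STFT: slide a window along the signal and Fourier-transform each slice.
f, tau, Z = signal.stft(x, fs=fs, window='hann', nperseg=512, noverlap=384)

magnitude = np.abs(Z)       # what Figure 3 displays (often shown in dB: 20*log10)
phase = np.angle(Z)         # what Figure 4 displays (wrapped, hard to interpret)
```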

Figure 3: Time-Frequency transform (magnitude) of the signal Glockenspiel. The vertical lines mark the precise times at which a bar of the glockenspiel is hit. The long horizontal line represents the base frequency of the resonating bar. The small horizontal lines are the harmonics.

The TF representation solves most of the issues that were discussed previously as it is:

  • More natural: It is easier to interpret, being something like a musical score of the signal. See for example Figure 3.
  • Less multiscale: The smallest scales are aligned with what humans perceive.
  • Phase-free (for discrimination): The amplitude of the TF transform is invariant with respect to the signal phase (see the numerical check below).

Because of these properties, TF representations (or variants of them) are standard in many audio problems such as discrimination, source separation or inpainting.
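The phase-free property is easy to check numerically: flipping the sign of a signal (a phase shift of \pi) changes the STFT phase but not its magnitude. A minimal check with scipy (the toy signal and transform parameters are arbitrary):

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs
x = np.exp(-2 * t) * np.sin(2 * np.pi * 440 * t)       # a toy decaying tone

_, _, Z = signal.stft(x, fs=fs, nperseg=512)
_, _, Z_flipped = signal.stft(-x, fs=fs, nperseg=512)   # sign flip = phase shift of pi

print(np.allclose(np.abs(Z), np.abs(Z_flipped)))        # True: the magnitude is unchanged
print(np.allclose(np.angle(Z), np.angle(Z_flipped)))    # False: the phase did change
```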

Why are TF representations generally not used for audio generation?

With this in mind, it can be surprising that, when it comes to sound, most generative models still use the time representation (WaveNet [5], WaveGAN [3]). The problem mainly comes from the phase. While the phase can essentially be dropped for a discrimination task, it has to be recovered or produced for a generation task, as it is needed to eventually reconstruct the time representation. This is a very delicate operation, as phase recovery remains an unsolved problem despite decades of research.

Figure 4: Raw phase of the signal. The phase structure is not easily understandable. One reason why this representation is hard to grasp is that the phase is wrapped between 0 and 2\pi.

In practice, there are three main issues when it comes to handling the phase.

  1. An improperly reconstructed phase leads to audible reconstruction artifacts.
  2. The reconstruction is not simple and often involves time-consuming iterative algorithms (a minimal sketch of one such algorithm follows this list).
  3. If the TF magnitude is not properly estimated, it is in general not possible to find a good phase. In that case, we say that the magnitude is not consistent.
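To give an idea of what item 2 refers to, here is a minimal sketch of the classic iterative phase-recovery scheme of Griffin and Lim (alternating projections between the time domain and the TF domain). This is a common baseline, not the method we use; it assumes scipy's STFT conventions and that the transform parameters match the ones used to compute the magnitude:

```python
import numpy as np
from scipy import signal

def griffin_lim(magnitude, fs, nperseg=512, noverlap=384, n_iter=100):
    """Estimate a phase for a (linear-scale) STFT magnitude by alternating projections.
    Slow (one STFT/ISTFT pair per iteration) and only locally optimal."""
    rng = np.random.default_rng(0)
    phase = rng.uniform(-np.pi, np.pi, magnitude.shape)      # random initial phase
    for _ in range(n_iter):
        # Project onto the set of consistent spectrograms: resynthesize, then re-analyze.
        _, x = signal.istft(magnitude * np.exp(1j * phase), fs=fs,
                            nperseg=nperseg, noverlap=noverlap)
        _, _, Z = signal.stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
        # Keep the target magnitude, update only the phase estimate.
        phase = np.angle(Z)
    _, x = signal.istft(magnitude * np.exp(1j * phase), fs=fs,
                        nperseg=nperseg, noverlap=noverlap)
    return x
```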

What is consistency?

The TF representation is usually redundant, meaning that it is larger than the time representation. For a signal of size L, the TF representation has r\cdot L elements, where r is the redundancy, typically 2, 4 or 8. Since the set of TF arrays is larger than the set of time signals, there exist TF arrays that are not the transform of any signal. These TF representations are called inconsistent.
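A rough way to see the redundancy is to count the TF coefficients that scipy's STFT produces for a 1-second signal (the 512-sample window and 128-sample hop below are illustrative choices; the transform used in [1] may differ):

```python
import numpy as np
from scipy import signal

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)        # 1 second: L = 16000 samples

# 512-sample Hann window, hop of 128 samples (noverlap = 512 - 128)
_, _, Z = signal.stft(x, fs=fs, nperseg=512, noverlap=384)

print(x.size)    # 16000 real samples
print(Z.size)    # ~32000 one-sided complex coefficients, i.e. ~64000 real numbers:
                 # a redundancy of about r = 4 (= window length / hop)
```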


What did we do to solve the phase problem with generative models?

 

Math. While, in the discrete case, phase recovery is still an unsolved problem, in the continuous case there exists a particular TF transform (the STFT with a Gaussian window) whose phase and magnitude are linked by an analytic relation (See the math box).
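Schematically, writing the STFT of a signal as M(t,f)\, e^{i\phi(t,f)} and using a Gaussian window, the relation says that the phase gradient can be computed from the gradient of the log-magnitude (the proportionality constants and signs depend on the window width and on the STFT convention; the exact expressions are the ones discretized in [1, 2]):

\frac{\partial \phi}{\partial t}(t,f) \propto \frac{\partial \log M}{\partial f}(t,f), \qquad \frac{\partial \phi}{\partial f}(t,f) \propto \frac{\partial \log M}{\partial t}(t,f)

Once the phase gradient is known, the phase itself can be recovered by numerical integration.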

Based on a discretization of these relations, a non-iterative phase reconstruction method called Phase Gradient Heap Integration (PGHI) was recently proposed [2].

Our work [1] defines practical rules for building a TF representation for generative models so that it works optimally with PGHI. Following these rules, the phase can be reconstructed efficiently from the magnitudes without undesired artifacts. This solves the phase reconstruction problem.

The math box
The rules we propose are geared towards allowing the network to generate consistent spectrograms, but we still need to verify that this is the case: even PGHI will not produce a good phase for an inconsistent spectrogram. To check that our spectrograms are consistent, we propose a consistency score based on the discretization of second-order analytic relations (See equation 3 in the math box). This score allows assessing the quality of the produced magnitudes and is a good proxy for the final audio quality of the generated signals.
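For intuition only, here is a much cruder (and slower) consistency proxy, not the score from [1]: recover a phase for the generated magnitude, resynthesize a signal, re-analyze it, and measure how much the magnitude changed. Here `recover_phase` stands for any phase-recovery routine (PGHI, Griffin-Lim, ...), and scipy's STFT conventions are assumed:

```python
import numpy as np
from scipy import signal

def magnitude_consistency_gap(magnitude, recover_phase, fs, nperseg=512, noverlap=384):
    """Relative gap between a target STFT magnitude and the magnitude obtained after
    resynthesis. A small gap means some real signal (almost) has this magnitude
    (consistent); a large gap, even with a good phase-recovery routine, indicates
    an inconsistent magnitude."""
    phase = recover_phase(magnitude)
    _, x = signal.istft(magnitude * np.exp(1j * phase), fs=fs,
                        nperseg=nperseg, noverlap=noverlap)
    _, _, Z = signal.stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.linalg.norm(np.abs(Z) - magnitude) / np.linalg.norm(magnitude)
```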

Figure 5: The consistency score over the course of training. The grey curve corresponds to a failing network.

TiFGAN: Time-Frequency Generative Adversarial Network

We present TiFGAN, a generative adversarial network (GAN) that produces 1-second audio snippets using a TF representation. Our pipeline, summarized in Figure 6, consists of three steps. First, we design an appropriate TF transform and compute the representation of the dataset. Second, we train a GAN on these representations. Finally, we recover the phase using PGHI and reconstruct the audio signal.

Figure 6: TiFGAN pipeline. The GAN learns to generate the TF magnitudes.

To evaluate our network, we performed a psychoacoustic experiment in which subjects were asked to choose the more realistic audio signal from two candidates. This allows comparing our network both with other methods and with real audio signals. Numerical results can be found in Table 1, taken directly from [1] (hence eq. (11) refers to [1]). We observe that TiFGAN is preferred 75% of the time over WaveGAN. Furthermore, 14% of the time, listeners preferred the generated audio over the real signals. Overall, we found that, in terms of quality, TiFGAN outperforms the previous state-of-the-art GAN [3].

Judge the results for yourself… Here, we generated 1-second audio snippets. The goal is to make them indistinguishable from the original ones.

Audio examples for the Command dataset and the Piano dataset: Original, WaveGAN, Ours.

To go further you can