Technical background

Note: This page is only a brief overview of the methods used by RealSinger to produce a voice.
It is not necessary to read this chapter to use RealSinger. It is intended to answer technical questions some users might have about the internal algorithms.


To synthesize a realistic singing voice, the first idea that comes to the mind of the programmer is to use a collection of recorded phonemes to generate the voice.

Three problems quickly become apparent:

  1. The algorithm used must be able to generate the phoneme at any pitch (fundamental frequency). Recording every phoneme at all possible pitches is not feasible, because it would lead to a long and complex recording process, as well as huge voice files.
  2. The algorithm must be able to elongate, or stretch, the phoneme to any duration.
  3. The algorithm must be able to generate a smooth transition from one phoneme to another, in order to simulate the coarticulation phenomenon (the next phoneme starts to be heard before the current one is completely terminated).
A solution can be found for each of these problems in the published computer literature.

Efficient algorithms have been developed to solve problems 1 and 2. They process the recorded sample's digital data directly, allowing the programmer to change its frequency (pitch) as well as its duration. These algorithms are used in most popular sound editors to change the pitch and speed of a sound file independently. They are also used successfully in speech synthesis, because speech frequency (pitch) variations are quite small.
However, in the case of the sung voice, these algorithms cannot be used, because they do not work well when the pitch shift is too large. The result is not "wrong" as such, but the voice is distorted, just as when playing a magnetic tape at too high a speed (chipmunk voice).
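The tape comparison can be illustrated with a short sketch. It is not one of the algorithms mentioned above: it is the naive "play it faster" approach, applied to a toy signal whose 200 Hz fundamental and 1000 Hz resonance are arbitrary values chosen for the example. It shows that speeding up a recording moves the resonance along with the pitch, which is what makes the voice sound unnatural.

```python
import numpy as np

def peak_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest spectral component."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

def resample_linear(signal, factor):
    """Naive 'tape speed' change: read the signal `factor` times faster."""
    new_len = int(len(signal) / factor)
    positions = np.arange(new_len) * factor
    return np.interp(positions, np.arange(len(signal)), signal)

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second of samples

# Toy "voice" (assumed values): 200 Hz fundamental plus a strong resonance at 1000 Hz
voice = 0.5 * np.sin(2 * np.pi * 200 * t) + 1.0 * np.sin(2 * np.pi * 1000 * t)

fast = resample_linear(voice, 1.5)  # tape played 1.5 times too fast

print(peak_frequency(voice, sample_rate))  # resonance of the original
print(peak_frequency(fast, sample_rate))   # resonance after the speed-up
```

In this sketch the resonance moves from about 1000 Hz to about 1500 Hz and the sound becomes shorter, whereas a real singer raising the pitch keeps the resonances roughly where they were.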

For problem 3, a common solution is to record not only the individual phonemes of a language, but all possible combinations of two or three phonemes (diphonemes/triphonemes). This system stores the coarticulation effect and makes the synthesized voice more realistic. However, here again, the recording process is quite difficult and extensive, sometimes requiring several hours of work for the speaker or singer. The resulting voice file is often quite large (several megabytes).
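To give an idea of how quickly the number of recordings grows, the sketch below counts every ordered combination of two or three phonemes. The figure of 40 phonemes is a hypothetical value chosen for the example, and the results are upper bounds, since not every combination actually occurs in a given language.

```python
def recording_counts(n_phonemes):
    """Upper bounds on the number of units to record for each approach."""
    return {
        "phonemes": n_phonemes,
        "diphonemes": n_phonemes ** 2,   # every ordered pair
        "triphonemes": n_phonemes ** 3,  # every ordered triple
    }

counts = recording_counts(40)  # 40 is an assumed phoneme count
print(counts)
```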

RealSinger uses original algorithms to solve all three of these problems at the same time, by manipulating frequency spectra.
Some speech synthesizers have tried to use voice frequency spectra to generate voice in the past.
However, this method proved to be difficult to implement, because recreating a signal from a processed spectrum using an inverse Fast Fourier Transform (IFFT) requires that the "phase" values be readjusted properly. If they are not properly adjusted, consecutive pieces of signal won't join and an unwanted background noise will be heard.
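The phase problem can be reproduced with a generic overlap-add sketch. It is only an illustration of the general issue, not RealSinger's code: the frame length, hop size and bin number are arbitrary values, and the function names are made up for the example. Each frame is built by an IFFT of a single-bin spectrum, then the frames are overlapped and summed.

```python
import numpy as np

def overlap_add_tone(bin_index, frame_len, hop, n_frames, adjust_phase):
    """Build a tone from overlapping IFFT frames, with or without phase adjustment."""
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)  # periodic Hann window
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i in range(n_frames):
        # Phase the tone must have at the start of frame i to continue frame i-1
        phase = 2 * np.pi * bin_index * hop * i / frame_len if adjust_phase else 0.0
        spectrum = np.zeros(frame_len // 2 + 1, dtype=complex)
        spectrum[bin_index] = (frame_len / 2) * np.exp(1j * phase)
        frame = np.fft.irfft(spectrum, n=frame_len)  # unit-amplitude cosine
        out[i * hop : i * hop + frame_len] += window * frame
    return out

def steady_state_error(signal, bin_index, frame_len, hop, n_frames):
    """Largest deviation from the ideal tone where two frames fully overlap."""
    m = np.arange(len(signal))
    ideal = np.cos(2 * np.pi * bin_index * m / frame_len)
    region = slice(hop, hop * n_frames)
    return np.max(np.abs(signal[region] - ideal[region]))

params = dict(bin_index=5, frame_len=64, hop=32, n_frames=8)
good = overlap_add_tone(adjust_phase=True, **params)
bad = overlap_add_tone(adjust_phase=False, **params)

print(steady_state_error(good, **params))  # close to zero: frames join cleanly
print(steady_state_error(bad, **params))   # large: frames partly cancel each other
```

With these values the phase advances by half a cycle from one frame to the next, so frames built with a phase of zero are out of step with their neighbours and the summed signal is strongly distorted.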

Voice spectrum

In speech or song, the glottal source waveform (the sound produced by the vocal cords when excited by the air stream from the lungs) is a combination of harmonics (frequencies that are multiples of the fundamental frequency f0).

On a power/frequency graph, this glottal source sound looks like a comb, with each tooth of the comb located at a frequency that is a multiple of the fundamental f0.

When the voice pitch increases, the fundamental frequency f0 shifts to the right (higher frequency), and the frequency offset between two consecutive harmonics increases too, to remain equal to f0.
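As a small illustration, the positions of the teeth of the comb can be listed for two pitches. The function name and the 1000 Hz upper limit are chosen for the example only.

```python
import numpy as np

def harmonic_frequencies(f0, max_freq):
    """Frequencies of the harmonics of f0 that lie at or below max_freq."""
    count = int(max_freq // f0)
    return f0 * np.arange(1, count + 1)

low = harmonic_frequencies(110.0, 1000.0)   # note A2
high = harmonic_frequencies(220.0, 1000.0)  # note A3, one octave higher

print(low)            # teeth every 110 Hz
print(high)           # teeth every 220 Hz, so fewer teeth in the same range
print(np.diff(high))  # the spacing between consecutive teeth equals f0
```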

As the sound passes through the vocal tract, some frequencies are enhanced by cavity resonances, and others are attenuated. The result is that certain harmonics are loud, and others are softer. This vocal tract spectrum depends on the phoneme being said or sung, and is more or less unchanged when frequency (pitch) increases or decreases.
Combining these two spectra (glottal source and vocal tract) gives the resulting spectrum, in which the listener can determine both the phoneme (what is said) and the pitch (sung note). In the time domain this combination is a convolution of the two signals; in the frequency domain it is a multiplication of the two spectra.
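A minimal sketch of this combination, working in the frequency domain, where it amounts to multiplying the source comb by the vocal tract curve at each harmonic. The vocal tract curve below is a made-up one with two resonances near 700 Hz and 1200 Hz, and the glottal source is assumed to have the same amplitude on every harmonic; a real source is not flat.

```python
import numpy as np

def toy_vocal_tract(freq_hz):
    """Hypothetical vocal tract curve: resonances near 700 Hz and 1200 Hz."""
    first = np.exp(-0.5 * ((freq_hz - 700.0) / 120.0) ** 2)
    second = 0.6 * np.exp(-0.5 * ((freq_hz - 1200.0) / 150.0) ** 2)
    return first + second

def voiced_spectrum(f0, max_freq=4000.0):
    """Harmonic frequencies and their amplitudes after the vocal tract."""
    harmonics = f0 * np.arange(1, int(max_freq // f0) + 1)
    source = np.ones(len(harmonics))  # assumed flat glottal comb
    return harmonics, source * toy_vocal_tract(harmonics)

for f0 in (100.0, 150.0):
    harmonics, amplitudes = voiced_spectrum(f0)
    loudest = harmonics[np.argmax(amplitudes)]
    print(f0, loudest)  # the loudest harmonic stays close to the 700 Hz resonance
```

The spacing of the harmonics gives the pitch, while the position of the loudest harmonics gives the phoneme, whatever the pitch.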

RealSinger basics

The aim of RealSinger is, for each phoneme of a given language, to apply a deconvolution to the recorded signal in order to separate the glottal source and vocal tract spectra. It then stores only the vocal tract spectrum, and later applies a generated glottal source to this spectrum to simulate the original recorded phoneme being sung at any pitch.

Learning process

  • The speaker is asked to pronounce a word for each phoneme of the chosen language.
  • Each word is recorded as regular sound data.
  • Then, the phoneme is isolated within the word and the signal is cropped to keep only this part.
  • An average frequency spectrum of the sound is computed.
  • This spectrum is deconvolved to remove the glottal source influence, and keep only the vocal tract resonator frequency curve.
  • This pseudo-spectrum is stored (less than 100 floating-point values for each phoneme).
  • For time-varying phonemes like plosives, several pseudo-spectra are stored to keep information about changes in the spectrum.
This algorithm enables RealSinger to store only a few values for each phoneme, which means very short voice files (less than 40 Kb once compressed).
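The steps above can be sketched on a synthetic "recording". This is not RealSinger's actual algorithm, which is not described here. The sketch assumes that the pitch of the recording is known and that the glottal source is flat, so that removing the source reduces to reading the amplitude of each harmonic; a single spectrum of the whole sound stands in for the averaged spectrum. The curve shape, the 64-value size and all names are assumptions for the example.

```python
import numpy as np

SAMPLE_RATE = 16000
N_POINTS = 64  # assumed size; the text only states "less than 100" values

def true_envelope(freq_hz):
    """Hypothetical vocal tract curve with one resonance near 700 Hz."""
    return np.exp(-0.5 * ((freq_hz - 700.0) / 200.0) ** 2) + 0.05

def record_toy_phoneme(f0, duration=0.5):
    """Stand-in for a recorded, cropped phoneme sung at pitch f0."""
    t = np.arange(int(SAMPLE_RATE * duration)) / SAMPLE_RATE
    harmonics = f0 * np.arange(1, int(4000 // f0) + 1)
    return sum(true_envelope(h) * np.sin(2 * np.pi * h * t) for h in harmonics)

def learn_pseudo_spectrum(signal, f0):
    """Reduce the sound to a short curve sampled on a fixed frequency grid."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
    harmonics = f0 * np.arange(1, int(4000 // f0) + 1)
    # With a flat source, the curve is simply the amplitude found at each harmonic
    amps = np.array([spectrum[np.abs(freqs - h) < f0 / 2].max() for h in harmonics])
    amps /= amps.max()
    grid = np.linspace(0.0, 4000.0, N_POINTS)
    return grid, np.interp(grid, harmonics, amps)

grid, pseudo_spectrum = learn_pseudo_spectrum(record_toy_phoneme(100.0), 100.0)
print(len(pseudo_spectrum))              # number of values kept for this phoneme
print(grid[np.argmax(pseudo_spectrum)])  # resonance found near 700 Hz
```

The half second of sound (8000 samples) is reduced to 64 values, which gives an idea of why voice files built this way can stay small.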

Generating the voice

  • For each phoneme to be sung, the matching pseudo-spectrum is extracted. In transitional sections between two phonemes, both pseudo-spectra are distorted, then merged together, to simulate the coarticulation process.
  • A synthetic glottal source is generated, at the required pitch. The glottal source spectrum can be easily modified to change the overall voice timbre (for equalization or for applying various vocoder effects).
  • This source is re-convolved with the phoneme pseudo-spectrum.
  • The spectrum is then processed by a phase-free inverse transform to generate regular sound data.
This algorithm simulates coarticulation effects. Therefore it is not necessary to record the whole set of diphonemes or triphonemes. Only pure phonemes are required.
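The generation steps can be sketched as follows. Two techniques are swapped in here and are not taken from RealSinger: a plain linear crossfade replaces the "distorted, then merged" transition, and a sum of sinusoids (additive synthesis) replaces the phase-free inverse transform. The two curves, the flat glottal source and all names are assumptions for the example.

```python
import numpy as np

SAMPLE_RATE = 16000

def peak_frequency(signal):
    """Frequency (Hz) of the strongest spectral component."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
    return freqs[np.argmax(spectrum)]

def synthesize(env_a, env_b, grid, f0, duration):
    """Sing a transition from phoneme A to phoneme B at pitch f0."""
    n = int(SAMPLE_RATE * duration)
    t = np.arange(n) / SAMPLE_RATE
    mix = np.linspace(0.0, 1.0, n)  # 0 = pure phoneme A, 1 = pure phoneme B
    harmonics = f0 * np.arange(1, int(grid[-1] // f0) + 1)
    out = np.zeros(n)
    for h in harmonics:
        # Flat synthetic source: each harmonic takes its amplitude from the curves
        amp_a = np.interp(h, grid, env_a)
        amp_b = np.interp(h, grid, env_b)
        out += ((1.0 - mix) * amp_a + mix * amp_b) * np.sin(2 * np.pi * h * t)
    return out

grid = np.linspace(0.0, 4000.0, 64)
env_a = np.exp(-0.5 * ((grid - 800.0) / 200.0) ** 2)   # hypothetical phoneme A
env_b = np.exp(-0.5 * ((grid - 2000.0) / 200.0) ** 2)  # hypothetical phoneme B

voice = synthesize(env_a, env_b, grid, f0=200.0, duration=0.5)
print(peak_frequency(voice[:2000]))   # start of the sound: resonance of A
print(peak_frequency(voice[-2000:]))  # end of the sound: resonance of B
```

Because every sinusoid runs continuously over the whole sound, there are no frame boundaries to join, and changing f0 moves the harmonics without moving the resonances.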

(c) Myriad 2012 - All rights reserved